summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2024-10-24selftests/bpf: fix test_spin_lock_fail.c's global vars usageAndrii Nakryiko1-2/+2
Global variables of special types (like `struct bpf_spin_lock`) make underlying ARRAY maps non-mmapable. To make this work with libbpf's mmaping logic, application is expected to declare such special variables as static, so libbpf doesn't even attempt to mmap() such ARRAYs. test_spin_lock_fail.c didn't follow this rule, but given it relied on this test to trigger failures, this went unnoticed, as we never got to the step of mmap()'ing these ARRAY maps. It is fragile and relies on specific sequence of libbpf steps, which are an internal implementation details. Fix the test by marking lockA and lockB as static. Fixes: c48748aea4f8 ("selftests/bpf: Add failure test cases for spin lock pairing") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241023043908.3834423-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-24Merge branch 'fix-wmaybe-uninitialized-warnings-errors'Andrii Nakryiko3-5/+5
Eder Zulian says: ==================== Fix -Wmaybe-uninitialized warnings/errors Hello! This v2 series initializes the variables 'set' and 'set8' in sets_patch to NULL, along with the variables 'new_off' and 'pad_bits' and 'pad_type' in btf_dump_emit_bit_padding to zero or NULL according to their types and the variable 'o' in options__order to NULL to prevent compiler warnings/errors which are observed when compiling with non-default compilation options, but are not emitted by the compiler with the current default compilation options. - tools/bpf/resolve_btfids/main.c: Initialize the variables 'set' and 'set8' in sets_patch to NULL. - tools/lib/bpf/btf_dump.c: Initialize the variables 'new_off' and 'pad_bits' and 'pad_type' in btf_dump_emit_bit_padding to zero/NULL - tools/lib/subcmd/parse-options.c: Initialize the variable 'o' in options__order to NULL. Sam James mentioned that Michael Weiß had previously sent an alternative patch as https://lore.kernel.org/all/20240731085217.94928-1-michael.weiss@aisec.fraunhofer.de/ Tested on x86_64 with clang version 17.0.6 and gcc (GCC) 13.3.1. $ for c in gcc clang; do for o in fast g s z $(seq 0 3); do make -C \ tools/bpf/resolve_btfids/ HOST_CC=${c} "HOSTCFLAGS=-O${o} -Wall" \ clean all 2>&1 | tee ${c}-O${o}.out; done; done && \ grep 'warning:\|error:' *.out [...] clang-O1.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized] clang-O1.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized] clang-O2.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized] clang-O2.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized] clang-O3.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized] clang-O3.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized] clang-Ofast.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized] clang-Ofast.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized] clang-Og.out:btf_dump.c:903:42: error: ‘new_off’ may be used uninitialized [-Werror=maybe-uninitialized] clang-Og.out:btf_dump.c:917:25: error: ‘pad_type’ may be used uninitialized [-Werror=maybe-uninitialized] clang-Og.out:btf_dump.c:930:20: error: ‘pad_bits’ may be used uninitialized [-Werror=maybe-uninitialized] clang-Os.out:parse-options.c:832:9: error: ‘o’ may be used uninitialized [-Werror=maybe-uninitialized] clang-Oz.out:parse-options.c:832:9: error: ‘o’ may be used uninitialized [-Werror=maybe-uninitialized] gcc-O1.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized] gcc-O1.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized] gcc-O2.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized] gcc-O2.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized] gcc-O3.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized] gcc-O3.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized] gcc-Ofast.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized] gcc-Ofast.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized] gcc-Og.out:btf_dump.c:903:42: error: ‘new_off’ may be used uninitialized [-Werror=maybe-uninitialized] gcc-Og.out:btf_dump.c:917:25: error: ‘pad_type’ may be used uninitialized [-Werror=maybe-uninitialized] gcc-Og.out:btf_dump.c:930:20: error: ‘pad_bits’ may be used uninitialized [-Werror=maybe-uninitialized] gcc-Os.out:parse-options.c:832:9: error: ‘o’ may be used uninitialized [-Werror=maybe-uninitialized] gcc-Oz.out:parse-options.c:832:9: error: ‘o’ may be used uninitialized [-Werror=maybe-uninitialized] The above warnings and/or errors are fixed. However, they are observed with current default compilation options. Updates since v1: - Incorporate feedback from reviewers. Add a comment about an alternative patch for parse-options.c sent before (based on comments from Sam James.) Split in multiple patches creating this series and a typo was fixed "Initiazlide" -> "Initialize" (suggested by Viktor Malik). State more clearly that the -Wmaybe-uninitialized issues only happen when compiling with non-default compilation options (based on comments from Yonghong Song.) Thanks, ==================== Link: https://lore.kernel.org/r/20241022172329.3871958-1-ezulian@redhat.com Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-10-24libsubcmd: Silence compiler warningEder Zulian1-1/+1
Initialize the pointer 'o' in options__order to NULL to prevent a compiler warning/error which is observed when compiling with the '-Og' option, but is not emitted by the compiler with the current default compilation options. For example, when compiling libsubcmd with $ make "EXTRA_CFLAGS=-Og" -C tools/lib/subcmd/ clean all Clang version 17.0.6 and GCC 13.3.1 fail to compile parse-options.c due to following error: parse-options.c: In function ‘options__order’: parse-options.c:832:9: error: ‘o’ may be used uninitialized [-Werror=maybe-uninitialized] 832 | memcpy(&ordered[nr_opts], o, sizeof(*o)); | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ parse-options.c:810:30: note: ‘o’ was declared here 810 | const struct option *o, *p = opts; | ^ cc1: all warnings being treated as errors Signed-off-by: Eder Zulian <ezulian@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20241022172329.3871958-4-ezulian@redhat.com
2024-10-24libbpf: Prevent compiler warnings/errorsEder Zulian1-2/+2
Initialize 'new_off' and 'pad_bits' to 0 and 'pad_type' to NULL in btf_dump_emit_bit_padding to prevent compiler warnings/errors which are observed when compiling with 'EXTRA_CFLAGS=-g -Og' options, but do not happen when compiling with current default options. For example, when compiling libbpf with $ make "EXTRA_CFLAGS=-g -Og" -C tools/lib/bpf/ clean all Clang version 17.0.6 and GCC 13.3.1 fail to compile btf_dump.c due to following errors: btf_dump.c: In function ‘btf_dump_emit_bit_padding’: btf_dump.c:903:42: error: ‘new_off’ may be used uninitialized [-Werror=maybe-uninitialized] 903 | if (new_off > cur_off && new_off <= next_off) { | ~~~~~~~~^~~~~~~~~~~ btf_dump.c:870:13: note: ‘new_off’ was declared here 870 | int new_off, pad_bits, bits, i; | ^~~~~~~ btf_dump.c:917:25: error: ‘pad_type’ may be used uninitialized [-Werror=maybe-uninitialized] 917 | btf_dump_printf(d, "\n%s%s: %d;", pfx(lvl), pad_type, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 918 | in_bitfield ? new_off - cur_off : 0); | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ btf_dump.c:871:21: note: ‘pad_type’ was declared here 871 | const char *pad_type; | ^~~~~~~~ btf_dump.c:930:20: error: ‘pad_bits’ may be used uninitialized [-Werror=maybe-uninitialized] 930 | if (bits == pad_bits) { | ^ btf_dump.c:870:22: note: ‘pad_bits’ was declared here 870 | int new_off, pad_bits, bits, i; | ^~~~~~~~ cc1: all warnings being treated as errors Signed-off-by: Eder Zulian <ezulian@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20241022172329.3871958-3-ezulian@redhat.com
2024-10-24resolve_btfids: Fix compiler warningsEder Zulian1-2/+2
Initialize 'set' and 'set8' pointers to NULL in sets_patch to prevent possible compiler warnings which are issued for various optimization levels, but do not happen when compiling with current default compilation options. For example, when compiling resolve_btfids with $ make "HOSTCFLAGS=-O2 -Wall" -C tools/bpf/resolve_btfids/ clean all Clang version 17.0.6 and GCC 13.3.1 issue following -Wmaybe-uninitialized warnings for variables 'set8' and 'set': In function ‘sets_patch’, inlined from ‘symbols_patch’ at main.c:748:6, inlined from ‘main’ at main.c:823:6: main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized] 163 | eprintf(1, verbose, pr_fmt(fmt), ##__VA_ARGS__) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ main.c:729:17: note: in expansion of macro ‘pr_debug’ 729 | pr_debug("sorting addr %5lu: cnt %6d [%s]\n", | ^~~~~~~~ main.c: In function ‘main’: main.c:682:37: note: ‘set8’ was declared here 682 | struct btf_id_set8 *set8; | ^~~~ In function ‘sets_patch’, inlined from ‘symbols_patch’ at main.c:748:6, inlined from ‘main’ at main.c:823:6: main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized] 163 | eprintf(1, verbose, pr_fmt(fmt), ##__VA_ARGS__) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ main.c:729:17: note: in expansion of macro ‘pr_debug’ 729 | pr_debug("sorting addr %5lu: cnt %6d [%s]\n", | ^~~~~~~~ main.c: In function ‘main’: main.c:683:36: note: ‘set’ was declared here 683 | struct btf_id_set *set; | ^~~ Signed-off-by: Eder Zulian <ezulian@redhat.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20241022172329.3871958-2-ezulian@redhat.com
2024-10-24bpf,perf: Fix perf_event_detach_bpf_prog error handlingJiri Olsa1-2/+0
Peter reported that perf_event_detach_bpf_prog might skip to release the bpf program for -ENOENT error from bpf_prog_array_copy. This can't happen because bpf program is stored in perf event and is detached and released only when perf event is freed. Let's drop the -ENOENT check and make sure the bpf program is released in any case. Fixes: 170a7e3ea070 ("bpf: bpf_prog_array_copy() should return -ENOENT if exclude_prog not found") Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241023200352.3488610-1-jolsa@kernel.org Closes: https://lore.kernel.org/lkml/20241022111638.GC16066@noisy.programming.kicks-ass.net/
2024-10-23uprobe: Add support for session consumerJiri Olsa2-30/+139
This change allows the uprobe consumer to behave as session which means that 'handler' and 'ret_handler' callbacks are connected in a way that allows to: - control execution of 'ret_handler' from 'handler' callback - share data between 'handler' and 'ret_handler' callbacks The session concept fits to our common use case where we do filtering on entry uprobe and based on the result we decide to run the return uprobe (or not). It's also convenient to share the data between session callbacks. To achive this we are adding new return value the uprobe consumer can return from 'handler' callback: UPROBE_HANDLER_IGNORE - Ignore 'ret_handler' callback for this consumer. And store cookie and pass it to 'ret_handler' when consumer has both 'handler' and 'ret_handler' callbacks defined. We store shared data in the return_consumer object array as part of the return_instance object. This way the handle_uretprobe_chain can find related return_consumer and its shared data. We also store entry handler return value, for cases when there are multiple consumers on single uprobe and some of them are ignored and some of them not, in which case the return probe gets installed and we need to have a way to find out which consumer needs to be ignored. The tricky part is when consumer is registered 'after' the uprobe entry handler is hit. In such case this consumer's 'ret_handler' gets executed as well, but it won't have the proper data pointer set, so we can filter it out. Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241018202252.693462-3-jolsa@kernel.org
2024-10-23uprobe: Add data pointer to consumer handlersJiri Olsa5-11/+17
Adding data pointer to both entry and exit consumer handlers and all its users. The functionality itself is coming in following change. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20241018202252.693462-2-jolsa@kernel.org
2024-10-23selftests/bpf: Increase verifier log limit in veristatMykyta Yatsenko1-1/+31
The current default buffer size of 16MB allocated by veristat is no longer sufficient to hold the verifier logs of some production BPF programs. To address this issue, we need to increase the verifier log limit. Commit 7a9f5c65abcc ("bpf: increase verifier log limit") has already increased the supported buffer size by the kernel, but veristat users need to explicitly pass a log size argument to use the bigger log. This patch adds a function to detect the maximum verifier log size supported by the kernel and uses that by default in veristat. This ensures that veristat can handle larger verifier logs without requiring users to manually specify the log size. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241023155314.126255-1-mykyta.yatsenko5@gmail.com
2024-10-23Bluetooth: ISO: Fix UAF on iso_sock_timeoutLuiz Augusto von Dentz1-6/+12
conn->sk maybe have been unlinked/freed while waiting for iso_conn_lock so this checks if the conn->sk is still valid by checking if it part of iso_sk_list. Fixes: ccf74f2390d6 ("Bluetooth: Add BTPROTO_ISO socket type") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-10-23Bluetooth: SCO: Fix UAF on sco_sock_timeoutLuiz Augusto von Dentz3-6/+35
conn->sk maybe have been unlinked/freed while waiting for sco_conn_lock so this checks if the conn->sk is still valid by checking if it part of sco_sk_list. Reported-by: syzbot+4c0d0c4cde787116d465@syzkaller.appspotmail.com Tested-by: syzbot+4c0d0c4cde787116d465@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=4c0d0c4cde787116d465 Fixes: ba316be1b6a0 ("Bluetooth: schedule SCO timeouts with delayed_work") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-10-23Bluetooth: hci_core: Disable works on hci_unregister_devLuiz Augusto von Dentz2-12/+24
This make use of disable_work_* on hci_unregister_dev since the hci_dev is about to be freed new submissions are not disarable. Fixes: 0d151a103775 ("Bluetooth: hci_core: cancel all works upon hci_unregister_dev()") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2024-10-23LoongArch: KVM: Mark hrtimer to expire in hard interrupt contextHuacai Chen2-4/+5
Like commit 2c0d278f3293f ("KVM: LAPIC: Mark hrtimer to expire in hard interrupt context") and commit 9090825fa9974 ("KVM: arm/arm64: Let the timer expire in hardirq context on RT"), On PREEMPT_RT enabled kernels unmarked hrtimers are moved into soft interrupt expiry mode by default. Then the timers are canceled from an preempt-notifier which is invoked with disabled preemption which is not allowed on PREEMPT_RT. The timer callback is short so in could be invoked in hard-IRQ context. So let the timer expire on hard-IRQ context even on -RT. This fix a "scheduling while atomic" bug for PREEMPT_RT enabled kernels: BUG: scheduling while atomic: qemu-system-loo/1011/0x00000002 Modules linked in: amdgpu rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ns CPU: 1 UID: 0 PID: 1011 Comm: qemu-system-loo Tainted: G W 6.12.0-rc2+ #1774 Tainted: [W]=WARN Hardware name: Loongson Loongson-3A5000-7A1000-1w-CRB/Loongson-LS3A5000-7A1000-1w-CRB, BIOS vUDK2018-LoongArch-V2.0.0-prebeta9 10/21/2022 Stack : ffffffffffffffff 0000000000000000 9000000004e3ea38 9000000116744000 90000001167475a0 0000000000000000 90000001167475a8 9000000005644830 90000000058dc000 90000000058dbff8 9000000116747420 0000000000000001 0000000000000001 6a613fc938313980 000000000790c000 90000001001c1140 00000000000003fe 0000000000000001 000000000000000d 0000000000000003 0000000000000030 00000000000003f3 000000000790c000 9000000116747830 90000000057ef000 0000000000000000 9000000005644830 0000000000000004 0000000000000000 90000000057f4b58 0000000000000001 9000000116747868 900000000451b600 9000000005644830 9000000003a13998 0000000010000020 00000000000000b0 0000000000000004 0000000000000000 0000000000071c1d ... Call Trace: [<9000000003a13998>] show_stack+0x38/0x180 [<9000000004e3ea34>] dump_stack_lvl+0x84/0xc0 [<9000000003a71708>] __schedule_bug+0x48/0x60 [<9000000004e45734>] __schedule+0x1114/0x1660 [<9000000004e46040>] schedule_rtlock+0x20/0x60 [<9000000004e4e330>] rtlock_slowlock_locked+0x3f0/0x10a0 [<9000000004e4f038>] rt_spin_lock+0x58/0x80 [<9000000003b02d68>] hrtimer_cancel_wait_running+0x68/0xc0 [<9000000003b02e30>] hrtimer_cancel+0x70/0x80 [<ffff80000235eb70>] kvm_restore_timer+0x50/0x1a0 [kvm] [<ffff8000023616c8>] kvm_arch_vcpu_load+0x68/0x2a0 [kvm] [<ffff80000234c2d4>] kvm_sched_in+0x34/0x60 [kvm] [<9000000003a749a0>] finish_task_switch.isra.0+0x140/0x2e0 [<9000000004e44a70>] __schedule+0x450/0x1660 [<9000000004e45cb0>] schedule+0x30/0x180 [<ffff800002354c70>] kvm_vcpu_block+0x70/0x120 [kvm] [<ffff800002354d80>] kvm_vcpu_halt+0x60/0x3e0 [kvm] [<ffff80000235b194>] kvm_handle_gspr+0x3f4/0x4e0 [kvm] [<ffff80000235f548>] kvm_handle_exit+0x1c8/0x260 [kvm] Reviewed-by: Bibo Mao <maobibo@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-10-23LoongArch: Make KASAN usable for variable cpu_vabitsHuacai Chen1-1/+1
Currently, KASAN on LoongArch assume the CPU VA bits is 48, which is true for Loongson-3 series, but not for Loongson-2 series (only 40 or lower), this patch fix that issue and make KASAN usable for variable cpu_vabits. Solution is very simple: Just define XRANGE_SHADOW_SHIFT which means valid address length from VA_BITS to min(cpu_vabits, VA_BITS). Cc: stable@vger.kernel.org Signed-off-by: Kanglong Wang <wangkanglong@loongson.cn> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2024-10-23posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime()Jinjie Ruan1-3/+3
If get_clock_desc() succeeds, it calls fget() for the clockid's fd, and get the clk->rwsem read lock, so the error path should release the lock to make the lock balance and fput the clockid's fd to make the refcount balance and release the fd related resource. However the below commit left the error path locked behind resulting in unbalanced locking. Check timespec64_valid_strict() before get_clock_desc() to fix it, because the "ts" is not changed after that. Fixes: d8794ac20a29 ("posix-clock: Fix missing timespec64 check in pc_clock_settime()") Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Acked-by: Anna-Maria Behnsen <anna-maria@linutronix.de> [pabeni@redhat.com: fixed commit message typo] Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23r8169: avoid unsolicited interruptsHeiner Kallweit1-1/+3
It was reported that after resume from suspend a PCI error is logged and connectivity is broken. Error message is: PCI error (cmd = 0x0407, status_errs = 0x0000) The message seems to be a red herring as none of the error bits is set, and the PCI command register value also is normal. Exception handling for a PCI error includes a chip reset what apparently brakes connectivity here. The interrupt status bit triggering the PCI error handling isn't actually used on PCIe chip versions, so it's not clear why this bit is set by the chip. Fix this by ignoring this bit on PCIe chip versions. Fixes: 0e4851502f84 ("r8169: merge with version 8.001.00 of Realtek's r8168 driver") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219388 Tested-by: Atlas Yu <atlas.yu@canonical.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/78e2f535-438f-4212-ad94-a77637ac6c9c@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23net: sched: use RCU read-side critical section in taprio_dump()Dmitry Antipov1-6/+12
Fix possible use-after-free in 'taprio_dump()' by adding RCU read-side critical section there. Never seen on x86 but found on a KASAN-enabled arm64 system when investigating https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa: [T15862] BUG: KASAN: slab-use-after-free in taprio_dump+0xa0c/0xbb0 [T15862] Read of size 4 at addr ffff0000d4bb88f8 by task repro/15862 [T15862] [T15862] CPU: 0 UID: 0 PID: 15862 Comm: repro Not tainted 6.11.0-rc1-00293-gdefaf1a2113a-dirty #2 [T15862] Hardware name: QEMU QEMU Virtual Machine, BIOS edk2-20240524-5.fc40 05/24/2024 [T15862] Call trace: [T15862] dump_backtrace+0x20c/0x220 [T15862] show_stack+0x2c/0x40 [T15862] dump_stack_lvl+0xf8/0x174 [T15862] print_report+0x170/0x4d8 [T15862] kasan_report+0xb8/0x1d4 [T15862] __asan_report_load4_noabort+0x20/0x2c [T15862] taprio_dump+0xa0c/0xbb0 [T15862] tc_fill_qdisc+0x540/0x1020 [T15862] qdisc_notify.isra.0+0x330/0x3a0 [T15862] tc_modify_qdisc+0x7b8/0x1838 [T15862] rtnetlink_rcv_msg+0x3c8/0xc20 [T15862] netlink_rcv_skb+0x1f8/0x3d4 [T15862] rtnetlink_rcv+0x28/0x40 [T15862] netlink_unicast+0x51c/0x790 [T15862] netlink_sendmsg+0x79c/0xc20 [T15862] __sock_sendmsg+0xe0/0x1a0 [T15862] ____sys_sendmsg+0x6c0/0x840 [T15862] ___sys_sendmsg+0x1ac/0x1f0 [T15862] __sys_sendmsg+0x110/0x1d0 [T15862] __arm64_sys_sendmsg+0x74/0xb0 [T15862] invoke_syscall+0x88/0x2e0 [T15862] el0_svc_common.constprop.0+0xe4/0x2a0 [T15862] do_el0_svc+0x44/0x60 [T15862] el0_svc+0x50/0x184 [T15862] el0t_64_sync_handler+0x120/0x12c [T15862] el0t_64_sync+0x190/0x194 [T15862] [T15862] Allocated by task 15857: [T15862] kasan_save_stack+0x3c/0x70 [T15862] kasan_save_track+0x20/0x3c [T15862] kasan_save_alloc_info+0x40/0x60 [T15862] __kasan_kmalloc+0xd4/0xe0 [T15862] __kmalloc_cache_noprof+0x194/0x334 [T15862] taprio_change+0x45c/0x2fe0 [T15862] tc_modify_qdisc+0x6a8/0x1838 [T15862] rtnetlink_rcv_msg+0x3c8/0xc20 [T15862] netlink_rcv_skb+0x1f8/0x3d4 [T15862] rtnetlink_rcv+0x28/0x40 [T15862] netlink_unicast+0x51c/0x790 [T15862] netlink_sendmsg+0x79c/0xc20 [T15862] __sock_sendmsg+0xe0/0x1a0 [T15862] ____sys_sendmsg+0x6c0/0x840 [T15862] ___sys_sendmsg+0x1ac/0x1f0 [T15862] __sys_sendmsg+0x110/0x1d0 [T15862] __arm64_sys_sendmsg+0x74/0xb0 [T15862] invoke_syscall+0x88/0x2e0 [T15862] el0_svc_common.constprop.0+0xe4/0x2a0 [T15862] do_el0_svc+0x44/0x60 [T15862] el0_svc+0x50/0x184 [T15862] el0t_64_sync_handler+0x120/0x12c [T15862] el0t_64_sync+0x190/0x194 [T15862] [T15862] Freed by task 6192: [T15862] kasan_save_stack+0x3c/0x70 [T15862] kasan_save_track+0x20/0x3c [T15862] kasan_save_free_info+0x4c/0x80 [T15862] poison_slab_object+0x110/0x160 [T15862] __kasan_slab_free+0x3c/0x74 [T15862] kfree+0x134/0x3c0 [T15862] taprio_free_sched_cb+0x18c/0x220 [T15862] rcu_core+0x920/0x1b7c [T15862] rcu_core_si+0x10/0x1c [T15862] handle_softirqs+0x2e8/0xd64 [T15862] __do_softirq+0x14/0x20 Fixes: 18cdd2f0998a ("net/sched: taprio: taprio_dump and taprio_change are protected by rtnl_mutex") Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Link: https://patch.msgid.link/20241018051339.418890-2-dmantipov@yandex.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23net: sched: fix use-after-free in taprio_change()Dmitry Antipov1-1/+2
In 'taprio_change()', 'admin' pointer may become dangling due to sched switch / removal caused by 'advance_sched()', and critical section protected by 'q->current_entry_lock' is too small to prevent from such a scenario (which causes use-after-free detected by KASAN). Fix this by prefer 'rcu_replace_pointer()' over 'rcu_assign_pointer()' to update 'admin' immediately before an attempt to schedule freeing. Fixes: a3d43c0d56f1 ("taprio: Add support adding an admin schedule") Reported-by: syzbot+b65e0af58423fc8a73aa@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b65e0af58423fc8a73aa Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Link: https://patch.msgid.link/20241018051339.418890-1-dmantipov@yandex.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23net/sched: act_api: deny mismatched skip_sw/skip_hw flags for actions ↵Vladimir Oltean1-1/+22
created by classifiers tcf_action_init() has logic for checking mismatches between action and filter offload flags (skip_sw/skip_hw). AFAIU, this is intended to run on the transition between the new tc_act_bind(flags) returning true (aka now gets bound to classifier) and tc_act_bind(act->tcfa_flags) returning false (aka action was not bound to classifier before). Otherwise, the check is skipped. For the case where an action is not standalone, but rather it was created by a classifier and is bound to it, tcf_action_init() skips the check entirely, and this means it allows mismatched flags to occur. Taking the matchall classifier code path as an example (with mirred as an action), the reason is the following: 1 | mall_change() 2 | -> mall_replace_hw_filter() 3 | -> tcf_exts_validate_ex() 4 | -> flags |= TCA_ACT_FLAGS_BIND; 5 | -> tcf_action_init() 6 | -> tcf_action_init_1() 7 | -> a_o->init() 8 | -> tcf_mirred_init() 9 | -> tcf_idr_create_from_flags() 10 | -> tcf_idr_create() 11 | -> p->tcfa_flags = flags; 12 | -> tc_act_bind(flags)) 13 | -> tc_act_bind(act->tcfa_flags) When invoked from tcf_exts_validate_ex() like matchall does (but other classifiers validate their extensions as well), tcf_action_init() runs in a call path where "flags" always contains TCA_ACT_FLAGS_BIND (set by line 4). So line 12 is always true, and line 13 is always true as well. No transition ever takes place, and the check is skipped. The code was added in this form in commit c86e0209dc77 ("flow_offload: validate flags of filter and actions"), but I'm attributing the blame even earlier in that series, to when TCA_ACT_FLAGS_SKIP_HW and TCA_ACT_FLAGS_SKIP_SW were added to the UAPI. Following the development process of this change, the check did not always exist in this form. A change took place between v3 [1] and v4 [2], AFAIU due to review feedback that it doesn't make sense for action flags to be different than classifier flags. I think I agree with that feedback, but it was translated into code that omits enforcing this for "classic" actions created at the same time with the filters themselves. There are 3 more important cases to discuss. First there is this command: $ tc qdisc add dev eth0 clasct $ tc filter add dev eth0 ingress matchall skip_sw \ action mirred ingress mirror dev eth1 which should be allowed, because prior to the concept of dedicated action flags, it used to work and it used to mean the action inherited the skip_sw/skip_hw flags from the classifier. It's not a mismatch. Then we have this command: $ tc qdisc add dev eth0 clasct $ tc filter add dev eth0 ingress matchall skip_sw \ action mirred ingress mirror dev eth1 skip_hw where there is a mismatch and it should be rejected. Finally, we have: $ tc qdisc add dev eth0 clasct $ tc filter add dev eth0 ingress matchall skip_sw \ action mirred ingress mirror dev eth1 skip_sw where the offload flags coincide, and this should be treated the same as the first command based on inheritance, and accepted. [1]: https://lore.kernel.org/netdev/20211028110646.13791-9-simon.horman@corigine.com/ [2]: https://lore.kernel.org/netdev/20211118130805.23897-10-simon.horman@corigine.com/ Fixes: 7adc57651211 ("flow_offload: add skip_hw and skip_sw to control if offload the action") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20241017161049.3570037-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23tracing: Consider the NULL character when validating the event lengthLeo Yan1-1/+1
strlen() returns a string length excluding the null byte. If the string length equals to the maximum buffer length, the buffer will have no space for the NULL terminating character. This commit checks this condition and returns failure for it. Link: https://lore.kernel.org/all/20241007144724.920954-1-leo.yan@arm.com/ Fixes: dec65d79fd26 ("tracing/probe: Check event name length correctly") Signed-off-by: Leo Yan <leo.yan@arm.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-10-23tracing/probes: Fix MAX_TRACE_ARGS limit handlingMikel Rychliski4-4/+19
When creating a trace_probe we would set nr_args prior to truncating the arguments to MAX_TRACE_ARGS. However, we would only initialize arguments up to the limit. This caused invalid memory access when attempting to set up probes with more than 128 fetchargs. BUG: kernel NULL pointer dereference, address: 0000000000000020 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 UID: 0 PID: 1769 Comm: cat Not tainted 6.11.0-rc7+ #8 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014 RIP: 0010:__set_print_fmt+0x134/0x330 Resolve the issue by applying the MAX_TRACE_ARGS limit earlier. Return an error when there are too many arguments instead of silently truncating. Link: https://lore.kernel.org/all/20240930202656.292869-1-mikel@mikelr.com/ Fixes: 035ba76014c0 ("tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init") Signed-off-by: Mikel Rychliski <mikel@mikelr.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-10-23selftests/bpf: Add test for passing in uninit mtu_lenDaniel Borkmann2-0/+37
Add a small test to pass an uninitialized mtu_len to the bpf_check_mtu() helper to probe whether the verifier rejects it under !CAP_PERFMON. # ./vmtest.sh -- ./test_progs -t verifier_mtu [...] ./test_progs -t verifier_mtu [ 1.414712] tsc: Refined TSC clocksource calibration: 3407.993 MHz [ 1.415327] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fcd52370, max_idle_ns: 440795242006 ns [ 1.416463] clocksource: Switched to clocksource tsc [ 1.429842] bpf_testmod: loading out-of-tree module taints kernel. [ 1.430283] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel #510/1 verifier_mtu/uninit/mtu: write rejected:OK #510 verifier_mtu:OK Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241021152809.33343-5-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-23selftests/bpf: Add test for writes to .rodataDaniel Borkmann1-1/+30
Add a small test to write a (verification-time) fixed vs unknown but bounded-sized buffer into .rodata BPF map and assert that both get rejected. # ./vmtest.sh -- ./test_progs -t verifier_const [...] ./test_progs -t verifier_const [ 1.418717] tsc: Refined TSC clocksource calibration: 3407.994 MHz [ 1.419113] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fcde90a1, max_idle_ns: 440795222066 ns [ 1.419972] clocksource: Switched to clocksource tsc [ 1.449596] bpf_testmod: loading out-of-tree module taints kernel. [ 1.449958] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel #475/1 verifier_const/rodata/strtol: write rejected:OK #475/2 verifier_const/bss/strtol: write accepted:OK #475/3 verifier_const/data/strtol: write accepted:OK #475/4 verifier_const/rodata/mtu: write rejected:OK #475/5 verifier_const/bss/mtu: write accepted:OK #475/6 verifier_const/data/mtu: write accepted:OK #475/7 verifier_const/rodata/mark: write with unknown reg rejected:OK #475/8 verifier_const/rodata/mark: write with unknown reg rejected:OK #475 verifier_const:OK #476/1 verifier_const_or/constant register |= constant should keep constant type:OK #476/2 verifier_const_or/constant register |= constant should not bypass stack boundary checks:OK #476/3 verifier_const_or/constant register |= constant register should keep constant type:OK #476/4 verifier_const_or/constant register |= constant register should not bypass stack boundary checks:OK #476 verifier_const_or:OK Summary: 2/12 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241021152809.33343-4-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-23bpf: Remove MEM_UNINIT from skb/xdp MTU helpersDaniel Borkmann1-27/+15
We can now undo parts of 4b3786a6c539 ("bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error") as discussed in [0]. Given the BPF helpers now have MEM_WRITE tag, the MEM_UNINIT can be cleared. The mtu_len is an input as well as output argument, meaning, the BPF program has to set it to something. It cannot be uninitialized. Therefore, allowing uninitialized memory and zeroing it on error would be odd. It was done as an interim step in 4b3786a6c539 as the desired behavior could not have been expressed before the introduction of MEM_WRITE tag. Fixes: 4b3786a6c539 ("bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/a86eb76d-f52f-dee4-e5d2-87e45de3e16f@iogearbox.net [0] Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241021152809.33343-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-23bpf: Fix overloading of MEM_UNINIT's meaningDaniel Borkmann1-38/+35
Lonial reported an issue in the BPF verifier where check_mem_size_reg() has the following code: if (!tnum_is_const(reg->var_off)) /* For unprivileged variable accesses, disable raw * mode so that the program is required to * initialize all the memory that the helper could * just partially fill up. */ meta = NULL; This means that writes are not checked when the register containing the size of the passed buffer has not a fixed size. Through this bug, a BPF program can write to a map which is marked as read-only, for example, .rodata global maps. The problem is that MEM_UNINIT's initial meaning that "the passed buffer to the BPF helper does not need to be initialized" which was added back in commit 435faee1aae9 ("bpf, verifier: add ARG_PTR_TO_RAW_STACK type") got overloaded over time with "the passed buffer is being written to". The problem however is that checks such as the above which were added later via 06c1c049721a ("bpf: allow helpers access to variable memory") set meta to NULL in order force the user to always initialize the passed buffer to the helper. Due to the current double meaning of MEM_UNINIT, this bypasses verifier write checks to the memory (not boundary checks though) and only assumes the latter memory is read instead. Fix this by reverting MEM_UNINIT back to its original meaning, and having MEM_WRITE as an annotation to BPF helpers in order to then trigger the BPF verifier checks for writing to memory. Some notes: check_arg_pair_ok() ensures that for ARG_CONST_SIZE{,_OR_ZERO} we can access fn->arg_type[arg - 1] since it must contain a preceding ARG_PTR_TO_MEM. For check_mem_reg() the meta argument can be removed altogether since we do check both BPF_READ and BPF_WRITE. Same for the equivalent check_kfunc_mem_size_reg(). Fixes: 7b3552d3f9f6 ("bpf: Reject writes for PTR_TO_MAP_KEY in check_helper_mem_access") Fixes: 97e6d7dab1ca ("bpf: Check PTR_TO_MEM | MEM_RDONLY in check_helper_mem_access") Fixes: 15baa55ff5b0 ("bpf/verifier: allow all functions to read user provided context") Reported-by: Lonial Con <kongln9170@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241021152809.33343-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-23bpf: Add MEM_WRITE attributeDaniel Borkmann6-14/+22
Add a MEM_WRITE attribute for BPF helper functions which can be used in bpf_func_proto to annotate an argument type in order to let the verifier know that the helper writes into the memory passed as an argument. In the past MEM_UNINIT has been (ab)used for this function, but the latter merely tells the verifier that the passed memory can be uninitialized. There have been bugs with overloading the latter but aside from that there are also cases where the passed memory is read + written which currently cannot be expressed, see also 4b3786a6c539 ("bpf: Zero former ARG_PTR_TO_{LONG,INT} args in case of error"). Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20241021152809.33343-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-22Merge branch 'Retire test_sock.c'Martin KaFai Lau5-205/+148
Jordan Rife says: ==================== This patch series migrates test cases out of test_sock.c to prog_tests-style tests. It moves all BPF_CGROUP_INET4_POST_BIND and BPF_CGROUP_INET6_POST_BIND test cases into a new prog_test, sock_post_bind.c, while reimplementing all LOAD_REJECT test cases as verifier tests in progs/verifier_sock.c. Finally, it moves remaining BPF_CGROUP_INET_SOCK_CREATE test coverage into prog_tests/sock_create.c before retiring test_sock.c completely. Changes ======= v1->v2: - Remove superfluous verbose bool from the top of sock_post_bind.c. - Use ASSERT_OK_FD instead of ASSERT_GE to test cgroup_fd validity. - Run sock_post_bind tests in their own namespace, "sock_post_bind". ==================== Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-22selftests/bpf: Retire test_sock.cJordan Rife3-234/+1
Completely remove test_sock.c and associated config. Signed-off-by: Jordan Rife <jrife@google.com> Link: https://lore.kernel.org/r/20241022152913.574836-5-jrife@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-22selftests/bpf: Migrate BPF_CGROUP_INET_SOCK_CREATE test cases to prog_testsJordan Rife2-38/+25
Move the "load w/o expected_attach_type" test case to prog_tests/sock_create.c and drop the remaining test case, as it is made redundant with the existing coverage inside prog_tests/sock_create.c. Signed-off-by: Jordan Rife <jrife@google.com> Link: https://lore.kernel.org/r/20241022152913.574836-4-jrife@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-22selftests/bpf: Migrate LOAD_REJECT test cases to prog_testsJordan Rife2-52/+60
Move LOAD_REJECT test cases from test_sock.c to an equivalent set of verifier tests in progs/verifier_sock.c. Signed-off-by: Jordan Rife <jrife@google.com> Link: https://lore.kernel.org/r/20241022152913.574836-3-jrife@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-22selftests/bpf: Migrate *_POST_BIND test cases to prog_testsJordan Rife2-245/+426
Move all BPF_CGROUP_INET6_POST_BIND and BPF_CGROUP_INET4_POST_BIND test cases to a new prog_test, prog_tests/sock_post_bind.c, except for LOAD_REJECT test cases. Signed-off-by: Jordan Rife <jrife@google.com> Link: https://lore.kernel.org/r/20241022152913.574836-2-jrife@google.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2024-10-22bpf: Preserve param->string when parsing mount optionsHou Tao1-2/+3
In bpf_parse_param(), keep the value of param->string intact so it can be freed later. Otherwise, the kmalloc area pointed to by param->string will be leaked as shown below: unreferenced object 0xffff888118c46d20 (size 8): comm "new_name", pid 12109, jiffies 4295580214 hex dump (first 8 bytes): 61 6e 79 00 38 c9 5c 7e any.8.\~ backtrace (crc e1b7f876): [<00000000c6848ac7>] kmemleak_alloc+0x4b/0x80 [<00000000de9f7d00>] __kmalloc_node_track_caller_noprof+0x36e/0x4a0 [<000000003e29b886>] memdup_user+0x32/0xa0 [<0000000007248326>] strndup_user+0x46/0x60 [<0000000035b3dd29>] __x64_sys_fsconfig+0x368/0x3d0 [<0000000018657927>] x64_sys_call+0xff/0x9f0 [<00000000c0cabc95>] do_syscall_64+0x3b/0xc0 [<000000002f331597>] entry_SYSCALL_64_after_hwframe+0x4b/0x53 Fixes: 6c1752e0b6ca ("bpf: Support symbolic BPF FS delegation mount options") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20241022130133.3798232-1-houtao@huaweicloud.com
2024-10-22jfs: Fix sanity check in dbMountDave Kleikamp1-1/+1
MAXAG is a legitimate value for bmp->db_numag Fixes: e63866a47556 ("jfs: fix out-of-bounds in dbNextAG() and diAlloc()") Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2024-10-22btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()Yue Haibing2-7/+4
The ret may be zero in btrfs_search_dir_index_item() and should not passed to ERR_PTR(). Now btrfs_unlink_subvol() is the only caller to this, reconstructed it to check ERR_PTR(-ENOENT) while ret >= 0. This fixes smatch warnings: fs/btrfs/dir-item.c:353 btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR' Fixes: 9dcbe16fccbb ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22btrfs: reject ro->rw reconfiguration if there are hard ro requirementsQu Wenruo1-2/+1
[BUG] Syzbot reports the following crash: BTRFS info (device loop0 state MCS): disabling free space tree BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1) BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2) Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:backup_super_roots fs/btrfs/disk-io.c:1691 [inline] RIP: 0010:write_all_supers+0x97a/0x40f0 fs/btrfs/disk-io.c:4041 Call Trace: <TASK> btrfs_commit_transaction+0x1eae/0x3740 fs/btrfs/transaction.c:2530 btrfs_delete_free_space_tree+0x383/0x730 fs/btrfs/free-space-tree.c:1312 btrfs_start_pre_rw_mount+0xf28/0x1300 fs/btrfs/disk-io.c:3012 btrfs_remount_rw fs/btrfs/super.c:1309 [inline] btrfs_reconfigure+0xae6/0x2d40 fs/btrfs/super.c:1534 btrfs_reconfigure_for_mount fs/btrfs/super.c:2020 [inline] btrfs_get_tree_subvol fs/btrfs/super.c:2079 [inline] btrfs_get_tree+0x918/0x1920 fs/btrfs/super.c:2115 vfs_get_tree+0x90/0x2b0 fs/super.c:1800 do_new_mount+0x2be/0xb40 fs/namespace.c:3472 do_mount fs/namespace.c:3812 [inline] __do_sys_mount fs/namespace.c:4020 [inline] __se_sys_mount+0x2d6/0x3c0 fs/namespace.c:3997 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [CAUSE] To support mounting different subvolume with different RO/RW flags for the new mount APIs, btrfs introduced two workaround to support this feature: - Skip mount option/feature checks if we are mounting a different subvolume - Reconfigure the fs to RW if the initial mount is RO Combining these two, we can have the following sequence: - Mount the fs ro,rescue=all,clear_cache,space_cache=v1 rescue=all will mark the fs as hard read-only, so no v2 cache clearing will happen. - Mount a subvolume rw of the same fs. We go into btrfs_get_tree_subvol(), but fc_mount() returns EBUSY because our new fc is RW, different from the original fs. Now we enter btrfs_reconfigure_for_mount(), which switches the RO flag first so that we can grab the existing fs_info. Then we reconfigure the fs to RW. - During reconfiguration, option/features check is skipped This means we will restart the v2 cache clearing, and convert back to v1 cache. This will trigger fs writes, and since the original fs has "rescue=all" option, it skips the csum tree read. And eventually causing NULL pointer dereference in super block writeback. [FIX] For reconfiguration caused by different subvolume RO/RW flags, ensure we always run btrfs_check_options() to ensure we have proper hard RO requirements met. In fact the function btrfs_check_options() doesn't really do many complex checks, but hard RO requirement and some feature dependency checks, thus there is no special reason not to do the check for mount reconfiguration. Reported-by: syzbot+56360f93efa90ff15870@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/0000000000008c5d090621cb2770@google.com/ Fixes: f044b318675f ("btrfs: handle the ro->rw transition for mounting different subvolumes") CC: stable@vger.kernel.org # 6.8+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22btrfs: fix read corruption due to race with extent map mergingBoris Burkov1-15/+16
In debugging some corrupt squashfs files, we observed symptoms of corrupt page cache pages but correct on-disk contents. Further investigation revealed that the exact symptom was a correct page followed by an incorrect, duplicate, page. This got us thinking about extent maps. commit ac05ca913e9f ("Btrfs: fix race between using extent maps and merging them") enforces a reference count on the primary `em` extent_map being merged, as that one gets modified. However, since, commit 3d2ac9922465 ("btrfs: introduce new members for extent_map") both 'em' and 'merge' get modified, which started modifying 'merge' and thus introduced the same race. We were able to reproduce this by looping the affected squashfs workload in parallel on a bunch of separate btrfs-es while also dropping caches. We are still working on a simple enough reproducer to make into an fstest. The simplest fix is to stop modifying 'merge', which is not essential, as it is dropped immediately after the merge. This behavior is simply a consequence of the order of the two extent maps being important in computing the new values. Modify merge_ondisk_extents to take prev and next by const* and also take a third merged parameter that it puts the results in. Note that this introduces the rather odd behavior of passing 'em' to merge_ondisk_extents as a const * and as a regular ptr. Fixes: 3d2ac9922465 ("btrfs: introduce new members for extent_map") CC: stable@vger.kernel.org # 6.11+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22btrfs: fix the delalloc range locking if sector size < page sizeQu Wenruo1-8/+9
Inside lock_delalloc_folios(), there are several problems related to sector size < page size handling: - Set the writer locks without checking if the folio is still valid We call btrfs_folio_start_writer_lock() just like it's folio_lock(). But since the folio may not even be the folio of the current mapping, we can easily screw up the folio->private. - The range is not clamped inside the page This means we can over write other bitmaps if the start/len is not properly handled, and trigger the btrfs_subpage_assert(). - @processed_end is always rounded up to page end If the delalloc range is not page aligned, and we need to retry (returning -EAGAIN), then we will unlock to the page end. Thankfully this is not a huge problem, as now btrfs_folio_end_writer_lock() can handle range larger than the locked range, and only unlock what is already locked. Fix all these problems by: - Lock and check the folio first, then call btrfs_folio_set_writer_lock() So that if we got a folio not belonging to the inode, we won't touch folio->private. - Properly truncate the range inside the page - Update @processed_end to the locked range end Fixes: 1e1de38792e0 ("btrfs: make process_one_page() to handle subpage locking") CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22btrfs: qgroup: set a more sane default value for subtree drop thresholdQu Wenruo3-2/+4
Since commit 011b46c30476 ("btrfs: skip subtree scan if it's too high to avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can automatically skip large subtree scan at the cost of marking qgroup inconsistent. It's designed to address the final performance problem of snapshot drop with qgroup enabled, but to be safe the default value is BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value to make it work. I'd say it's not a good idea to rely on user space tool to set this default value, especially when some operations (snapshot dropping) can be triggered immediately after mount, leaving a very small window to that that sysfs interface. So instead of disabling this new feature by default, enable it with a low threshold (3), so that large subvolume tree drop at mount time won't cause huge qgroup workload. CC: stable@vger.kernel.org # 6.1 Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22btrfs: clear force-compress on remount when compress mount option is givenFilipe Manana1-0/+9
After the migration to use fs context for processing mount options we had a slight change in the semantics for remounting a filesystem that was mounted with compress-force. Before we could clear compress-force by passing only "-o compress[=algo]" during a remount, but after that change that does not work anymore, force-compress is still present and one needs to pass "-o compress-force=no,compress[=algo]" to the mount command. Example, when running on a kernel 6.8+: $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/) $ mount -o remount,compress=zlib:5 /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/) On a 6.7 kernel (or older): $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/) $ mount -o remount,compress=zlib:5 /mnt/sdi $ mount | grep sdi /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/) So update btrfs_parse_param() to clear "compress-force" when "compress" is given, providing the same semantics as kernel 6.7 and older. Reported-by: Roman Mamedov <rm@romanrm.net> Link: https://lore.kernel.org/linux-btrfs/20241014182416.13d0f8b0@nvm/ CC: stable@vger.kernel.org # 6.8+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22net: usb: usbnet: fix name regressionOliver Neukum1-1/+2
The fix for MAC addresses broke detection of the naming convention because it gave network devices no random MAC before bind() was called. This means that the check for the local assignment bit was always negative as the address was zeroed from allocation, instead of from overwriting the MAC with a unique hardware address. The correct check for whether bind() has altered the MAC is done with is_zero_ether_addr Signed-off-by: Oliver Neukum <oneukum@suse.com> Reported-by: Greg Thelen <gthelen@google.com> Diagnosed-by: John Sperbeck <jsperbeck@google.com> Fixes: bab8eb0dd4cb9 ("usbnet: modern method to get random MAC") Link: https://patch.msgid.link/20241017071849.389636-1-oneukum@suse.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-22mlxsw: spectrum_router: fix xa_store() error checkingYuan Can1-6/+3
It is meant to use xa_err() to extract the error encoded in the return value of xa_store(). Fixes: 44c2fbebe18a ("mlxsw: spectrum_router: Share nexthop counters in resilient groups") Signed-off-by: Yuan Can <yuancan@huawei.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Tested-by: Petr Machata <petrm@nvidia.com> Link: https://patch.msgid.link/20241017023223.74180-1-yuancan@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-22Merge tag 'nf-24-10-21' of ↵Paolo Abeni4-2/+7
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== This patchset contains Netfilter fixes for net: 1) syzkaller managed to triger UaF due to missing reference on netns in bpf infrastructure, from Florian Westphal. 2) Fix incorrect conversion from NFPROTO_UNSPEC to NFPROTO_{IPV4,IPV6} in the following xtables targets: MARK and NFLOG. Moreover, add missing I have my half share in this mistake, I did not take the necessary time to review this: For several years I have been struggling to keep working on Netfilter, juggling a myriad of side consulting projects to stop burning my own savings. I have extended the iptables-tests.py test infrastructure to improve the coverage of ip6tables and detect similar problems in the future. This is a v2 including a extended PR with one more fix. netfilter pull request 24-10-21 * tag 'nf-24-10-21' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: xtables: fix typo causing some targets not to load on IPv6 netfilter: bpf: must hold reference on net namespace ==================== Link: https://patch.msgid.link/20241021094536.81487-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-22objpool: fix choosing allocation for percpu slotsViktor Malik1-1/+1
objpool intends to use vmalloc for default (non-atomic) allocations of percpu slots and objects. However, the condition checking if GFP flags set any bit of GFP_ATOMIC is wrong b/c GFP_ATOMIC is a combination of bits (__GFP_HIGH|__GFP_KSWAPD_RECLAIM) and so `pool->gfp & GFP_ATOMIC` will be true if either bit is set. Since GFP_ATOMIC and GFP_KERNEL share the ___GFP_KSWAPD_RECLAIM bit, kmalloc will be used in cases when GFP_KERNEL is specified, i.e. in all current usages of objpool. This may lead to unexpected OOM errors since kmalloc cannot allocate large amounts of memory. For instance, objpool is used by fprobe rethook which in turn is used by BPF kretprobe.multi and kprobe.session probe types. Trying to attach these to all kernel functions with libbpf using SEC("kprobe.session/*") int kprobe(struct pt_regs *ctx) { [...] } fails on objpool slot allocation with ENOMEM. Fix the condition to truly use vmalloc by default. Link: https://lore.kernel.org/all/20240826060718.267261-1-vmalik@redhat.com/ Fixes: b4edb8d2d464 ("lib: objpool added: ring-array based lockless MPMC") Signed-off-by: Viktor Malik <vmalik@redhat.com> Acked-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Matt Wu <wuqiang.matt@bytedance.com> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-10-22KVM: selftests: Fix build on on non-x86 architecturesMark Brown1-1/+3
Commit 9a400068a158 ("KVM: selftests: x86: Avoid using SSE/AVX instructions") unconditionally added -march=x86-64-v2 to the CFLAGS used to build the KVM selftests which does not work on non-x86 architectures: cc1: error: unknown value ‘x86-64-v2’ for ‘-march’ Fix this by making the addition of this x86 specific command line flag conditional on building for x86. Fixes: 9a400068a158 ("KVM: selftests: x86: Avoid using SSE/AVX instructions") Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-229p: fix slab cache name creation for realLinus Torvalds1-1/+3
This was attempted by using the dev_name in the slab cache name, but as Omar Sandoval pointed out, that can be an arbitrary string, eg something like "/dev/root". Which in turn trips verify_dirent_name(), which fails if a filename contains a slash. So just make it use a sequence counter, and make it an atomic_t to avoid any possible races or locking issues. Reported-and-tested-by: Omar Sandoval <osandov@fb.com> Link: https://lore.kernel.org/all/ZxafcO8KWMlXaeWE@telecaster.dhcp.thefacebook.com/ Fixes: 79efebae4afc ("9p: Avoid creating multiple slab caches with the same name") Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Thorsten Leemhuis <regressions@leemhuis.info> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-22Merge branch 'implement-mechanism-to-signal-other-threads'Andrii Nakryiko4-46/+176
Puranjay Mohan says: ==================== Implement mechanism to signal other threads This set implements a kfunc called bpf_send_signal_task() that is similar to sigqueue() as it can send a signal along with a cookie to a thread or thread group. The send_signal selftest has been updated to also test this new kfunc under all contexts. Changes in v5: v4: https://lore.kernel.org/all/20241008114940.44305-1-puranjay@kernel.org/ - Call copy_siginfo() only if work->has_siginfo is true in bpf_send_signal_common() - Add Acked-by: Andrii Nakryiko <andrii@kernel.org> Changes in v4: v3: https://lore.kernel.org/all/20241007103426.128923-1-puranjay@kernel.org/ - Fix the selftest to make it work for big-endian archs. - Fix a build warning on 32-bit archs. - Some style changes and code refactors suggested by Andrii Changes in v3: v2: https://lore.kernel.org/all/20240926115328.105634-1-puranjay@kernel.org/ - make the cookie u64 instead of int. - re use code from bpf_send_signal_common Changes in v2: v1: https://lore.kernel.org/bpf/20240724113944.75977-1-puranjay@kernel.org/ - Convert to a kfunc - Add mechanism to send a cookie with the signal. ==================== Link: https://lore.kernel.org/r/20241016084136.10305-1-puranjay@kernel.org Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2024-10-22selftests/bpf: Augment send_signal test with remote signalingPuranjay Mohan2-38/+130
Add testcases to test bpf_send_signal_task(). In these new test cases, the main process triggers the BPF program and the forked process receives the signals. The target process's signal handler receives a cookie from the bpf program. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241016084136.10305-3-puranjay@kernel.org
2024-10-22bpf: Implement bpf_send_signal_task() kfuncPuranjay Mohan2-8/+46
Implement bpf_send_signal_task kfunc that is similar to bpf_send_signal_thread and bpf_send_signal helpers but can be used to send signals to other threads and processes. It also supports sending a cookie with the signal similar to sigqueue(). If the receiving process establishes a handler for the signal using the SA_SIGINFO flag to sigaction(), then it can obtain this cookie via the si_value field of the siginfo_t structure passed as the second argument to the handler. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241016084136.10305-2-puranjay@kernel.org
2024-10-21Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds25-103/+277
Pull kvm fixes from Paolo Bonzini: "ARM64: - Fix the guest view of the ID registers, making the relevant fields writable from userspace (affecting ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1) - Correcly expose S1PIE to guests, fixing a regression introduced in 6.12-rc1 with the S1POE support - Fix the recycling of stage-2 shadow MMUs by tracking the context (are we allowed to block or not) as well as the recycling state - Address a couple of issues with the vgic when userspace misconfigures the emulation, resulting in various splats. Headaches courtesy of our Syzkaller friends - Stop wasting space in the HYP idmap, as we are dangerously close to the 4kB limit, and this has already exploded in -next - Fix another race in vgic_init() - Fix a UBSAN error when faking the cache topology with MTE enabled RISCV: - RISCV: KVM: use raw_spinlock for critical section in imsic x86: - A bandaid for lack of XCR0 setup in selftests, which causes trouble if the compiler is configured to have x86-64-v3 (with AVX) as the default ISA. Proper XCR0 setup will come in the next merge window. - Fix an issue where KVM would not ignore low bits of the nested CR3 and potentially leak up to 31 bytes out of the guest memory's bounds - Fix case in which an out-of-date cached value for the segments could by returned by KVM_GET_SREGS. - More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL - Override MTRR state for KVM confidential guests, making it WB by default as is already the case for Hyper-V guests. Generic: - Remove a couple of unused functions" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (27 commits) RISCV: KVM: use raw_spinlock for critical section in imsic KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookups KVM: selftests: x86: Avoid using SSE/AVX instructions KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset() KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range() KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot x86/kvm: Override default caching mode for SEV-SNP and TDX KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic KVM: Remove unused kvm_vcpu_gfn_to_pfn KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS KVM: arm64: Fix shift-out-of-bounds bug KVM: arm64: Shave a few bytes from the EL2 idmap code KVM: arm64: Don't eagerly teardown the vgic on init error KVM: arm64: Expose S1PIE to guests KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule KVM: arm64: nv: Punt stage-2 recycling to a vCPU request KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed ...
2024-10-21Merge tag 'probes-fixes-v6.12-rc4' of ↵Linus Torvalds1-3/+6
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull uprobe fix from Masami Hiramatsu: - uprobe: avoid out-of-bounds memory access of fetching args Uprobe trace events can cause out-of-bounds memory access when fetching user-space data which is bigger than one page, because it does not check the local CPU buffer size when reading the data. This checks the read data size and cut it down to the local CPU buffer size. * tag 'probes-fixes-v6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: uprobe: avoid out-of-bounds memory access of fetching args