Age | Commit message (Collapse) | Author | Files | Lines |
|
Previously when retrieving `bpf_link_info.perf_event` for
kprobe/uprobe/tracepoint, the `name_len` field was not populated by the
kernel, leaving it to reflect the value initially set by the user. This
behavior was inconsistent with how other input/output string buffer
fields function (e.g. `raw_tracepoint.tp_name_len`).
This patch fills `name_len` with the actual size of the string name.
Fixes: 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event")
Signed-off-by: Tyrone Wu <wudevelops@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20241008164312.46269-1-wudevelops@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The kzmalloc call in bpf_check can fail when memory is very fragmented,
which in turn can lead to an OOM kill.
Use kvzmalloc to fall back to vmalloc when memory is too fragmented to
allocate an order 3 sized bpf verifier environment.
Admittedly this is not a very common case, and only happens on systems
where memory has already been squeezed close to the limit, but this does
not seem like much of a hot path, and it's a simple enough fix.
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://lore.kernel.org/r/20241008170735.16766766@imladris.surriel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
When trying to repeat the btf fields for array of nested struct, it
doesn't check the remaining info_cnt. The following splat will be
reported when the value of ret * nelems is greater than BTF_FIELDS_MAX:
------------[ cut here ]------------
UBSAN: array-index-out-of-bounds in ../kernel/bpf/btf.c:3951:49
index 11 is out of range for type 'btf_field_info [11]'
CPU: 6 UID: 0 PID: 411 Comm: test_progs ...... 6.11.0-rc4+ #1
Tainted: [O]=OOT_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ...
Call Trace:
<TASK>
dump_stack_lvl+0x57/0x70
dump_stack+0x10/0x20
ubsan_epilogue+0x9/0x40
__ubsan_handle_out_of_bounds+0x6f/0x80
? kallsyms_lookup_name+0x48/0xb0
btf_parse_fields+0x992/0xce0
map_create+0x591/0x770
__sys_bpf+0x229/0x2410
__x64_sys_bpf+0x1f/0x30
x64_sys_call+0x199/0x9f0
do_syscall_64+0x3b/0xc0
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7fea56f2cc5d
......
</TASK>
---[ end trace ]---
Fix it by checking the remaining info_cnt in btf_repeat_fields() before
repeating the btf fields.
Fixes: 64e8ee814819 ("bpf: look into the types of the fields of a struct type recursively.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241008071114.3718177-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"12 hotfixes, 5 of which are c:stable. All singletons, about half of
which are MM"
* tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: zswap: delete comments for "value" member of 'struct zswap_entry'.
CREDITS: sort alphabetically by name
secretmem: disable memfd_secret() if arch cannot set direct map
.mailmap: update Fangrui's email
mm/huge_memory: check pmd_special() only after pmd_present()
resource, kunit: fix user-after-free in resource_test_region_intersects()
fs/proc/kcore.c: allow translation of physical memory addresses
selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map test
device-dax: correct pgoff align in dax_set_mapping()
kthread: unpark only parked kthread
Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"
bcachefs: do not use PF_MEMALLOC_NORECLAIM
|
|
In resource_test_insert_resource(), the pointer is used in error message
after kfree(). This is user-after-free. To fix this, we need to call
kunit_add_action_or_reset() to schedule memory freeing after usage. But
kunit_add_action_or_reset() itself may fail and free the memory. So, its
return value should be checked and abort the test for failure. Then, we
found that other usage of kunit_add_action_or_reset() in
resource_test_region_intersects() needs to be fixed too. We fix all these
user-after-free bugs in this patch.
Link: https://lkml.kernel.org/r/20240930070611.353338-1-ying.huang@intel.com
Fixes: 99185c10d5d9 ("resource, kunit: add test case for region_intersects()")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Kees Bakker <kees@ijzerbout.nl>
Closes: https://lore.kernel.org/lkml/87ldzaotcg.fsf@yhuang6-desk2.ccr.corp.intel.com/
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Calling into kthread unparking unconditionally is mostly harmless when
the kthread is already unparked. The wake up is then simply ignored
because the target is not in TASK_PARKED state.
However if the kthread is per CPU, the wake up is preceded by a call
to kthread_bind() which expects the task to be inactive and in
TASK_PARKED state, which obviously isn't the case if it is unparked.
As a result, calling kthread_stop() on an unparked per-cpu kthread
triggers such a warning:
WARNING: CPU: 0 PID: 11 at kernel/kthread.c:525 __kthread_bind_mask kernel/kthread.c:525
<TASK>
kthread_stop+0x17a/0x630 kernel/kthread.c:707
destroy_workqueue+0x136/0xc40 kernel/workqueue.c:5810
wg_destruct+0x1e2/0x2e0 drivers/net/wireguard/device.c:257
netdev_run_todo+0xe1a/0x1000 net/core/dev.c:10693
default_device_exit_batch+0xa14/0xa90 net/core/dev.c:11769
ops_exit_list net/core/net_namespace.c:178 [inline]
cleanup_net+0x89d/0xcc0 net/core/net_namespace.c:640
process_one_work kernel/workqueue.c:3231 [inline]
process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312
worker_thread+0x86d/0xd70 kernel/workqueue.c:3393
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
Fix this with skipping unecessary unparking while stopping a kthread.
Link: https://lkml.kernel.org/r/20240913214634.12557-1-frederic@kernel.org
Fixes: 5c25b5ff89f0 ("workqueue: Tag bound workers with KTHREAD_IS_PER_CPU")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com
Tested-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The boot mapped ring buffer has its buffer mapped at a fixed location
found at boot up. It is not dynamic. It cannot grow or be expanded when
new CPUs come online.
Do not hook fixed memory mapped ring buffers to the CPU hotplug callback,
otherwise it can cause a crash when it tries to add the buffer to the
memory that is already fully occupied.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241008143242.25e20801@gandalf.local.home
Fixes: be68d63a139bd ("ring-buffer: Add ring_buffer_alloc_range()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- ops.enqueue() didn't have a way to tell whether select_task_rq_scx()
and thus ops.select() were skipped. Some schedulers were incorrectly
using SCX_ENQ_WAKEUP. Add SCX_ENQ_CPU_SELECTED and fix scx_qmap using
it.
- Remove a spurious WARN_ON_ONCE() in scx_cgroup_exit()
- Fix error information clobbering during load
- Add missing __weak markers to BPF helper declarations
- Doc update
* tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Documentation: Update instructions for running example schedulers
sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED
sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called
sched/core: Make select_task_rq() take the pointer to wake_flags instead of value
sched_ext: scx_cgroup_exit() may be called without successful scx_cgroup_init()
sched_ext: Improve error reporting during loading
sched_ext: Add __weak markers to BPF helper function decalarations
|
|
The key_free LSM hook has been removed.
Remove the corresponding BPF hook.
Avoid warnings during the build:
BTFIDS vmlinux
WARN: resolve_btfids: unresolved symbol bpf_lsm_key_free
Fixes: 5f8d28f6d7d5 ("lsm: infrastructure management of the key security blob")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20241005-lsm-key_free-v1-1-42ea801dbd63@weissschuh.net
|
|
We need to free specs properly.
Fixes: 3d2786d65aaa ("bpf: correctly handle malformed BPF_CORE_TYPE_ID_LOCAL relos")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20241007160958.607434-1-jolsa@kernel.org
|
|
scx_qmap and other schedulers in the SCX repo are using SCX_ENQ_WAKEUP to
tell whether ops.select_cpu() was called. This is incorrect as
ops.select_cpu() can be skipped in the wakeup path and leads to e.g.
incorrectly skipping direct dispatch for tasks that are bound to a single
CPU.
sched core has been updated to specify ENQUEUE_RQ_SELECTED if
->select_task_rq() was called. Map it to SCX_ENQ_CPU_SELECTED and update
scx_qmap to test it instead of SCX_ENQ_WAKEUP.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
|
|
was called
During ttwu, ->select_task_rq() can be skipped if only one CPU is allowed or
migration is disabled. sched_ext schedulers may perform operations such as
direct dispatch from ->select_task_rq() path and it is useful for them to
know whether ->select_task_rq() was skipped in the ->enqueue_task() path.
Currently, sched_ext schedulers are using ENQUEUE_WAKEUP for this purpose
and end up assuming incorrectly that ->select_task_rq() was called for tasks
that are bound to a single CPU or migration disabled.
Make select_task_rq() indicate whether ->select_task_rq() was called by
setting WF_RQ_SELECTED in *wake_flags and make ttwu_do_activate() map that
to ENQUEUE_RQ_SELECTED for ->enqueue_task().
This will be used by sched_ext to fix ->select_task_rq() skip detection.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|
|
value
This will be used to allow select_task_rq() to indicate whether
->select_task_rq() was called by modifying *wake_flags.
This makes try_to_wake_up() call all functions that take wake_flags with
WF_TTWU set. Previously, only select_task_rq() was. Using the same flags is
more consistent, and, as the flag is only tested by ->select_task_rq()
implementations, it doesn't cause any behavior differences.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|
|
568894edbe48 ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations
and fix scx_tg_online()") assumed that scx_cgroup_exit() is only called
after scx_cgroup_init() finished successfully. This isn't true.
scx_cgroup_exit() can be called without scx_cgroup_init() being called at
all or after scx_cgroup_init() failed in the middle.
As init state is tracked per cgroup, scx_cgroup_exit() can be used safely to
clean up in all cases. Remove the incorrect WARN_ON_ONCE().
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 568894edbe48 ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()")
|
|
When the BPF scheduler fails, ops.exit() allows rich error reporting through
scx_exit_info. Use scx.exit() path consistently for all failures which can
be caused by the BPF scheduler:
- scx_ops_error() is called after ops.init() and ops.cgroup_init() failure
to record error information.
- ops.init_task() failure now uses scx_ops_error() instead of pr_err().
- The err_disable path updated to automatically trigger scx_ops_error() to
cover cases that the error message hasn't already been generated and
always return 0 indicating init success so that the error is reported
through ops.exit().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix tp_printk command line option crashing the kernel
With the code that can handle a buffer from a previous boot, the
trace_check_vprintf() needed access to the delta of the address space
used by the old buffer and the current buffer. To do so, the
trace_array (tr) parameter was used. But when tp_printk is enabled on
the kernel command line, no trace buffer is used and the trace event
is sent directly to printk(). That meant the tr field of the iterator
descriptor was NULL, and since tp_printk still uses
trace_check_vprintf() it caused a NULL dereference.
- Add ptrace.h include to x86 ftrace file for completeness
- Fix rtla installation when done with out-of-tree build
- Fix the help messages in rtla that were incorrect
- Several fixes to fix races with the timerlat and hwlat code
Several locking issues were discovered with the coordination between
timerlat kthread creation and hotplug. As timerlat has callbacks from
hotplug code to start kthreads when CPUs come online. There are also
locking issues with grabbing the cpu_read_lock() and the locks within
timerlat.
* tag 'trace-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/hwlat: Fix a race during cpuhp processing
tracing/timerlat: Fix a race during cpuhp processing
tracing/timerlat: Drop interface_lock in stop_kthread()
tracing/timerlat: Fix duplicated kthread creation due to CPU online/offline
x86/ftrace: Include <asm/ptrace.h>
rtla: Fix the help text in osnoise and timerlat top tools
tools/rtla: Fix installation from out-of-tree build
tracing: Fix trace_check_vprintf() when tp_printk is used
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
"Fixes for issues introduced in this merge window: kobject memory leak,
unsupressed warning and possible lockup in new slub_kunit tests,
misleading code in kvfree_rcu_queue_batch()"
* tag 'slab-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
slub/kunit: skip test_kfree_rcu when the slub kunit test is built-in
mm, slab: suppress warnings in test_leak_destroy kunit test
rcu/kvfree: Refactor kvfree_rcu_queue_batch()
mm, slab: fix use of SLAB_SUPPORTS_SYSFS in kmem_cache_release()
|
|
Pull close_range() fix from Al Viro:
"Fix the logic in descriptor table trimming"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
close_range(): fix the logics in descriptor table trimming
|
|
Brandon reports sporadic, non-sensical spikes in cumulative pressure
time (total=) when reading cpu.pressure at a high rate. This is due to
a race condition between reader aggregation and tasks changing states.
While it affects all states and all resources captured by PSI, in
practice it most likely triggers with CPU pressure, since scheduling
events are so frequent compared to other resource events.
The race context is the live snooping of ongoing stalls during a
pressure read. The read aggregates per-cpu records for stalls that
have concluded, but will also incorporate ad-hoc the duration of any
active state that hasn't been recorded yet. This is important to get
timely measurements of ongoing stalls. Those ad-hoc samples are
calculated on-the-fly up to the current time on that CPU; since the
stall hasn't concluded, it's expected that this is the minimum amount
of stall time that will enter the per-cpu records once it does.
The problem is that the path that concludes the state uses a CPU clock
read that is not synchronized against aggregators; the clock is read
outside of the seqlock protection. This allows aggregators to race and
snoop a stall with a longer duration than will actually be recorded.
With the recorded stall time being less than the last snapshot
remembered by the aggregator, a subsequent sample will underflow and
observe a bogus delta value, resulting in an erratic jump in pressure.
Fix this by moving the clock read of the state change into the seqlock
protection. This ensures no aggregation can snoop live stalls past the
time that's recorded when the state concludes.
Reported-by: Brandon Duffany <brandon@buildbuddy.io>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
Link: https://lore.kernel.org/lkml/20240827121851.GB438928@cmpxchg.org/
Fixes: df77430639c9 ("psi: Reduce calls to sched_clock() in psi")
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The cpuhp online/offline processing race also exists in percpu-mode hwlat
tracer in theory, apply the fix too. That is:
T1 | T2
[CPUHP_ONLINE] | cpu_device_down()
hwlat_hotplug_workfn() |
| cpus_write_lock()
| takedown_cpu(1)
| cpus_write_unlock()
[CPUHP_OFFLINE] |
cpus_read_lock() |
start_kthread(1) |
cpus_read_unlock() |
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-5-liwei391@huawei.com
Fixes: ba998f7d9531 ("trace/hwlat: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
There is another found exception that the "timerlat/1" thread was
scheduled on CPU0, and lead to timer corruption finally:
```
ODEBUG: init active (active state 0) object: ffff888237c2e108 object type: hrtimer hint: timerlat_irq+0x0/0x220
WARNING: CPU: 0 PID: 426 at lib/debugobjects.c:518 debug_print_object+0x7d/0xb0
Modules linked in:
CPU: 0 UID: 0 PID: 426 Comm: timerlat/1 Not tainted 6.11.0-rc7+ #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:debug_print_object+0x7d/0xb0
...
Call Trace:
<TASK>
? __warn+0x7c/0x110
? debug_print_object+0x7d/0xb0
? report_bug+0xf1/0x1d0
? prb_read_valid+0x17/0x20
? handle_bug+0x3f/0x70
? exc_invalid_op+0x13/0x60
? asm_exc_invalid_op+0x16/0x20
? debug_print_object+0x7d/0xb0
? debug_print_object+0x7d/0xb0
? __pfx_timerlat_irq+0x10/0x10
__debug_object_init+0x110/0x150
hrtimer_init+0x1d/0x60
timerlat_main+0xab/0x2d0
? __pfx_timerlat_main+0x10/0x10
kthread+0xb7/0xe0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2d/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
```
After tracing the scheduling event, it was discovered that the migration
of the "timerlat/1" thread was performed during thread creation. Further
analysis confirmed that it is because the CPU online processing for
osnoise is implemented through workers, which is asynchronous with the
offline processing. When the worker was scheduled to create a thread, the
CPU may has already been removed from the cpu_online_mask during the offline
process, resulting in the inability to select the right CPU:
T1 | T2
[CPUHP_ONLINE] | cpu_device_down()
osnoise_hotplug_workfn() |
| cpus_write_lock()
| takedown_cpu(1)
| cpus_write_unlock()
[CPUHP_OFFLINE] |
cpus_read_lock() |
start_kthread(1) |
cpus_read_unlock() |
To fix this, skip online processing if the CPU is already offline.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-4-liwei391@huawei.com
Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
stop_kthread() is the offline callback for "trace/osnoise:online", since
commit 5bfbcd1ee57b ("tracing/timerlat: Add interface_lock around clearing
of kthread in stop_kthread()"), the following ABBA deadlock scenario is
introduced:
T1 | T2 [BP] | T3 [AP]
osnoise_hotplug_workfn() | work_for_cpu_fn() | cpuhp_thread_fun()
| _cpu_down() | osnoise_cpu_die()
mutex_lock(&interface_lock) | | stop_kthread()
| cpus_write_lock() | mutex_lock(&interface_lock)
cpus_read_lock() | cpuhp_kick_ap() |
As the interface_lock here in just for protecting the "kthread" field of
the osn_var, use xchg() instead to fix this issue. Also use
for_each_online_cpu() back in stop_per_cpu_kthreads() as it can take
cpu_read_lock() again.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-3-liwei391@huawei.com
Fixes: 5bfbcd1ee57b ("tracing/timerlat: Add interface_lock around clearing of kthread in stop_kthread()")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
osnoise_hotplug_workfn() is the asynchronous online callback for
"trace/osnoise:online". It may be congested when a CPU goes online and
offline repeatedly and is invoked for multiple times after a certain
online.
This will lead to kthread leak and timer corruption. Add a check
in start_kthread() to prevent this situation.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-2-liwei391@huawei.com
Fixes: c8895e271f79 ("trace/osnoise: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
When the tp_printk kernel command line is used, the trace events go
directly to printk(). It is still checked via the trace_check_vprintf()
function to make sure the pointers of the trace event are legit.
The addition of reading buffers from previous boots required adding a
delta between the addresses of the previous boot and the current boot so
that the pointers in the old buffer can still be used. But this required
adding a trace_array pointer to acquire the delta offsets.
The tp_printk code does not provide a trace_array (tr) pointer, so when
the offsets were examined, a NULL pointer dereference happened and the
kernel crashed.
If the trace_array does not exist, just default the delta offsets to zero,
as that also means the trace event is not being read from a previous boot.
Link: https://lore.kernel.org/all/Zv3z5UsG_jsO9_Tb@aschofie-mobl2.lan/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241003104925.4e1b1fd9@gandalf.local.home
Fixes: 07714b4bb3f98 ("tracing: Handle old buffer mappings for event strings and functions")
Reported-by: Alison Schofield <alison.schofield@intel.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull generic unaligned.h cleanups from Al Viro:
"Get rid of architecture-specific <asm/unaligned.h> includes, replacing
them with a single generic <linux/unaligned.h> header file.
It's the second largest (after asm/io.h) class of asm/* includes, and
all but two architectures actually end up using exact same file.
Massage the remaining two (arc and parisc) to do the same and just
move the thing to from asm-generic/unaligned.h to linux/unaligned.h"
[ This is one of those things that we're better off doing outside the
merge window, and would only cause extra conflict noise if it was in
linux-next for the next release due to all the trivial #include line
updates. Rip off the band-aid. - Linus ]
* tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
move asm/unaligned.h to linux/unaligned.h
arc: get rid of private asm/unaligned.h
parisc: get rid of private asm/unaligned.h
|
|
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.
auto-generated by the following:
for i in `git grep -l -w asm/unaligned.h`; do
sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
|
|
rxq contains a pointer to the device from where
the redirect happened. Currently, the BPF program
that was executed after a redirect via BPF_MAP_TYPE_DEVMAP*
does not have it set.
This is particularly bad since accessing ingress_ifindex, e.g.
SEC("xdp")
int prog(struct xdp_md *pkt)
{
return bpf_redirect_map(&dev_redirect_map, 0, 0);
}
SEC("xdp/devmap")
int prog_after_redirect(struct xdp_md *pkt)
{
bpf_printk("ifindex %i", pkt->ingress_ifindex);
return XDP_PASS;
}
depends on access to rxq, so a NULL pointer gets dereferenced:
<1>[ 574.475170] BUG: kernel NULL pointer dereference, address: 0000000000000000
<1>[ 574.475188] #PF: supervisor read access in kernel mode
<1>[ 574.475194] #PF: error_code(0x0000) - not-present page
<6>[ 574.475199] PGD 0 P4D 0
<4>[ 574.475207] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
<4>[ 574.475217] CPU: 4 UID: 0 PID: 217 Comm: kworker/4:1 Not tainted 6.11.0-rc5-reduced-00859-g780801200300 #23
<4>[ 574.475226] Hardware name: Intel(R) Client Systems NUC13ANHi7/NUC13ANBi7, BIOS ANRPL357.0026.2023.0314.1458 03/14/2023
<4>[ 574.475231] Workqueue: mld mld_ifc_work
<4>[ 574.475247] RIP: 0010:bpf_prog_5e13354d9cf5018a_prog_after_redirect+0x17/0x3c
<4>[ 574.475257] Code: cc cc cc cc cc cc cc 80 00 00 00 cc cc cc cc cc cc cc cc f3 0f 1e fa 0f 1f 44 00 00 66 90 55 48 89 e5 f3 0f 1e fa 48 8b 57 20 <48> 8b 52 00 8b 92 e0 00 00 00 48 bf f8 a6 d5 c4 5d a0 ff ff be 0b
<4>[ 574.475263] RSP: 0018:ffffa62440280c98 EFLAGS: 00010206
<4>[ 574.475269] RAX: ffffa62440280cd8 RBX: 0000000000000001 RCX: 0000000000000000
<4>[ 574.475274] RDX: 0000000000000000 RSI: ffffa62440549048 RDI: ffffa62440280ce0
<4>[ 574.475278] RBP: ffffa62440280c98 R08: 0000000000000002 R09: 0000000000000001
<4>[ 574.475281] R10: ffffa05dc8b98000 R11: ffffa05f577fca40 R12: ffffa05dcab24000
<4>[ 574.475285] R13: ffffa62440280ce0 R14: ffffa62440549048 R15: ffffa62440549000
<4>[ 574.475289] FS: 0000000000000000(0000) GS:ffffa05f4f700000(0000) knlGS:0000000000000000
<4>[ 574.475294] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 574.475298] CR2: 0000000000000000 CR3: 000000025522e000 CR4: 0000000000f50ef0
<4>[ 574.475303] PKRU: 55555554
<4>[ 574.475306] Call Trace:
<4>[ 574.475313] <IRQ>
<4>[ 574.475318] ? __die+0x23/0x70
<4>[ 574.475329] ? page_fault_oops+0x180/0x4c0
<4>[ 574.475339] ? skb_pp_cow_data+0x34c/0x490
<4>[ 574.475346] ? kmem_cache_free+0x257/0x280
<4>[ 574.475357] ? exc_page_fault+0x67/0x150
<4>[ 574.475368] ? asm_exc_page_fault+0x26/0x30
<4>[ 574.475381] ? bpf_prog_5e13354d9cf5018a_prog_after_redirect+0x17/0x3c
<4>[ 574.475386] bq_xmit_all+0x158/0x420
<4>[ 574.475397] __dev_flush+0x30/0x90
<4>[ 574.475407] veth_poll+0x216/0x250 [veth]
<4>[ 574.475421] __napi_poll+0x28/0x1c0
<4>[ 574.475430] net_rx_action+0x32d/0x3a0
<4>[ 574.475441] handle_softirqs+0xcb/0x2c0
<4>[ 574.475451] do_softirq+0x40/0x60
<4>[ 574.475458] </IRQ>
<4>[ 574.475461] <TASK>
<4>[ 574.475464] __local_bh_enable_ip+0x66/0x70
<4>[ 574.475471] __dev_queue_xmit+0x268/0xe40
<4>[ 574.475480] ? selinux_ip_postroute+0x213/0x420
<4>[ 574.475491] ? alloc_skb_with_frags+0x4a/0x1d0
<4>[ 574.475502] ip6_finish_output2+0x2be/0x640
<4>[ 574.475512] ? nf_hook_slow+0x42/0xf0
<4>[ 574.475521] ip6_finish_output+0x194/0x300
<4>[ 574.475529] ? __pfx_ip6_finish_output+0x10/0x10
<4>[ 574.475538] mld_sendpack+0x17c/0x240
<4>[ 574.475548] mld_ifc_work+0x192/0x410
<4>[ 574.475557] process_one_work+0x15d/0x380
<4>[ 574.475566] worker_thread+0x29d/0x3a0
<4>[ 574.475573] ? __pfx_worker_thread+0x10/0x10
<4>[ 574.475580] ? __pfx_worker_thread+0x10/0x10
<4>[ 574.475587] kthread+0xcd/0x100
<4>[ 574.475597] ? __pfx_kthread+0x10/0x10
<4>[ 574.475606] ret_from_fork+0x31/0x50
<4>[ 574.475615] ? __pfx_kthread+0x10/0x10
<4>[ 574.475623] ret_from_fork_asm+0x1a/0x30
<4>[ 574.475635] </TASK>
<4>[ 574.475637] Modules linked in: veth br_netfilter bridge stp llc iwlmvm x86_pkg_temp_thermal iwlwifi efivarfs nvme nvme_core
<4>[ 574.475662] CR2: 0000000000000000
<4>[ 574.475668] ---[ end trace 0000000000000000 ]---
Therefore, provide it to the program by setting rxq properly.
Fixes: cb261b594b41 ("bpf: Run devmap xdp_prog on flush instead of bulk enqueue")
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Florian Kauer <florian.kauer@linutronix.de>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20240911-devel-koalo-fix-ingress-ifindex-v4-1-5c643ae10258@linutronix.de
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
introduced a mechanism that a wakee with shorter slice could preempt
the current running task. It also lower the bar for the current task
to be preempted, by checking the rq->nr_running instead of cfs_rq->nr_running
when the current task has ran out of time slice. But there is a scenario
that is problematic. Say, if there is 1 cfs task and 1 rt task, before
85e511df3cec, update_deadline() will not trigger a reschedule, and after
85e511df3cec, since rq->nr_running is 2 and resched is true, a resched_curr()
would happen.
Some workloads (like the hackbench reported by lkp) do not like
over-scheduling. We can see that the preemption rate has been
increased by 2.2%:
1.654e+08 +2.2% 1.69e+08 hackbench.time.involuntary_context_switches
Restore its previous check criterion.
Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@intel.com
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Honglei Wang <jameshongleiwang@126.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20240925085440.358138-1-yu.c.chen@intel.com
|
|
Meeting an unfinished DELAY_DEQUEUE treated entity in unthrottle_cfs_rq()
leads to a couple terminal scenarios. Finish it first, so ENQUEUE_WAKEUP
can proceed as it would have sans DELAY_DEQUEUE treatment.
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com>
Link: https://lore.kernel.org/r/7515d2e64c989b9e3b828a9e21bcd959b99df06a.camel@gmx.de
|
|
Improve readability of kvfree_rcu_queue_batch() function
in away that, after a first batch queuing, the loop is break
and success value is returned to a caller.
There is no reason to loop and check batches further as all
outstanding objects have already been picked and attached to
a certain batch to complete an offloading.
Fixes: 2b55d6a42d14 ("rcu/kvfree: Add kvfree_rcu_barrier() API")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Closes: https://lore.kernel.org/lkml/ZvWUt2oyXRsvJRNc@pc636/T/
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Range propagation must not affect subreg_def marks, otherwise the
following example is rewritten by verifier incorrectly when
BPF_F_TEST_RND_HI32 flag is set:
0: call bpf_ktime_get_ns call bpf_ktime_get_ns
1: r0 &= 0x7fffffff after verifier r0 &= 0x7fffffff
2: w1 = w0 rewrites w1 = w0
3: if w0 < 10 goto +0 --------------> r11 = 0x2f5674a6 (r)
4: r1 >>= 32 r11 <<= 32 (r)
5: r0 = r1 r1 |= r11 (r)
6: exit; if w0 < 0xa goto pc+0
r1 >>= 32
r0 = r1
exit
(or zero extension of w1 at (2) is missing for architectures that
require zero extension for upper register half).
The following happens w/o this patch:
- r0 is marked as not a subreg at (0);
- w1 is marked as subreg at (2);
- w1 subreg_def is overridden at (3) by copy_register_state();
- w1 is read at (5) but mark_insn_zext() does not mark (2)
for zero extension, because w1 subreg_def is not set;
- because of BPF_F_TEST_RND_HI32 flag verifier inserts random
value for hi32 bits of (2) (marked (r));
- this random value is read at (5).
Fixes: 75748837b7e5 ("bpf: Propagate scalar ranges through register assignments.")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Closes: https://lore.kernel.org/bpf/7e2aa30a62d740db182c170fdd8f81c596df280d.camel@gmail.com
Link: https://lore.kernel.org/bpf/20240924210844.1758441-1-eddyz87@gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- When sched_ext is in bypass mode (e.g. while disabling the BPF
scheduler), it was using one DSQ to implement global FIFO scheduling
as all it has to do is guaranteeing reasonable forward progress.
On multi-socket machines, this can lead to live-lock conditions under
certain workloads. Fixed by splitting the queue used for FIFO
scheduling per NUMA node. This required several preparation patches.
- Hotplug tests on powerpc could reliably trigger deadlock while
enabling a BPF scheduler.
This was caused by cpu_hotplug_lock nesting inside scx_fork_rwsem and
then CPU hotplug path trying to fork a new thread while holding
cpu_hotplug_lock.
Fixed by restructuring locking in enable and disable paths so that
the two locks are not coupled. This required several preparation
patches which also fixed a couple other issues in the enable path.
- A build fix for !CONFIG_SMP
- Userspace tooling sync and updates
* tag 'sched_ext-for-6.12-rc1-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Remove redundant p->nr_cpus_allowed checker
sched_ext: Decouple locks in scx_ops_enable()
sched_ext: Decouple locks in scx_ops_disable_workfn()
sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()
sched_ext: Enable scx_ops_init_task() separately
sched_ext: Fix SCX_TASK_INIT -> SCX_TASK_READY transitions in scx_ops_enable()
sched_ext: Initialize in bypass mode
sched_ext: Remove SCX_OPS_PREPPING
sched_ext: Relocate check_hotplug_seq() call in scx_ops_enable()
sched_ext: Use shorter slice while bypassing
sched_ext: Split the global DSQ per NUMA node
sched_ext: Relocate find_user_dsq()
sched_ext: Allow only user DSQs for scx_bpf_consume(), scx_bpf_dsq_nr_queued() and bpf_iter_scx_dsq_new()
scx_flatcg: Use a user DSQ for fallback instead of SCX_DSQ_GLOBAL
tools/sched_ext: Receive misc updates from SCX repo
sched_ext: Add __COMPAT helpers for features added during v6.12 devel cycle
sched_ext: Build fix for !CONFIG_SMP
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fix from Masami Hiramatsu:
- uprobes: fix kernel info leak via "[uprobes]" vma
Fix uprobes not to expose the uninitialized page for trampoline
buffer to user space, which can leak kernel info.
* tag 'probes-fixes-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
uprobes: fix kernel info leak via "[uprobes]" vma
|
|
Cloning a descriptor table picks the size that would cover all currently
opened files. That's fine for clone() and unshare(), but for close_range()
there's an additional twist - we clone before we close, and it would be
a shame to have
close_range(3, ~0U, CLOSE_RANGE_UNSHARE)
leave us with a huge descriptor table when we are not going to keep
anything past stderr, just because some large file descriptor used to
be open before our call has taken it out.
Unfortunately, it had been dealt with in an inherently racy way -
sane_fdtable_size() gets a "don't copy anything past that" argument
(passed via unshare_fd() and dup_fd()), close_range() decides how much
should be trimmed and passes that to unshare_fd().
The problem is, a range that used to extend to the end of descriptor
table back when close_range() had looked at it might very well have stuff
grown after it by the time dup_fd() has allocated a new files_struct
and started to figure out the capacity of fdtable to be attached to that.
That leads to interesting pathological cases; at the very least it's a
QoI issue, since unshare(CLONE_FILES) is atomic in a sense that it takes
a snapshot of descriptor table one might have observed at some point.
Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination
of unshare(CLONE_FILES) with plain close_range(), ending up with a
weird state that would never occur with unshare(2) is confusing, to put
it mildly.
It's not hard to get rid of - all it takes is passing both ends of the
range down to sane_fdtable_size(). There we are under ->files_lock,
so the race is trivially avoided.
So we do the following:
* switch close_files() from calling unshare_fd() to calling
dup_fd().
* undo the calling convention change done to unshare_fd() in
60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
* introduce struct fd_range, pass a pointer to that to dup_fd()
and sane_fdtable_size() instead of "trim everything past that point"
they are currently getting. NULL means "we are not going to be punching
any holes"; NR_OPEN_MAX is gone.
* make sane_fdtable_size() use find_last_bit() instead of
open-coding it; it's easier to follow that way.
* while we are at it, have dup_fd() report errors by returning
ERR_PTR(), no need to use a separate int *errorp argument.
Fixes: 60997c3d45d9 "close_range: add CLOSE_RANGE_UNSHARE"
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
xol_add_vma() maps the uninitialized page allocated by __create_xol_area()
into userspace. On some architectures (x86) this memory is readable even
without VM_READ, VM_EXEC results in the same pgprot_t as VM_EXEC|VM_READ,
although this doesn't really matter, debugger can read this memory anyway.
Link: https://lore.kernel.org/all/20240929162047.GA12611@redhat.com/
Reported-by: Will Deacon <will@kernel.org>
Fixes: d4b3b6384f98 ("uprobes/core: Allocate XOL slots for uprobes use")
Cc: stable@vger.kernel.org
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"lockdep:
- Fix potential deadlock between lockdep and RCU (Zhiguo Niu)
- Use str_plural() to address Coccinelle warning (Thorsten Blum)
- Add debuggability enhancement (Luis Claudio R. Goncalves)
static keys & calls:
- Fix static_key_slow_dec() yet again (Peter Zijlstra)
- Handle module init failure correctly in static_call_del_module()
(Thomas Gleixner)
- Replace pointless WARN_ON() in static_call_module_notify() (Thomas
Gleixner)
<linux/cleanup.h>:
- Add usage and style documentation (Dan Williams)
rwsems:
- Move is_rwsem_reader_owned() and rwsem_owner() under
CONFIG_DEBUG_RWSEMS (Waiman Long)
atomic ops, x86:
- Redeclare x86_32 arch_atomic64_{add,sub}() as void (Uros Bizjak)
- Introduce the read64_nonatomic macro to x86_32 with cx8 (Uros
Bizjak)"
Signed-off-by: Ingo Molnar <mingo@kernel.org>
* tag 'locking-urgent-2024-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Move is_rwsem_reader_owned() and rwsem_owner() under CONFIG_DEBUG_RWSEMS
jump_label: Fix static_key_slow_dec() yet again
static_call: Replace pointless WARN_ON() in static_call_module_notify()
static_call: Handle module init failure correctly in static_call_del_module()
locking/lockdep: Simplify character output in seq_line()
lockdep: fix deadlock issue between lockdep and rcu
lockdep: Use str_plural() to fix Coccinelle warning
cleanup: Add usage and style documentation
lockdep: suggest the fix for "lockdep bfs error:-1" on print_bfs_bug
locking/atomic/x86: Redeclare x86_32 arch_atomic64_{add,sub}() as void
locking/atomic/x86: Introduce the read64_nonatomic macro to x86_32 with cx8
|
|
Merge all pending locking commits into a single branch.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux
Pull module updates from Luis Chamberlain:
"There are a few fixes / cleanups from Vincent, Chunhui, and Petr, but
the most important part of this pull request is the Rust community
stepping up to help maintain both C / Rust code for future Rust module
support. We grow the set of modules maintainers by three now, and with
this hope to scale to help address what's needed to properly support
future Rust module support.
A lot of exciting stuff coming in future kernel releases.
This has been on linux-next for ~ 3 weeks now with no issues"
* tag 'modules-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux:
module: Refine kmemleak scanned areas
module: abort module loading when sysfs setup suffer errors
MAINTAINERS: scale modules with more reviewers
module: Clean up the description of MODULE_SIG_<type>
module: Split modules_install compression and in-kernel decompression
|
|
select_rq_task() already checked that 'p->nr_cpus_allowed > 1',
'p->nr_cpus_allowed == 1' checker in scx_select_cpu_dfl() is redundant.
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The enable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together which is prone to
locking order problems.
For example, currently, there is a possible deadlock involving
scx_fork_rwsem and cpus_read_lock. cpus_read_lock has to nest inside
scx_fork_rwsem due to locking order existing in other subsystems. However,
there exists a dependency in the other direction during hotplug if hotplug
needs to fork a new task, which happens in some cases. This leads to the
following deadlock:
scx_ops_enable() hotplug
percpu_down_write(&cpu_hotplug_lock)
percpu_down_write(&scx_fork_rwsem)
block on cpu_hotplug_lock
kthread_create() waits for kthreadd
kthreadd blocks on scx_fork_rwsem
Note that this doesn't trigger lockdep because the hotplug side dependency
bounces through kthreadd.
With the preceding scx_cgroup_enabled change, this can be solved by
decoupling cpus_read_lock, which is needed for static_key manipulations,
from the other two locks.
- Move the first block of static_key manipulations outside of scx_fork_rwsem
and scx_cgroup_rwsem. This is now safe with the preceding
scx_cgroup_enabled change.
- Drop scx_cgroup_rwsem and scx_fork_rwsem between the two task iteration
blocks so that __scx_ops_enabled static_key enabling is outside the two
rwsems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
|
|
The disable path uses three big locks - scx_fork_rwsem, scx_cgroup_rwsem and
cpus_read_lock. Currently, the locks are grabbed together which is prone to
locking order problems. With the preceding scx_cgroup_enabled change, we can
decouple them:
- As cgroup disabling no longer requires modifying a static_key which
requires cpus_read_lock(), no need to grab cpus_read_lock() before
grabbing scx_cgroup_rwsem.
- cgroup can now be independently disabled before tasks are moved back to
the fair class.
Relocate scx_cgroup_exit() invocation before scx_fork_rwsem is grabbed, drop
now unnecessary cpus_read_lock() and move static_key operations out of
scx_fork_rwsem. This decouples all three locks in the disable path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Link: http://lkml.kernel.org/r/8cd0ec0c4c7c1bc0119e61fbef0bee9d5e24022d.camel@linux.ibm.com
|
|
scx_tg_online()
If the BPF scheduler does not implement ops.cgroup_init(), scx_tg_online()
didn't set SCX_TG_INITED which meant that ops.cgroup_exit(), even if
implemented, won't be called from scx_tg_offline(). This is because
SCX_HAS_OP(cgroupt_init) is used to test both whether SCX cgroup operations
are enabled and ops.cgroup_init() exists.
Fix it by introducing a separate bool scx_cgroup_enabled to gate cgroup
operations and use SCX_HAS_OP(cgroup_init) only to test whether
ops.cgroup_init() exists. Make all cgroup operations consistently use
scx_cgroup_enabled to test whether cgroup operations are enabled.
scx_cgroup_enabled is added instead of using scx_enabled() to ease planned
locking updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_ops_init_task() and the follow-up scx_ops_enable_task() in the fork path
were gated by scx_enabled() test and thus __scx_ops_enabled had to be turned
on before the first scx_ops_init_task() loop in scx_ops_enable(). However,
if an external entity causes sched_class switch before the loop is complete,
tasks which are not initialized could be switched to SCX.
The following can be reproduced by running a program which keeps toggling a
process between SCHED_OTHER and SCHED_EXT using sched_setscheduler(2).
sched_ext: Invalid task state transition 0 -> 3 for fish[1623]
WARNING: CPU: 1 PID: 1650 at kernel/sched/ext.c:3392 scx_ops_enable_task+0x1a1/0x200
...
Sched_ext: simple (enabling)
RIP: 0010:scx_ops_enable_task+0x1a1/0x200
...
switching_to_scx+0x13/0xa0
__sched_setscheduler+0x850/0xa50
do_sched_setscheduler+0x104/0x1c0
__x64_sys_sched_setscheduler+0x18/0x30
do_syscall_64+0x7b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by gating scx_ops_init_task() separately using
scx_ops_init_task_enabled. __scx_ops_enabled is now set after all tasks are
finished with scx_ops_init_task().
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
scx_ops_enable() has two task iteration loops. The first one calls
scx_ops_init_task() on every task and the latter switches the eligible ones
into SCX. The first loop left the tasks in SCX_TASK_INIT state and then the
second loop switched it into READY before switching the task into SCX.
The distinction between INIT and READY is only meaningful in the fork path
where it's used to tell whether the task finished forking so that we can
tell ops.exit_task() accordingly. Leaving task in INIT state between the two
loops is incosistent with the fork path and incorrect. The following can be
triggered by running a program which keeps toggling a task between
SCHED_OTHER and SCHED_SCX while enabling a task:
sched_ext: Invalid task state transition 1 -> 3 for fish[1526]
WARNING: CPU: 2 PID: 1615 at kernel/sched/ext.c:3393 scx_ops_enable_task+0x1a1/0x200
...
Sched_ext: qmap (enabling+all)
RIP: 0010:scx_ops_enable_task+0x1a1/0x200
...
switching_to_scx+0x13/0xa0
__sched_setscheduler+0x850/0xa50
do_sched_setscheduler+0x104/0x1c0
__x64_sys_sched_setscheduler+0x18/0x30
do_syscall_64+0x7b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fix it by transitioning to READY in the first loop right after
scx_ops_init_task() succeeds.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
|
|
scx_ops_enable() used preempt_disable() around the task iteration loop to
switch tasks into SCX to guarantee forward progress of the task which is
running scx_ops_enable(). However, in the gap between setting
__scx_ops_enabled and preeempt_disable(), an external entity can put tasks
including the enabling one into SCX prematurely, which can lead to
malfunctions including stalls.
The bypass mode can wrap the entire enabling operation and guarantee forward
progress no matter what the BPF scheduler does. Use the bypass mode instead
to guarantee forward progress while enabling.
While at it, release and regrab scx_tasks_lock between the two task
iteration locks in scx_ops_enable() for clarity as there is no reason to
keep holding the lock between them.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The distinction between SCX_OPS_PREPPING and SCX_OPS_ENABLING is not used
anywhere and only adds confusion. Drop SCX_OPS_PREPPING.
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
check_hotplug_seq() is used to detect CPU hotplug event which occurred while
the BPF scheduler is being loaded so that initialization can be retried if
CPU hotplug events take place before the CPU hotplug callbacks are online.
As such, the best place to call it is in the same cpu_read_lock() section
that enables the CPU hotplug ops. Currently, it is called in the next
cpus_read_lock() block in scx_ops_enable(). The side effect of this
placement is a small window in which hotplug sequence detection can trigger
unnecessarily, which isn't critical.
Move check_hotplug_seq() invocation to the same cpus_read_lock() block as
the hotplug operation enablement to close the window and get the invocation
out of the way for planned locking updates.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
|
|
no_llseek had been defined to NULL two years ago, in commit 868941b14441
("fs: remove no_llseek")
To quote that commit,
At -rc1 we'll need do a mechanical removal of no_llseek -
git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.
Unfortunately, that hadn't been done. Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
.llseek = no_llseek,
so it's obviously safe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
While bypassing, tasks are scheduled in FIFO order which favors tasks that
hog CPUs. This can slow down e.g. unloading of the BPF scheduler. While
bypassing, guaranteeing timely forward progress is the main goal. There's no
point in giving long slices. Shorten the time slice used while bypassing
from 20ms to 5ms.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|
|
In the bypass mode, the global DSQ is used to schedule all tasks in simple
FIFO order. All tasks are queued into the global DSQ and all CPUs try to
execute tasks from it. This creates a lot of cross-node cacheline accesses
and scheduling across the node boundaries, and can lead to live-lock
conditions where the system takes tens of minutes to disable the BPF
scheduler while executing in the bypass mode.
Split the global DSQ per NUMA node. Each node has its own global DSQ. When a
task is dispatched to SCX_DSQ_GLOBAL, it's put into the global DSQ local to
the task's CPU and all CPUs in a node only consume its node-local global
DSQ.
This resolves a livelock condition which could be reliably triggered on an
2x EPYC 7642 system by running `stress-ng --race-sched 1024` together with
`stress-ng --workload 80 --workload-threads 10` while repeatedly enabling
and disabling a SCX scheduler.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
|