path: root/kernel
Age | Commit message | Author | Files | Lines
2025-05-03 | futex: Add basic infrastructure for local task local hash | Sebastian Andrzej Siewior | 4 | -18/+206
The futex hash is system wide and shared by all tasks. Each slot is hashed based on the futex address and the VMA of the thread. Due to randomized VMAs (and memory allocations) the same logical lock (pointer) can end up in a different hash bucket on each invocation of the application. This in turn means that different applications may share a hash bucket on the first invocation but not on the second, and it is not always clear which applications will be involved. This can result in high latencies when acquiring the futex_hash_bucket::lock, especially if the lock owner is limited to a CPU and cannot be effectively PI boosted. Introduce basic infrastructure for a process local hash which is shared by all threads of the process. This hash will only be used for a PROCESS_PRIVATE FUTEX operation. The hashmap can be allocated via: prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, num); A `num' of 0 means that the global hash is used instead of a private hash. Other values for `num' specify the number of slots for the hash, and the number must be a power of two, starting with two. The prctl() returns zero on success. This function can only be used before a thread is created. The current status for the private hash can be queried via: num = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS); which returns the current number of slots. The value 0 means that the global hash is used. Values greater than 0 indicate the number of slots that are used. A negative number indicates an error. For optimisation, for the private hash jhash2() uses only two arguments: the address and the offset. This omits the VMA, which is always the same. [peterz: Use 0 for global hash. A bit shuffling and renaming. ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-13-bigeasy@linutronix.de
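As an illustration, a minimal user-space sketch of the prctl() interface described above. The PR_FUTEX_HASH constants are taken from the commit message and require uapi headers that carry this series; treat this as a hedged example, not kernel documentation.

	#include <sys/prctl.h>
	#include <stdio.h>

	/* Hypothetical helper: size the private futex hash before any thread
	 * is created. 'slots' must be 0 (use the global hash) or a power of
	 * two >= 2, as described in the commit message. */
	static int setup_private_futex_hash(unsigned long slots)
	{
		if (prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_SET_SLOTS, slots) != 0)
			return -1;

		int cur = prctl(PR_FUTEX_HASH, PR_FUTEX_HASH_GET_SLOTS);
		if (cur < 0)
			return -1;
		printf("private futex hash slots: %d (0 == global hash)\n", cur);
		return 0;
	}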
2025-05-03 | futex: Create helper function to initialize a hash slot | Sebastian Andrzej Siewior | 1 | -5/+9
Factor out the futex_hash_bucket initialisation into a helper function. The helper function will be used in a follow-up patch implementing process private hash buckets. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-12-bigeasy@linutronix.de
2025-05-03 | futex: Introduce futex_q_lockptr_lock() | Sebastian Andrzej Siewior | 4 | -6/+53
futex_lock_pi() and __fixup_pi_state_owner() acquire the futex_q::lock_ptr without holding a reference, assuming the previously obtained hash bucket and the assigned lock_ptr are still valid. This isn't the case once the private hash can be resized and becomes invalid after the reference drop. Introduce futex_q_lockptr_lock() to lock the hash bucket recorded in futex_q::lock_ptr. The lock pointer is read in an RCU section to ensure that it does not go away if the hash bucket has been replaced and the old pointer has been observed. After locking, the pointer needs to be compared to check if it changed. If so, then the hash bucket has been replaced, the user has been moved to the new one and lock_ptr has been updated. The lock operation needs to be redone in this case. The locked hash bucket is not returned. A special case is an early return in futex_lock_pi() (due to signal or timeout) and a successful futex_wait_requeue_pi(). In both cases a valid futex_q::lock_ptr is expected (and its matching hash bucket), but since the waiter has been removed from the hash this can no longer be guaranteed. Therefore, before the waiter is removed, a reference is acquired which is later dropped by the waiter to avoid a resize. Add futex_q_lockptr_lock() and use it. Acquire an additional reference in requeue_pi_wake_futex() and futex_unlock_pi() while the futex_q is removed, denote this extra reference in futex_q::drop_hb_ref and let the waiter drop the reference in this case. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-11-bigeasy@linutronix.de
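A hedged sketch of the lock/recompare pattern the commit describes; the real futex_q_lockptr_lock() in kernel/futex/ may differ in detail (RCU handling, lockdep annotations):

	static void futex_q_lockptr_lock(struct futex_q *q)
	{
		spinlock_t *lock_ptr;

	retry:
		/* Read under RCU so a replaced hash bucket cannot go away while
		 * the old pointer is being observed. */
		lock_ptr = READ_ONCE(q->lock_ptr);
		spin_lock(lock_ptr);
		if (unlikely(lock_ptr != READ_ONCE(q->lock_ptr))) {
			/* Hash resized: the waiter was moved and lock_ptr was
			 * updated; redo the lock operation on the new bucket. */
			spin_unlock(lock_ptr);
			goto retry;
		}
	}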
2025-05-03 | futex: Decrease the waiter count before the unlock operation | Sebastian Andrzej Siewior | 2 | -5/+5
To support runtime resizing of the process private hash, it's required to not use the obtained hash bucket once the reference count has been dropped. The reference will be dropped after the unlock of the hash bucket. The number of waiters is decremented after the unlock operation. There is no requirement that this needs to happen after the unlock. The increment happens before acquiring the lock to signal early that there will be a waiter. The waiter can avoid blocking on the lock if it is known that there will be no waiter. There is no difference in terms of ordering if the decrement happens before or after the unlock. Decrease the waiter count before the unlock operation. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-10-bigeasy@linutronix.de
2025-05-03 | futex: Acquire a hash reference in futex_wait_multiple_setup() | Sebastian Andrzej Siewior | 1 | -0/+6
futex_wait_multiple_setup() changes task_struct::__state to !TASK_RUNNING and then enqueues on multiple futexes. Every futex_q_lock() acquires a reference on the global hash which is dropped later. If a rehash is in progress then the loop will block on mm_struct::futex_hash_bucket for the rehash to complete and this will lose the previously set task_struct::__state. Acquire a reference on the local hash to avoid blocking on mm_struct::futex_hash_bucket. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-9-bigeasy@linutronix.de
2025-05-03 | futex: Create private_hash() get/put class | Peter Zijlstra | 2 | -0/+20
This gets us:

	fph = futex_private_hash(key) /* gets fph and inc users */
	futex_private_hash_get(fph) /* inc users */
	futex_private_hash_put(fph) /* dec users */

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-8-bigeasy@linutronix.de
2025-05-03 | futex: Create futex_hash() get/put class | Peter Zijlstra | 5 | -24/+30
This gets us:

	hb = futex_hash(key) /* gets hb and inc users */
	futex_hash_get(hb) /* inc users */
	futex_hash_put(hb) /* dec users */

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-7-bigeasy@linutronix.de
2025-05-03 | futex: Create hb scopes | Peter Zijlstra | 4 | -474/+493
Create explicit scopes for hb variables; almost pure re-indent. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-6-bigeasy@linutronix.de
2025-05-03 | futex: Pull futex_hash() out of futex_q_lock() | Peter Zijlstra | 4 | -10/+8
Getting the hash bucket and queuing it are two distinct actions. In light of wanting to add a put hash bucket function later, untangle them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-5-bigeasy@linutronix.de
2025-05-03 | futex: Move futex_queue() into futex_wait_setup() | Peter Zijlstra | 3 | -40/+41
futex_wait_setup() has a weird calling convention in order to return hb to use as an argument to futex_queue(). Mostly such that requeue can have an extra test in between. Reorder code a little to get rid of this and keep the hb usage inside futex_wait_setup(). [bigeasy: fixes] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250416162921.513656-4-bigeasy@linutronix.de
2025-05-02 | tracing: Do not take trace_event_sem in print_event_fields() | Steven Rostedt | 1 | -2/+2
On some paths in print_event_fields() it takes the trace_event_sem for read, even though it should always be held when the function is called. Remove the taking of that mutex and add a lockdep_assert_held_read() to make sure the trace_event_sem is held when print_event_fields() is called. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250501224128.0b1f0571@batman.local.home Fixes: 80a76994b2d88 ("tracing: Add "fields" option to show raw trace event fields") Reported-by: syzbot+441582c1592938fccf09@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/6813ff5e.050a0220.14dd7d.001b.GAE@google.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-02 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Jakub Kicinski | 9 | -47/+57
Cross-merge networking fixes after downstream PR (net-6.15-rc5). No conflicts or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-01 | tracing: Fix trace_adjust_address() when there is no modules in scratch area | Steven Rostedt | 1 | -1/+3
The function trace_adjust_address() is used to map the addresses of modules that are stored in the persistent memory and are also loaded in the current boot, in order to return the current address for the module. If there's only one module entry, it will simply use that, otherwise it performs a bsearch of the entry array to find the modules to offset with. The issue is when there are no modules in the array. The code does not account for that and ends up referencing the first element in the array, which does not exist, causing a crash. If nr_entries is zero, exit out early as if this was a core kernel address. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250501151909.65910359@gandalf.local.home Fixes: 35a380ddbc653 ("tracing: Show last module text symbols in the stacktrace") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-01 | bpf: Allow XDP dev-bound programs to perform XDP_REDIRECT into maps | Lorenzo Bianconi | 1 | -11/+16
In the current implementation, if the program is dev-bound to a specific device, it will not be possible to perform XDP_REDIRECT into a DEVMAP or CPUMAP even if the program is running in the driver NAPI context and it is not attached to any map entry. This seems in contrast with the explanation available in the bpf_prog_map_compatible routine. Fix the issue by introducing the __bpf_prog_map_compatible utility routine in order to avoid the bpf_prog_is_dev_bound() check when running bpf_check_tail_call() at program load time (bpf_prog_select_runtime()). Continue to forbid attaching a dev-bound program to XDP maps (BPF_MAP_TYPE_PROG_ARRAY, BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_CPUMAP). Fixes: 3d76a4d3d4e59 ("bpf: XDP metadata RX kfuncs") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
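For context, a minimal (hypothetical) XDP program of the kind this fix enables: dev-bound, not referenced by any program array, redirecting through a DEVMAP. The map name and sizing are made up for the example.

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	struct {
		__uint(type, BPF_MAP_TYPE_DEVMAP);
		__uint(max_entries, 8);
		__type(key, __u32);
		__type(value, __u32);
	} tx_ports SEC(".maps");        /* illustrative map name */

	SEC("xdp")
	int redirect_via_devmap(struct xdp_md *ctx)
	{
		/* Redirect via entry 0; fall back to XDP_PASS if the entry
		 * is not populated. */
		return bpf_redirect_map(&tx_ports, 0, XDP_PASS);
	}

	char LICENSE[] SEC("license") = "GPL";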
2025-05-01 | ftrace: Fix NULL memory allocation check | Colin Ian King | 1 | -1/+1
The check for a failed memory allocation is incorrectly checking the wrong level of pointer indirection by checking !filter_hash rather than !*filter_hash. Fix this. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250422221335.89896-1-colin.i.king@gmail.com Fixes: 0ae6b8ce200d ("ftrace: Fix accounting of subop hashes") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
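The bug class in miniature (names illustrative, not the exact ftrace hunk): with a struct ftrace_hash **filter_hash argument, the allocation result lives in *filter_hash, so that is what must be NULL-checked.

	*filter_hash = alloc_ftrace_hash(size_bits);
	if (!*filter_hash)              /* correct: tests the allocation result */
		return -ENOMEM;
	/* if (!filter_hash) ... */     /* wrong: tests the (always valid) address
	                                 * of the pointer, so OOM goes unnoticed */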
2025-05-01 | tracing: Fix oob write in trace_seq_to_buffer() | Jeongjun Park | 1 | -2/+3
syzbot reported this bug:

==================================================================
BUG: KASAN: slab-out-of-bounds in trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
BUG: KASAN: slab-out-of-bounds in tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
Write of size 4507 at addr ffff888032b6b000 by task syz.2.320/7260

CPU: 1 UID: 0 PID: 7260 Comm: syz.2.320 Not tainted 6.15.0-rc1-syzkaller-00301-g3bde70a2c827 #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:408 [inline]
 print_report+0xc3/0x670 mm/kasan/report.c:521
 kasan_report+0xe0/0x110 mm/kasan/report.c:634
 check_region_inline mm/kasan/generic.c:183 [inline]
 kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189
 __asan_memcpy+0x3c/0x60 mm/kasan/shadow.c:106
 trace_seq_to_buffer kernel/trace/trace.c:1830 [inline]
 tracing_splice_read_pipe+0x6be/0xdd0 kernel/trace/trace.c:6822
 ....
==================================================================

It has been reported that trace_seq_to_buffer() tries to copy more data than PAGE_SIZE to buf. Therefore, to prevent this, we should use the smaller of trace_seq_used(&iter->seq) and PAGE_SIZE as an argument. Link: https://lore.kernel.org/20250422113026.13308-1-aha310510@gmail.com Reported-by: syzbot+c8cd2d2c412b868263fb@syzkaller.appspotmail.com Fixes: 3c56819b14b0 ("tracing: splice support for tracing_pipe") Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
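The shape of the fix, as a sketch of the affected call site in tracing_splice_read_pipe() (simplified; the min() arguments follow the description above):

	/* Never ask trace_seq_to_buffer() to copy more than one page into the
	 * splice page. */
	ret = trace_seq_to_buffer(&iter->seq,
				  page_address(spd.pages[i]),
				  min((size_t)trace_seq_used(&iter->seq),
				      (size_t)PAGE_SIZE));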
2025-05-01 | bpf: Replace offsetof() with struct_size() | Thorsten Blum | 1 | -2/+3
Compared to offsetof(), struct_size() provides additional compile-time checks for structs with flexible arrays (e.g., __must_be_array()). No functional changes intended. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250428210638.30219-2-thorsten.blum@linux.dev
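A generic before/after illustration of this kind of conversion (not the exact bpf call site; the struct and variable names are made up):

	#include <linux/types.h>
	#include <linux/overflow.h>

	struct item_set {
		u32 cnt;
		u64 items[];                    /* flexible array member */
	};

	struct item_set *set;
	size_t count = 16;

	/* before: open-coded header size plus array size */
	size_t old_sz = offsetof(struct item_set, items) + count * sizeof(set->items[0]);

	/* after: equivalent here, plus __must_be_array() and overflow
	 * saturation at compile/run time */
	size_t new_sz = struct_size(set, items, count);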
2025-05-01 | sched_ext: Avoid NULL scx_root deref in __scx_exit() | Andrea Righi | 1 | -4/+13
A sched_ext scheduler may trigger __scx_exit() from a BPF timer callback, where scx_root may not be safely dereferenced. This can lead to a NULL pointer dereference as shown below (triggered by scx_tickless):

BUG: kernel NULL pointer dereference, address: 0000000000000330
...
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full)
RIP: 0010:__scx_exit+0x2b/0x190
...
Call Trace:
 <IRQ>
 scx_bpf_get_idle_smtmask+0x59/0x80
 bpf_prog_8320d4217989178c_dispatch_all_cpus+0x35/0x1b6
 ...
 bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264
 bpf_timer_cb+0x7a/0x140
 __hrtimer_run_queues+0x1f9/0x3a0
 hrtimer_run_softirq+0x8c/0xd0
 handle_softirqs+0xd3/0x3d0
 __irq_exit_rcu+0x9a/0xc0
 irq_exit_rcu+0xe/0x20

Fix this by checking for a valid scx_root and adding proper RCU protection. Fixes: 48e1267773866 ("sched_ext: Introduce scx_sched") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
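A hedged sketch of the guard described above; the actual __scx_exit() change may structure this differently:

	struct scx_sched *sch;

	rcu_read_lock();
	sch = rcu_dereference(scx_root);
	if (unlikely(!sch)) {
		/* Called (e.g. from a BPF timer) while no scheduler instance
		 * exists: nothing to exit. */
		rcu_read_unlock();
		return;
	}
	/* ... use sch ... */
	rcu_read_unlock();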
2025-05-01 | sched_ext: Add RCU protection to scx_root in DSQ iterator | Andrea Righi | 1 | -1/+1
Using a DSQ iterator from a timer callback can trigger the following lockdep splat when accessing scx_root:

=============================
WARNING: suspicious RCU usage
6.14.0-virtme #1 Not tainted
-----------------------------
kernel/sched/ext.c:6907 suspicious rcu_dereference_check() usage!

other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
no locks held by swapper/0/0.

stack backtrace:
CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.14.0-virtme #1 PREEMPT(full)
Sched_ext: tickless (enabled+all)
Call Trace:
 <IRQ>
 dump_stack_lvl+0x6f/0xb0
 lockdep_rcu_suspicious.cold+0x4e/0xa3
 bpf_iter_scx_dsq_new+0xb1/0xd0
 bpf_prog_63f4fd1bccc101e7_dispatch_cpu+0x3e/0x156
 bpf_prog_8320d4217989178c_dispatch_all_cpus+0x153/0x1b6
 bpf_prog_97f847d871513f95_sched_timerfn+0x4c/0x264
 ? hrtimer_run_softirq+0x4f/0xd0
 bpf_timer_cb+0x7a/0x140
 __hrtimer_run_queues+0x1f9/0x3a0
 hrtimer_run_softirq+0x8c/0xd0
 handle_softirqs+0xd3/0x3d0
 __irq_exit_rcu+0x9a/0xc0
 irq_exit_rcu+0xe/0x20
 sysvec_apic_timer_interrupt+0x73/0x80

Add a proper dereference check to explicitly validate RCU-safe access to scx_root from rcu_read_lock() contexts and also from contexts that hold rcu_read_lock_bh(), such as timer callbacks. Fixes: cdf5a6faa8cf0 ("sched_ext: Move dsq_hash into scx_sched") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-01 | genirq/msi: Prevent NULL pointer dereference in msi_domain_debug_show() | Andrew Jones | 1 | -1/+1
irq_domain_debug_show_one() calls msi_domain_debug_show() with a non-NULL domain pointer and a NULL irq_data pointer. irq_debug_show_data() calls it with a NULL domain pointer. The domain pointer is not used, but the irq_data pointer is required to be non-NULL and lacks a NULL pointer check. Add the missing NULL pointer check to ensure there is a non-NULL irq_data pointer in msi_domain_debug_show() before dereferencing it. [ tglx: Massaged change log ] Fixes: 01499ae673dc ("genirq/msi: Expose MSI message data in debugfs") Signed-off-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250430124836.49964-2-ajones@ventanamicro.com
2025-04-30 | kbuild: require gcc-8 and binutils-2.30 | Arnd Bergmann | 1 | -4/+0
Commit a3e8fe814ad1 ("x86/build: Raise the minimum GCC version to 8.1") raised the minimum compiler version as enforced by Kbuild to gcc-8.1 and clang-15 for x86. This is actually the same gcc version that has been discussed as the minimum for all architectures several times in the past, with little objection. A previous concern was the kernel for SLE15-SP7 needing to be built with gcc-7. As this ended up still using linux-6.4 and there is no plan for an SP8, this is no longer a problem. Change it for all architectures and adjust the documentation accordingly. A few version checks can be removed in the process. The binutils version 2.30 is the lowest version used in combination with gcc-8 on common distros, so use that as the corresponding minimum. Link: https://lore.kernel.org/lkml/20240925150059.3955569-32-ardb+git@google.com/ Link: https://lore.kernel.org/lkml/871q7yxrgv.wl-tiwai@suse.de/ Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2025-04-30 | Merge tag 'modules-6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux | Linus Torvalds | 1 | -25/+22
Pull modules fixes from Petr Pavlu: "A single series to properly handle the module_kobject creation. This fixes a problem with missing /sys/module/<module>/drivers for built-in modules"

* tag 'modules-6.15-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
  drivers: base: handle module_kobject creation
  kernel: globalize lookup_or_create_module_kobject()
  kernel: refactor lookup_or_create_module_kobject()
  kernel: param: rename locate_module_kobject
2025-04-30 | sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks | John Stultz | 1 | -0/+6
It was reported that in 6.12, smpboot_create_threads() was taking much longer than in 6.6. I narrowed down the call path to: smpboot_create_threads() -> kthread_create_on_cpu() -> kthread_bind() -> __kthread_bind_mask() -> wait_task_inactive() Where in wait_task_inactive() we were regularly hitting the queued case, which sets a 1 tick timeout, which when called multiple times in a row, accumulates quickly into a long delay. I noticed disabling the DELAY_DEQUEUE sched feature recovered the performance, and it seems the newly created tasks are usually sched_delayed and left on the runqueue. So in wait_task_inactive(), when we see that p->se.sched_delayed is set, manually dequeue the sched_delayed task with DEQUEUE_DELAYED, so we don't have to constantly wait a tick. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: peter-yc.chang@mediatek.com Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250429150736.3778580-1-jstultz@google.com
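Conceptually, the change looks like the following sketch inside wait_task_inactive() (locking and surrounding logic elided; the real patch may pass additional DEQUEUE_* flags):

	if (unlikely(p->se.sched_delayed)) {
		/* Complete the delayed dequeue now instead of leaving the task
		 * on the runqueue and waiting a tick per retry. */
		dequeue_task(rq, p, DEQUEUE_DELAYED);
	}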
2025-04-30 | alarmtimer: Switch spin_{lock,unlock}_irqsave() to guards | Su Hui | 1 | -48/+30
Use guard()/scoped_guard() to simplify the code. Using guard() to remove the 'goto unlock' label is especially neat. [ tglx: Brought back the scoped_guard()'s which were dropped in v2 and simplified alarmtimer_rtc_add_device() ] Signed-off-by: Su Hui <suhui@nfschina.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250430032734.2079290-4-suhui@nfschina.com
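A generic before/after picture of this kind of conversion (illustrative, not the exact alarmtimer hunk):

	/* before */
	unsigned long flags;

	spin_lock_irqsave(&lock, flags);
	if (!ready)
		goto unlock;
	do_work();
unlock:
	spin_unlock_irqrestore(&lock, flags);

	/* after: guard() releases the lock automatically at scope exit,
	 * so the label and the explicit unlock disappear */
	guard(spinlock_irqsave)(&lock);
	if (!ready)
		return;
	do_work();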
2025-04-30 | alarmtimer: Remove dead return value in clock2alarm() | Su Hui | 1 | -3/+3
'clockid' can only be ALARM_REALTIME and ALARM_BOOTTIME. It's impossible to return -1 and callers never check the return value. Only alarm_clock_get_timespec(), alarm_clock_get_ktime(), alarm_timer_create() and alarm_timer_nsleep() call clock2alarm(). These callers use clockid_to_kclock() to get 'struct k_clock', which ensures that clock2alarm() never returns -1. Remove the impossible -1 return value, and add a warning to notify about any future misuse of this function. Signed-off-by: Su Hui <suhui@nfschina.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250430032734.2079290-3-suhui@nfschina.com
2025-04-30 | time/jiffies: Change register_refined_jiffies() to void __init | Su Hui | 1 | -4/+1
register_refined_jiffies() is only used in setup code and always returns 0. Mark it as __init to save some bytes and change it to void. Signed-off-by: Su Hui <suhui@nfschina.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250430032734.2079290-2-suhui@nfschina.com
2025-04-29 | ima: kexec: move IMA log copy from kexec load to execute | Steven Chen | 1 | -1/+10
The IMA log is currently copied to the new kernel during kexec 'load' using ima_dump_measurement_list(). However, the IMA measurement list copied at kexec 'load' may result in the loss of IMA measurement records that occur after the kexec 'load'. Move the IMA measurement list log copy from kexec 'load' to 'execute'. Make the kexec_segment_size variable a local static variable within the file, so it can be accessed during both kexec 'load' and 'execute'. Define kexec_post_load() as a wrapper for calling ima_kexec_post_load() and machine_kexec_post_load(). Replace the existing direct call to machine_kexec_post_load() with kexec_post_load(). When there is insufficient memory to copy all the measurement logs, copy as much of the measurement list as possible. Co-developed-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Signed-off-by: Steven Chen <chenste@linux.microsoft.com> Tested-by: Stefan Berger <stefanb@linux.ibm.com> # ppc64/kvm Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
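A hedged sketch of the wrapper described above; the signature is approximated and the real version may be conditional on configuration and kexec type:

	static int kexec_post_load(struct kimage *image, unsigned long flags)
	{
		/* Let IMA do its post-load work for the carried-over log
		 * segment, then call the architecture hook as before. */
		ima_kexec_post_load(image);
		return machine_kexec_post_load(image);
	}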
2025-04-29 | ima: kexec: skip IMA segment validation after kexec soft reboot | Steven Chen | 1 | -0/+22
Currently, the function kexec_calculate_store_digests() calculates and stores the digest of the segment during the kexec_file_load syscall, where the IMA segment is also allocated. Later, the IMA segment will be updated with the measurement log at the kexec execute stage when a kexec reboot is initiated. Therefore, the digests should be updated for the IMA segment in the normal case. The problem is that the content of memory segments carried over to the new kernel during the kexec system call can be changed at kexec 'execute' stage, but the size and the location of the memory segments cannot be changed at kexec 'execute' stage. To address this, skip the calculation and storage of the digest for the IMA segment in kexec_calculate_store_digests() so that it is not added to the purgatory_sha_regions. With this change, the IMA segment is not included in the digest calculation, storage, and verification. Cc: Eric Biederman <ebiederm@xmission.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Co-developed-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Signed-off-by: Steven Chen <chenste@linux.microsoft.com> Reviewed-by: Stefan Berger <stefanb@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Tested-by: Stefan Berger <stefanb@linux.ibm.com> # ppc64/kvm [zohar@linux.ibm.com: Fixed Signed-off-by tag to match author's email ] Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2025-04-29 | kexec: define functions to map and unmap segments | Steven Chen | 1 | -0/+54
Implement kimage_map_segment() to enable IMA to map the measurement log list to the kimage structure during the kexec 'load' stage. This function gathers the source pages within the specified address range, and maps them to a contiguous virtual address range. This is a preparation for later usage. Implement kimage_unmap_segment() for unmapping segments using vunmap(). Cc: Eric Biederman <ebiederm@xmission.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Co-developed-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Signed-off-by: Tushar Sugandhi <tusharsu@linux.microsoft.com> Signed-off-by: Steven Chen <chenste@linux.microsoft.com> Acked-by: Baoquan He <bhe@redhat.com> Tested-by: Stefan Berger <stefanb@linux.ibm.com> # ppc64/kvm Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
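A hedged sketch of the mapping side (signature, page gathering and error handling approximated): collect the struct page pointers backing the segment's source pages and vmap() them into one contiguous kernel virtual range; kimage_unmap_segment() is then essentially a vunmap().

	void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size)
	{
		unsigned long npages = PFN_UP(size);
		struct page **pages;
		void *vaddr;

		pages = kmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
		if (!pages)
			return NULL;

		/* ... walk the segment's source pages and fill pages[] for the
		 * range [addr, addr + size) ... */

		vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
		kfree(pages);	/* page tables are set up; the array is no longer needed */
		return vaddr;
	}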
2025-04-29 | sched_ext: Clean up SCX_EXIT_NONE handling in scx_disable_workfn() | Tejun Heo | 1 | -6/+2
With the global states and disable machinery moved into scx_sched, scx_disable_workfn() can only be scheduled and run for the specific scheduler instance. This makes it impossible for scx_disable_workfn() to see SCX_EXIT_NONE. Turn that condition into WARN_ON_ONCE(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 | sched_ext: Move disable machinery into scx_sched | Tejun Heo | 1 | -48/+25
Because disable can be triggered from any place and the scheduler cannot be trusted, disable path uses an irq_work to bounce and a kthread_work which is executed on an RT helper kthread to perform disable. These must be per scheduler instance to guarantee forward progress. Move them into scx_sched. - If an scx_sched is accessible, its helper kthread is always valid making the `helper` check in schedule_scx_disable_work() unnecessary. As the function becomes trivial after the removal of the test, inline it. - scx_create_rt_helper() has only one user - creation of the disable helper kthread. Inline it into scx_alloc_and_add_sched(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 | sched_ext: Move event_stats_cpu into scx_sched | Tejun Heo | 1 | -42/+58
The event counters are going to become per scheduler instance. Move event_stats_cpu into scx_sched. - [__]scx_add_event() are updated to take @sch. While at it, add missing parentheses around @cnt expansions. - scx_read_events() is updated to take @sch. - scx_bpf_events() accesses scx_root under RCU read lock. v2: - Replace stale scx_bpf_get_event_stat() reference in a comment with scx_bpf_events(). - Trivial goto label rename. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 | sched_ext: Factor out scx_read_events() | Tejun Heo | 1 | -19/+26
In preparation for moving event_stats_cpu into scx_sched, factor scx_read_events() out of scx_bpf_events() and update the in-kernel users - scx_attr_events_show() and scx_dump_state() - to use scx_read_events() instead of scx_bpf_events(). No observable behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 | sched_ext: Relocate scx_event_stats definition | Tejun Heo | 1 | -57/+57
In preparation for moving event_stats_cpu into scx_sched, move the scx_event_stats definitions above the scx_sched definition. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 | sched_ext: Move global_dsqs into scx_sched | Tejun Heo | 1 | -46/+49
Global DSQs are going to become per scheduler instance. Move global_dsqs into scx_sched. find_global_dsq() already takes a task_struct pointer as an argument and should later be able to determine the scx_sched to use from that. For now, assume scx_root. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 | sched_ext: Move dsq_hash into scx_sched | Tejun Heo | 1 | -40/+82
User DSQs are going to become per scheduler instance. Move dsq_hash into scx_sched. This shifts the code that assumes scx_root to be the only scx_sched instance up the call stack but doesn't remove them yet. v2: Add missing rcu_read_lock() in scx_bpf_destroy_dsq() as per Andrea. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 | sched_ext: Factor out scx_alloc_and_add_sched() | Tejun Heo | 1 | -22/+36
More will be moved into scx_sched. Factor out the allocation and kobject addition path into scx_alloc_and_add_sched(). Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 | sched_ext: Inline create_dsq() into scx_bpf_create_dsq() | Tejun Heo | 1 | -24/+18
create_dsq() is only used by scx_bpf_create_dsq() and the separation gets in the way of making dsq_hash per scx_sched. Inline it into scx_bpf_create_dsq(). While at it, add unlikely() around SCX_DSQ_FLAG_BUILTIN test. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 | sched_ext: Use dynamic allocation for scx_sched | Tejun Heo | 1 | -66/+91
To prepare for supporting multiple schedulers, make scx_sched allocated dynamically. scx_sched->kobj is now an embedded field and the kobj's lifetime determines the lifetime of the containing scx_sched. - Enable path is updated so that kobj init and addition are performed later. - scx_sched freeing is initiated in scx_kobj_release() and also goes through an rcu_work so that scx_root can be accessed from an unsynchronized path - scx_disable(). - sched_ext_ops->priv is added and used to point to scx_sched instance created for the ops instance. This is used by bpf_scx_unreg() to determine the scx_sched instance to disable and put. No behavior changes intended. v2: Andrea reported kernel oops due to scx_bpf_unreg() trying to deref NULL scx_root after scheduler init failure. sched_ext_ops->priv added so that scx_bpf_unreg() can always find the scx_sched instance to unregister even if it failed early during init. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 | sched_ext: Avoid NULL scx_root deref through SCX_HAS_OP() | Tejun Heo | 1 | -1/+10
SCX_HAS_OP() tests the scx_root->has_op bitmap. The bitmap is currently in a statically allocated struct scx_sched and initialized while loading the BPF scheduler and cleared while unloading, and thus can be tested anytime. However, scx_root will be switched to dynamic allocation and thus won't always be dereferenceable. Most usages of SCX_HAS_OP() are already protected by scx_enabled() either directly or indirectly (e.g. through a task which is on SCX). However, there are a couple places that could try to dereference NULL scx_root. Update them so that scx_root is guaranteed to be valid before SCX_HAS_OP() is called. - In handle_hotplug(), test whether scx_root is NULL before doing anything else. This is safe because scx_root updates will be protected by cpus_read_lock(). - In scx_tg_offline(), test scx_cgroup_enabled before invoking SCX_HAS_OP(), which should guarantee that scx_root won't turn NULL. This is also in line with other cgroup operations. As the code path is synchronized against scx_cgroup_init/exit() through scx_cgroup_rwsem, this shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 | sched_ext: Introduce scx_sched | Tejun Heo | 2 | -115/+131
To support multiple scheduler instances, collect some of the global variables that should be specific to a scheduler instance into the new struct scx_sched. scx_root is the root scheduler instance and points to a static instance of struct scx_sched. Except for an extra dereference through the scx_root pointer, this patch makes no functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Acked-by: Changwoo Min <changwoo@igalia.com>
2025-04-29 | Merge branch 'for-6.15-fixes' into for-6.16 | Tejun Heo | 1 | -8/+2
To receive e38be1c7647c ("sched_ext: Fix rq lock state in hotplug ops") to avoid conflicts with scx_sched related patches pending for for-6.16.
2025-04-29 | sched_ext: Fix rq lock state in hotplug ops | Andrea Righi | 1 | -2/+2
The ops.cpu_online() and ops.cpu_offline() callbacks incorrectly assume that the rq involved in the operation is locked, which is not the case during hotplug, triggering the following warning: WARNING: CPU: 1 PID: 20 at kernel/sched/sched.h:1504 handle_hotplug+0x280/0x340 Fix by not tracking the target rq as locked in the context of ops.cpu_online() and ops.cpu_offline(). Fixes: 18853ba782bef ("sched_ext: Track currently locked rq") Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrea Righi <arighi@nvidia.com> Tested-by: Changwoo Min <changwoo@igalia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-29 | entry: Inline syscall_exit_to_user_mode() | Charlie Jenkins | 1 | -48/+1
Similar to commit 221a164035fd ("entry: Move syscall_enter_from_user_mode() to header file"), move syscall_exit_to_user_mode() to the header file as well. Testing was done with the byte-unixbench syscall benchmark (which calls getpid) and QEMU. On riscv I measured a 7.09246% improvement, on x86 a 2.98843% improvement, on loongarch a 6.07954% improvement, and on s390 a 11.1328% improvement. The Intel bot also reported "kernel test robot noticed a 1.9% improvement of stress-ng.seek.ops_per_sec". Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Link: https://lore.kernel.org/all/20250320-riscv_optimize_entry-v6-4-63e187e26041@rivosinc.com Link: https://lore.kernel.org/linux-riscv/202502051555.85ae6844-lkp@intel.com/
2025-04-28 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc4 | Alexei Starovoitov | 11 | -67/+73
Cross-merge bpf and other fixes after downstream PRs. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-28 | PM: EM: Fix potential division-by-zero error in em_compute_costs() | Yaxiong Tian | 1 | -0/+4
When the device is of a non-CPU type, table[i].performance won't be initialized in the previous em_init_performance(), resulting in division by zero when calculating costs in em_compute_costs(). Since the 'cost' algorithm is only used for EAS energy efficiency calculations and is currently not utilized by other device drivers, we should add the _is_cpu_device(dev) check to prevent this division-by-zero issue. Fixes: 1b600da51073 ("PM: EM: Optimize em_cpu_energy() and remove division") Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com> Link: https://patch.msgid.link/tencent_7F99ED4767C1AF7889D0D8AD50F34859CE06@qq.com Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-28 | timekeeping: Prevent coarse clocks going backwards | Thomas Gleixner | 2 | -10/+44
Lei Chen raised an issue with CLOCK_MONOTONIC_COARSE seeing time inconsistencies. Lei tracked down that this was being caused by the adjustment: tk->tkr_mono.xtime_nsec -= offset; which is made to compensate for the unaccumulated cycles in offset when the multiplicator is adjusted forward, so that the non-_COARSE clockids don't see inconsistencies. However, the _COARSE clockid getter functions use the adjusted xtime_nsec value directly and do not compensate the negative offset via the clocksource delta multiplied with the new multiplicator. In that case the caller can observe time going backwards in consecutive calls. By design, this negative adjustment should be fine, because the logic run from timekeeping_adjust() is done after it accumulated approximately multiplicator * interval_cycles into xtime_nsec. The accumulated value is always larger than the mult_adj * offset value, which is subtracted from xtime_nsec. Both operations are done together under the tk_core.lock, so the net change to xtime_nsec is always positive. However, do_adjtimex() calls into timekeeping_advance() as well, to apply the NTP frequency adjustment immediately. In this case, timekeeping_advance() does not return early when the offset is smaller than interval_cycles. In that case there is no time accumulated into xtime_nsec. But the subsequent call into timekeeping_adjust(), which modifies the multiplicator, subtracts from xtime_nsec to correct for the new multiplicator. Here because there was no accumulation, xtime_nsec becomes smaller than before, which opens a window up to the next accumulation, where the _COARSE clockid getters, which don't compensate for the offset, can observe the inconsistency. An attempt was made to fix this by forwarding the timekeeper in the case that adjtimex() adjusts the multiplier, which resets the offset to zero: 757b000f7b93 ("timekeeping: Fix possible inconsistencies in _COARSE clockids") That works correctly, but unfortunately causes a regression on the adjtimex() side. There are two issues: 1) The forwarding of the base time moves the update out of the original period and establishes a new one. 2) The clearing of the accumulated NTP error is changing the behaviour as well. User-space expects that multiplier/frequency updates are in effect, when the syscall returns, so delaying the update to the next tick is not solving the problem either. Commit 757b000f7b93 was reverted so that the established expectations of user space implementations (ntpd, chronyd) are restored, but that obviously brought the inconsistencies back. One of the initial approaches to fix this was to establish a separate storage for the coarse time getter nanoseconds part by calculating it from the offset. That was dropped on the floor because not having yet another state to maintain was simpler. But given the result of the above exercise, this solution turns out to be the right one. Bring it back in a slightly modified form. Thus introduce timekeeper::coarse_nsec and store that nanoseconds part in it, switch the time getter functions and the VDSO update to use that value. coarse_nsec is set on operations which forward or initialize the timekeeper and after time was accumulated during a tick. If there is no accumulation the timestamp is unchanged. This leaves the adjtimex() behaviour unmodified and prevents coarse time from going backwards.
[ jstultz: Simplified the coarse_nsec calculation and kept behavior so coarse clockids aren't adjusted on each inter-tick adjtimex call, slightly reworked the comments and commit message ] Fixes: da15cfdae033 ("time: Introduce CLOCK_REALTIME_COARSE") Reported-by: Lei Chen <lei.chen@smartx.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/all/20250419054706.2319105-1-jstultz@google.com Closes: https://lore.kernel.org/lkml/20250310030004.3705801-1-lei.chen@smartx.com/
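A hedged sketch of what a coarse getter looks like with the new field (simplified from timekeeping.c; the usual seqcount retry loop shown):

	void ktime_get_coarse_real_ts64(struct timespec64 *ts)
	{
		struct timekeeper *tk = &tk_core.timekeeper;
		unsigned int seq;

		do {
			seq = read_seqcount_begin(&tk_core.seq);
			ts->tv_sec = tk->xtime_sec;
			/* Snapshot taken at accumulation/forwarding time; not
			 * adjusted backwards between ticks. */
			ts->tv_nsec = tk->coarse_nsec;
		} while (read_seqcount_retry(&tk_core.seq, seq));
	}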
2025-04-26 | Merge tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -3/+1
Pull scheduler fix from Ingo Molnar: "Fix sporadic crashes in dequeue_entities() due to ... bad math. [ Arguably if pick_eevdf()/pick_next_entity() was less trusting of complex math being correct it could have de-escalated a crash into a warning, but that's for a different patch ]"

* tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
2025-04-26 | Merge tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -2/+2
Pull misc perf events fixes from Ingo Molnar:
 - Use POLLERR for events in error state, instead of the ambiguous POLLHUP error value
 - Fix non-sampling (counting) events on certain x86 platforms

* tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix non-sampling (counting) events on certain x86 platforms
  perf/core: Change to POLLERR for pinned events with error
2025-04-26 | sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash | Omar Sandoval | 1 | -3/+1
There is a code path in dequeue_entities() that can set the slice of a sched_entity to U64_MAX, which sometimes results in a crash. The offending case is when dequeue_entities() is called to dequeue a delayed group entity, and then the entity's parent's dequeue is delayed. In that case:

 1. In the if (entity_is_task(se)) else block at the beginning of dequeue_entities(), slice is set to cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
 2. The first for_each_sched_entity() loop dequeues the entity.
 3. If the entity was its parent's only child, then the next iteration tries to dequeue the parent.
 4. If the parent's dequeue needs to be delayed, then it breaks from the first for_each_sched_entity() loop _without updating slice_.
 5. The second for_each_sched_entity() loop sets the parent's ->slice to the saved slice, which is still U64_MAX.

This throws off subsequent calculations with potentially catastrophic results. A manifestation we saw in production was:

 6. In update_entity_lag(), se->slice is used to calculate limit, which ends up as a huge negative number.
 7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit is negative, vlag > limit, so se->vlag is set to the same huge negative number.
 8. In place_entity(), se->vlag is scaled, which overflows and results in another huge (positive or negative) number.
 9. The adjusted lag is subtracted from se->vruntime, which increases or decreases se->vruntime by a huge number.
 10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which incorrectly returns false because the vruntime is so far from the other vruntimes on the queue, causing the (vruntime - cfs_rq->min_vruntime) * load calculation to overflow.
 11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
 12. pick_next_entity() tries to dereference the return value of pick_eevdf() and crashes.

Dumping the cfs_rq states from the core dumps with drgn showed tell-tale huge vruntime ranges and bogus vlag values, and I also traced se->slice being set to U64_MAX on live systems (which was usually "benign" since the rest of the runqueue needed to be in a particular state to crash). Fix it in dequeue_entities() by always setting slice from the first non-empty cfs_rq. Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy") Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com