summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
4 daysMerge tag 'trace-v7.1-rc1' of ↵HEADmasterLinus Torvalds4-6/+14
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix inverted check of registering the stats for branch tracing When calling register_stat_tracer() which returns zero on success and negative on error, the callers were checking the return of zero as an error and printing a warning message. Because this was just a normal printk() message and not a WARN(), it wasn't caught in any testing. Fix the check to print the warning message when an error actually happens. - Fix a typo in a comment in tracepoint.h - Limit the size of event probes to 3K in size It is possible to create a dynamic event probe via the tracefs system that is greater than the max size of an event that the ring buffer can hold. This basically causes the event to become useless. Limit the size of an event probe to be 3K as that should be large enough to handle any dynamic events being created, and fits within the PAGE_SIZE sub-buffers of the ring buffer. * tag 'trace-v7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/probes: Limit size of event probe to 3K tracepoint: Fix typo in tracepoint.h comment tracing: branch: Fix inverted check on stat tracer registration
5 daystracing/probes: Limit size of event probe to 3KSteven Rostedt2-1/+9
There currently isn't a max limit an event probe can be. One could make an event greater than PAGE_SIZE, which makes the event useless because if it's bigger than the max event that can be recorded into the ring buffer, then it will never be recorded. A event probe should never need to be greater than 3K, so make that the max size. As long as the max is less than the max that can be recorded onto the ring buffer, it should be fine. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: 93ccae7a22274 ("tracing/kprobes: Support basic types on dynamic events") Link: https://patch.msgid.link/20260428122302.706610ba@gandalf.local.home Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
5 daysMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds194-2872/+2800
Pull kvm updates from Paolo Bonzini: "On top of a lot of Arm fixes, this includes a massive rename of types and variables in tools/testing/selftests/kvm - these were unnecessarily different from what the kernel uses, so they're being made consistent. arm64: - Allow tracing for non-pKVM, which was accidentally disabled when the series was merged - Rationalise the way the pKVM hypercall ranges are defined by using the same mechanism as already used for the vcpu_sysreg enum - Enforce that SMCCC function numbers relayed by the pKVM proxy are actually compliant with the specification - Fix a couple of feature to idreg mappings which resulted in the wrong sanitisation being applied - Fix the GICD_IIDR revision number field that could never been written correctly by userspace - Make kvm_vcpu_initialized() correctly use its parameter instead of relying on the surrounding context - Enforce correct ordering in __pkvm_init_vcpu(), plugging a potential pin leak at the same time - Move __pkvm_init_finalise() to a less dangerous spot, avoiding future problems - Restore functional userspace irqchip support after a four year breakage (last functional kernel was 5.18...) - Spelling fixes Selftests: - Rename types across all KVM selftests to more closely align with types used in the kernel: vm_vaddr_t -> gva_t vm_paddr_t -> gpa_t uint64_t -> u64 uint32_t -> u32 uint16_t -> u16 uint8_t -> u8 int64_t -> s64 int32_t -> s32 int16_t -> s16 int8_t -> s8 - Fix Loongarch compilation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (31 commits) KVM: selftests: Add check_steal_time_uapi() implementation for LoongArch KVM: arm64: Wake-up from WFI when iqrchip is in userspace KVM: arm64: Fix initialisation order in __pkvm_init_finalise() KVM: arm64: Fix pin leak and publication ordering in __pkvm_init_vcpu() KVM: arm64: Fix kvm_vcpu_initialized() macro parameter KVM: arm64: Fix FEAT_SPE_FnE to use PMSIDR_EL1.FnE, not PMSVer KVM: arm64: Fix typo in feature check comments KVM: arm64: Fix FEAT_Debugv8p9 to check DebugVer, not PMUVer KVM: arm64: Reject non compliant SMCCC function calls in pKVM KVM: arm64: vgic: Fix IIDR revision field extracted from wrong value KVM: selftests: Replace "paddr" with "gpa" throughout KVM: selftests: Replace "u64 nested_paddr" with "gpa_t l2_gpa" KVM: selftests: Replace "u64 gpa" with "gpa_t" throughout KVM: selftests: Replace "vaddr" with "gva" throughout KVM: selftests: Clarify that arm64's inject_uer() takes a host PA, not a guest PA KVM: selftests: Rename translate_to_host_paddr() => translate_hva_to_hpa() KVM: selftests: Rename vm_vaddr_populate_bitmap() => vm_populate_gva_bitmap() KVM: selftests: Rename vm_vaddr_unused_gap() => vm_unused_gva_gap() KVM: selftests: Drop "vaddr_" from APIs that allocate memory for a given VM KVM: selftests: Use u8 instead of uint8_t ...
6 daysMerge tag 'sched_ext-for-7.1-rc1-fixes' of ↵Linus Torvalds11-150/+436
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: "The merge window pulled in the cgroup sub-scheduler infrastructure, and new AI reviews are accelerating bug reporting and fixing - hence the larger than usual fixes batch: - Use-after-frees during scheduler load/unload: - The disable path could free the BPF scheduler while deferred irq_work / kthread work was still in flight - cgroup setter callbacks read the active scheduler outside the rwsem that synchronizes against teardown Fix both, and reuse the disable drain in the enable error paths so the BPF JIT page can't be freed under live callbacks. - Several BPF op invocations didn't tell the framework which runqueue was already locked, so helper kfuncs that re-acquire the runqueue by CPU could deadlock on the held lock Fix the affected callsites, including recursive parent-into-child dispatch. - The hardlockup notifier ran from NMI but eventually took a non-NMI-safe lock. Bounce it through irq_work. - A handful of bugs in the new sub-scheduler hierarchy: - helper kfuncs hard-coded the root instead of resolving the caller's scheduler - the enable error path tried to disable per-task state that had never been initialized, and leaked cpus_read_lock on the way out - a sysfs object was leaked on every load/unload - the dispatch fast-path used the root scheduler instead of the task's - a couple of CONFIG #ifdef guards were misclassified - Verifier-time hardening: BPF programs of unrelated struct_ops types (e.g. tcp_congestion_ops) could call sched_ext kfuncs - a semantic bug and, once sub-sched was enabled, a KASAN out-of-bounds read. Now rejected at load. Plus a few NULL and cross-task argument checks on sched_ext kfuncs, and a selftest covering the new deny. - rhashtable (Herbert): restore the insecure_elasticity toggle and bounce the deferred-resize kick through irq_work to break a lock-order cycle observable from raw-spinlock callers. sched_ext's scheduler-instance hash is the first user of both. - The bypass-mode load balancer used file-scope cpumasks; with multiple scheduler instances now possible, those raced. Move to per-instance cpumasks, plus a follow-up to skip tasks whose recorded CPU is stale relative to the new owning runqueue. - Smaller fixes: - a dispatch queue's first-task tracking misbehaved when a parked iterator cursor sat in the list - the runqueue's next-class wasn't promoted on local-queue enqueue, leaving an SCX task behind RT in edge cases - the reference qmap scheduler stopped erroring on legitimate cross-scheduler task-storage misses" * tag 'sched_ext-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: (26 commits) sched_ext: Fix scx_flush_disable_work() UAF race sched_ext: Call wakeup_preempt() in local_dsq_post_enq() sched_ext: Release cpus_read_lock on scx_link_sched() failure in root enable sched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtime sched_ext: Refuse cross-task select_cpu_from_kfunc calls sched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHED sched_ext: Make bypass LB cpumasks per-scheduler sched_ext: Pass held rq to SCX_CALL_OP() for core_sched_before sched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_task sched_ext: Save and restore scx_locked_rq across SCX_CALL_OP sched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() FIFO-tail sched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / scx_bpf_dsq_nr_queued() sched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup setters sched_ext: Don't disable tasks in scx_sub_enable_workfn() abort path sched_ext: Skip tasks with stale task_rq in bypass_lb_cpu() sched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_new sched_ext: Unregister sub_kset on scheduler disable sched_ext: Defer scx_hardlockup() out of NMI sched_ext: sync disable_irq_work in bpf_scx_unreg() sched_ext: Fix local_dsq_post_enq() to use task's scheduler in sub-sched ...
6 daystracepoint: Fix typo in tracepoint.h commentSheng Che Peng1-1/+1
Change "my" to "may" in the description of subsystem configurations. Link: https://patch.msgid.link/20260422021819.1788091-1-synte4028@gmail.com Signed-off-by: Sheng Che Peng <synte4028@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
6 daystracing: branch: Fix inverted check on stat tracer registrationBreno Leitao1-4/+4
init_annotated_branch_stats() and all_annotated_branch_stats() check the return value of register_stat_tracer() with "if (!ret)", but register_stat_tracer() returns 0 on success and a negative errno on failure. The inverted check causes the warning to be printed on every successful registration, e.g.: Warning: could not register annotated branches stats while leaving real failures silent. The initcall also returned a hard-coded 1 instead of the actual error. Invert the check and propagate ret so that the warning fires on real errors and the initcall reports the correct status. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: https://patch.msgid.link/20260420-tracing-v1-1-d8f4cd0d6af1@debian.org Fixes: 002bb86d8d42 ("tracing/ftrace: separate events tracing and stats tracing engine") Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
6 dayssched_ext: Fix scx_flush_disable_work() UAF raceCheng-Yang Chou1-2/+7
scx_flush_disable_work() calls irq_work_sync() followed by kthread_flush_work() to ensure that the disable kthread work has fully completed before bpf_scx_unreg() frees the SCX scheduler. However, a concurrent scx_vexit() (e.g., triggered by a watchdog stall) creates a race window between scx_claim_exit() and irq_work_queue(): CPU A (scx_vexit (watchdog)) CPU B (bpf_scx_unreg) ---- ---- scx_claim_exit() atomic_try_cmpxchg(NONE->kind) stack_trace_save() vscnprintf() scx_disable() scx_claim_exit() -> FAIL scx_flush_disable_work() irq_work_sync() // no-op: not queued yet kthread_flush_work() // no-op: not queued yet kobject_put(&sch->kobj) -> free %sch irq_work_queue() -> UAF on %sch scx_disable_irq_workfn() kthread_queue_work() -> UAF The root cause is that CPU B's scx_flush_disable_work() returns after syncing an irq_work that has not yet been queued, while CPU A is still executing the code between scx_claim_exit() and irq_work_queue(). Loop until exit_kind reaches SCX_EXIT_DONE or SCX_EXIT_NONE, draining disable_irq_work and disable_work in each pass. This ensures that any work queued after the previous check is caught, while also correctly handling cases where no disable was triggered (e.g., the scx_sub_enable_workfn() abort path). Fixes: 510a27055446 ("sched_ext: sync disable_irq_work in bpf_scx_unreg()") Reported-by: https://sashiko.dev/#/patchset/20260424100221.32407-1-icheng%40nvidia.com Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
6 dayssched_ext: Call wakeup_preempt() in local_dsq_post_enq()Kuba Piecuch1-5/+39
There are several edge cases (see linked thread) where an IMMED task can be left lingering on a local DSQ if an RT task swoops in at the wrong time. All of these edge cases are due to rq->next_class being idle even after dispatching a task to rq's local DSQ. We should bump rq->next_class to &ext_sched_class as soon as we've inserted a task into the local DSQ. To optimize the common case of rq->next_class == &ext_sched_class, only call wakeup_preempt() if rq->next_class is below EXT. If next_class is EXT or above, wakeup_preempt() is a no-op anyway. This lets us also simplify the preempt_curr() logic a bit since wakeup_preempt() will call preempt_curr() for us if next_class is below EXT. Link: https://lore.kernel.org/all/DHZPHUFXB4N3.2RY28MUEWBNYK@google.com/ Signed-off-by: Kuba Piecuch <jpiecuch@google.com> Signed-off-by: Tejun Heo <tj@kernel.org>
6 daysMerge tag 'xsa48x-7.1-tag' of ↵Linus Torvalds2-2/+13
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "XSA-485 and XSA-487 security patches" * tag 'xsa48x-7.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/privcmd: fix double free via VMA splitting Buffer overflow in drivers/xen/sys-hypervisor.c
6 daysMerge tag 'cgroup-for-7.1-rc1-fixes' of ↵Linus Torvalds5-24/+44
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix UAF race in psi pressure_write() against cgroup file release by extending cgroup_mutex coverage and ordering of->priv access after cgroup_kn_lock_live() - Fix integer overflow in rdmacg_try_charge() when usage equals INT_MAX by performing the increment in s64 - Fix asymmetric DL bandwidth accounting on cpuset attach rollback by recording the CPU used by dl_bw_alloc() so cancel_attach() returns the reservation to the same root domain - Fix nr_dying_subsys_* race that briefly showed 0 in cgroup.stat after rmdir by incrementing from kill_css() instead of offline_css() - Typo fix in cgroup-v2 documentation * tag 'cgroup-for-7.1-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: docs: cgroup: fix typo 'protetion' -> 'protection' cgroup: Increment nr_dying_subsys_* from rmdir context cgroup/cpuset: record DL BW alloc CPU for attach rollback cgroup/rdma: fix integer overflow in rdmacg_try_charge() sched/psi: fix race between file release and pressure write
6 daysMerge tag 'fs_for_v7.1-rc2' of ↵Linus Torvalds5-13/+19
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull isofs and udf fixes from Jan Kara: "Several isofs and udf fixes" * tag 'fs_for_v7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: docs: isofs: replace dead ECMA-119 FTP link udf: reject descriptors with oversized CRC length isofs: use QSTR_LEN() in isofs_cmp isofs: validate block number from NFS file handle in isofs_export_iget isofs: validate Rock Ridge CE continuation extent against volume size
7 daysMerge tag 'fsnotify_for_v7.1-rc2' of ↵Linus Torvalds4-12/+50
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify fixes from Jan Kara: "Three fixes for fsnotify / fanotify" * tag 'fsnotify_for_v7.1-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: fix inode reference leak in fsnotify_recalc_mask() fanotify: Fix spelling mistake "enforecement" -> "enforcement" fanotify: fix false positive on permission events
7 daysMerge tag 'for-7.1-rc1-tag' of ↵Linus Torvalds7-29/+104
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - space reservation fixes: - correctly undo 'may_use' accounting for remap tree - avoid double decrement of 'may_use' when submitting async io - actually enable the shutdown ioctl callback (not just the superblock ops) - raid stripe tree fixes when deleting extents - add missing error handling - fix various incorrect values set - fix transaction state when removing a directory, possibly leading to EIO during log replay - additional b-tree node key checks during metadata readahead - error handling and transaction abort updates * tag 'for-7.1-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix double-decrement of bytes_may_use in submit_one_async_extent() btrfs: check return value of btrfs_partially_delete_raid_extent() btrfs: handle -EAGAIN from btrfs_duplicate_item and refresh stale leaf pointer btrfs: replace ASSERT with proper error handling in stripe lookup fallback btrfs: fix wrong min_objectid in btrfs_previous_item() call btrfs: fix raid stripe search missing entries at leaf boundaries btrfs: copy devid in btrfs_partially_delete_raid_extent() btrfs: handle unexpected free-space-tree key types btrfs: fix missing last_unlink_trans update when removing a directory btrfs: don't clobber errors in add_remap_tree_entries() btrfs: enable shutdown ioctl for non-experimental builds btrfs: apply first key check for readahead when possible btrfs: abort transaction in do_remap_reloc_trans() on failure btrfs: fix bytes_may_use leak in do_remap_reloc_trans() btrfs: fix bytes_may_use leak in move_existing_remap()
7 daysMerge tag 'for-7.1/dm-fixes' of ↵Linus Torvalds1-0/+8
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fix from Mikulas Patocka: - fix metadata corruption in dm-thin * tag 'for-7.1/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-thin: fix metadata refcount underflow
7 daysMerge tag 'mailbox-v7.1' of ↵Linus Torvalds18-119/+124
git://git.kernel.org/pub/scm/linux/kernel/git/jassibrar/mailbox Pull mailbox updates from Jassi Brar: - core: fix NULL message handling and add API to query TX queue slots - test: resolve concurrency bugs, dangling IRQs, and memory leaks - dt-bindings: qcom: add Eliza IPCC - mtk: fix address calculation and pointer handling bugs - cix: resolve SCMI suspend timeouts - misc memory allocation optimizations and cleanups * tag 'mailbox-v7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jassibrar/mailbox: mailbox: mailbox-test: make data_ready a per-instance variable mailbox: mailbox-test: initialize struct earlier mailbox: mailbox-test: don't free the reused channel mailbox: mailbox-test: handle channel errors consistently mailbox: update kdoc for struct mbox_controller mailbox: add sanity check for channel array mailbox: mailbox-test: free channels on probe error mailbox: prefix new constants with MBOX_ dt-bindings: mailbox: qcom-ipcc: Document the Eliza Inter-Processor Communication Controller mailbox: cix: Add IRQF_NO_SUSPEND to mailbox interrupt mailbox: Fix NULL message support in mbox_send_message() mailbox: remove superfluous internal header mailbox: correct kdoc title for mbox_bind_client mailbox: test: really ignore optional memory resources mailbox: exynos: drop superfluous mbox setting per channel mailbox: mtk-cmdq: Fix CURR and END addr for task insert case mailbox: mtk-vcp-mailbox: Fix the return value in mtk_vcp_mbox_xlate() mailbox: hi6220: kzalloc + kcalloc to kzalloc mailbox: rockchip: kzalloc + kcalloc to kzalloc mailbox: add API to query available TX queue slots
7 daysdocs: cgroup: fix typo 'protetion' -> 'protection'Petr Vaněk1-1/+1
Fix a small typo in the description of the memory_hugetlb_accounting mount option. Signed-off-by: Petr Vaněk <arkamar@atlas.cz> Signed-off-by: Tejun Heo <tj@kernel.org>
7 daysdocs: isofs: replace dead ECMA-119 FTP linkZiran Zhang1-1/+1
The original link is no longer valid. Replace it with the official PDF of the 2nd edition. The new link points to the exact 2nd edition that the existing comment in isofs.rst refers to. Signed-off-by: Ziran Zhang <zhangcoder@yeah.net> Link: https://patch.msgid.link/20260425142943.6809-1-zhangcoder@yeah.net Signed-off-by: Jan Kara <jack@suse.cz>
7 daysMerge tag 'kvm-x86-selftests_kernel_types-7.1' of ↵Paolo Bonzini185-2820/+2709
https://github.com/kvm-x86/linux into HEAD KVM selftests type renames for 7.1 Renames types across all KVM selftests to more closely align with types used in the kernel: vm_vaddr_t -> gva_t vm_paddr_t -> gpa_t uint64_t -> u64 uint32_t -> u32 uint16_t -> u16 uint8_t -> u8 int64_t -> s64 int32_t -> s32 int16_t -> s16 int8_t -> s8 Using the kernel's preferred types eliminates a source of friction for many contributors, as the majority of KVM selftests contributions come from kernel developers. The kernel names are also shorter, which allows for more concise code, and in any many cases eliminates newlines thanks to shorter types and parameter names. Rename variables and parameters as well as types, e.g. gpa instead of paddr, to again align with the kernel, and in a few cases to remove ambiguity, e.g. where paddr is used to refer to a _host_ physical address.
7 daysMerge tag 'kvmarm-fixes-7.1-1' of ↵Paolo Bonzini9-52/+86
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 7.1, take #1 - Allow tracing for non-pKVM, which was accidentally disabled when the series was merged - Rationalise the way the pKVM hypercall ranges are defined by using the same mechanism as already used for the vcpu_sysreg enum - Enforce that SMCCC function numbers relayed by the pKVM proxy are actually compliant with the specification - Fix a couple of feature to idreg mappings which resulted in the wrong sanitisation being applied - Fix the GICD_IIDR revision number field that could never been written correctly by userspace - Make kvm_vcpu_initialized() correctly use its parameter instead of relying on the surrounding context - Enforce correct ordering in __pkvm_init_vcpu(), plugging a potential pin leak at the same time - Move __pkvm_init_finalise() to a less dangerous spot, avoiding future problems - Restore functional userspace irqchip support after a four year breakage (last functional kernel was 5.18...). This is obviously ripe for garbage collection. - ... and the usual lot of spelling fixes
7 daysKVM: selftests: Add check_steal_time_uapi() implementation for LoongArchSean Christopherson1-0/+5
Define check_steal_time_uapi() for LoongArch so that the steal_time test builds. Note, while LoongArch's steal_time_init() has some funky asserts, none of the code is uniquely verifying KVM's uAPI. Cc: Jiakai Xu <xujiakai2025@iscas.ac.cn> Cc: Jiakai Xu <jiakaiPeanut@gmail.com> Cc: Andrew Jones <andrew.jones@oss.qualcomm.com> Cc: Anup Patel <anup@brainfault.org> Cc: Tianrui Zhao <zhaotianrui@loongson.cn> Cc: Bibo Mao <maobibo@loongson.cn> Cc: Huacai Chen <chenhuacai@kernel.org> Fixes: 40351ed924dd ("KVM: selftests: Refactor UAPI tests into dedicated function") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20260420192644.3892050-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
8 daysLinux 7.1-rc1v7.1-rc1Linus Torvalds1-2/+2
8 daysMerge tag 'clk-for-linus' of ↵Linus Torvalds1-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux Pull clk fix from Stephen Boyd: "One more fix for the merge window to avoid a boot hang on Raspberry Pi 3B by marking the VEC clk critical so that it doesn't get turned off and hang the bus" * tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: clk: bcm: rpi: Mark VEC clock as CLK_IGNORE_UNUSED
8 daysMerge tag 'tsm-for-7.1' of ↵Linus Torvalds1-10/+9
git://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm Pull PCIe TSP update from Dan Williams: "A small update for the TSM core. It is arguably a fix and coming in late as I have been offline the past few weeks: - Drop class_create() for the 'tsm' class" * tag 'tsm-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm: virt: coco: change tsm_class to a const struct
8 daysMerge tag 'kbuild-fixes-7.1-1' of ↵Linus Torvalds2-1/+9
git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux Pull Kbuild fixes from Nicolas Schier: - builddeb - avoid recompiles for non-cross-compiles Avoid triggering complete rebuilds for non-cross-compile Debian package builds by only triggering the rebuild of host tools for actual cross-compile builds - Never respect CONFIG_WERROR / W=e to fixdep Avoid spurious rebuilds of fixdep w/ and w/o -Werror during a single kbuild invocation by never respecting CONFIG_WERROR for fixdep * tag 'kbuild-fixes-7.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux: kbuild: Never respect CONFIG_WERROR / W=e to fixdep kbuild: builddeb - avoid recompiles for non-cross-compiles
8 daysMerge tag 'power-utilities-2026.04.25' of ↵Linus Torvalds3-195/+461
git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux Pull power utility updates from Len Brown: "x86_energy_perf_policy: - Initial SoC Slider support turbostat: - Display HT siblings in cpu# order - Add Module-ID column - Print Core-ID and APIC-ID in hex - Fix misc bugs" * tag 'power-utilities-2026.04.25' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: tools/power x86_energy_perf_policy: Version 2026.04.25 tools/power x86_energy_perf_policy.8: Document SoC Slider Options tools/power x86_energy_perf_policy: Enhances SoC Slider related checks tools/power turbostat: v2026.04.21 tools/power turbostat: Process HT siblings in CPU order tools/power turbostat: Show module_id column tools/power turbostat: Print core_id and apic_id in hex tools/power turbostat: Cleanup print helper functions tools/power turbostat: Fix --cpu-set 1 regression on HT systems tools/power turbostat: Fix --cpu-set 0 regression on HT systems tools/power turbostat: Fix unrecognized option '-P' tools/power turbostat: Fix AMD RAPL regression on big systems tools/power/x86: Add SOC slider and platform profile support
9 daysMerge tag 'rtc-7.1' of ↵Linus Torvalds20-90/+132
git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux Pull RTC updates from Alexandre Belloni: "Subsystem: - add data_race() in rtc_dev_poll() Drivers: - remove i2c_match_id usage - abx80x: Disable alarm feature if no interrupt attached - ti-k3: support resuming from IO DDR low power mode" * tag 'rtc-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: rtc: abx80x: Disable alarm feature if no interrupt attached rtc: ntxec: fix OF node reference imbalance rtc: pic32: allow driver to be compiled with COMPILE_TEST rtc: ti-k3: Add support to resume from IO DDR low power mode rtc: cmos: Use platform_get_irq_optional() in cmos_platform_probe() dt-bindings: rtc: add olpc,xo1-rtc to trivial-rtc dt-bindings: rtc: sc2731: Add compatible for SC2730 rtc: add data_race() in rtc_dev_poll() rtc: armada38x: zalloc + calloc to single allocation dt-bindings: rtc: isl12026: convert to YAML schema dt-bindings: rtc: microcrystal,rv3028: Allow to specify vdd-supply rtc: max77686: convert to i2c_new_ancillary_device dt-bindings: rtc: mpfs-rtc: permit resets rtc: rx8025: Remove use of i2c_match_id() rtc: rv8803: Remove use of i2c_match_id() rtc: rs5c372: Remove use of i2c_match_id() rtc: pcf2127: Remove use of i2c_match_id() rtc: m41t80: Remove use of i2c_match_id() rtc: abx80x: Remove use of i2c_match_id()
9 daysMerge tag 'for-next-tpm-7.1-rc1' of ↵Linus Torvalds7-40/+64
git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd Pull tpm updates from Jarkko Sakkinen: "Here are the accumulated fixes for 7.1-rc1 and a single structural change worth mentioning separately: Rafael's commit converting tpm_crb from ACPI driver to a platform driver" * tag 'for-next-tpm-7.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd: tpm: tpm_tis: stop transmit if retries are exhausted tpm: tpm_tis: add error logging for data transfer tpm: avoid -Wunused-but-set-variable tpm: Use kfree_sensitive() to free auth session in tpm_dev_release() tpm2-sessions: Fix missing tpm_buf_destroy() in tpm2_read_public() tpm: Fix auth session leak in tpm2_get_random() error path tpm: i2c: atmel: fix block comment formatting tpm_crb: Convert ACPI driver to a platform one tpm: Make tcpci_pm_ops variable static const
9 daysMerge branches 'turbostat' and 'x86_energy_perf_policy' into power-utilitiesLen Brown2-117/+308
9 daystools/power x86_energy_perf_policy: Version 2026.04.25Len Brown1-95/+66
Since v2025.11.22: Initial SoC Slider support SoC Slider is an SoC-wide power/performance policy setting. On SoC Slider systems, EPP plays a diminished role. Whitespace cleanup via: indent -npro -kr -i8 -ts8 -sob -l160 -ss -ncs -cp1 No functional changes Signed-off-by: Len Brown <len.brown@intel.com>
9 daystools/power x86_energy_perf_policy.8: Document SoC Slider OptionsLen Brown1-0/+26
x86_energy_perf_policy accesses the SoC Slider via standard user/kernel APIs to the processor_thermal_soc_slider driver. Machines that support SoC Slider largely use it instead of EPP, which may continue to exist in a diminished role, or vanish entirely. Signed-off-by: Len Brown <len.brown@intel.com>
9 daystools/power x86_energy_perf_policy: Enhances SoC Slider related checksLen Brown1-38/+104
When processor_thermal_soc_slider is loaded, its slider and offset modparams are visible. Check that the driver actually registered the profile named "SoC Slider" before reading or writing these modparams. n.b. This utility allows writing the Slider and Offset modparams even if the driver policy is not "balanced". Currently the processor_thermal_soc_slider consults those modparams only in "balanced" mode. Signed-off-by: Len Brown <len.brown@intel.com>
9 daysclk: bcm: rpi: Mark VEC clock as CLK_IGNORE_UNUSEDMaíra Canal1-0/+7
On Raspberry Pi 3B, the VEC clock is used by the VideoCore firmware display driver, which remains active until the vc4 driver loads and sends NOTIFY_DISPLAY_DONE. If this clock is disabled during boot, a bus lockup happens and the firmware becomes unresponsive, causing a complete system lockup. Mark the VEC clock with CLK_IGNORE_UNUSED so it survives the unused clock disablement and remains available until the vc4 driver takes over display management. Fixes: 672299736af6 ("clk: bcm: rpi: Manage clock rate in prepare/unprepare callbacks") Reported-by: Mark Brown <broonie@kernel.org> Closes: https://lore.kernel.org/r/5f0bec08-f458-4fba-8bf3-06817a100c4c@sirena.org.uk Signed-off-by: Maíra Canal <mcanal@igalia.com> Link: https://patch.msgid.link/20260401111416.562279-2-mcanal@igalia.com Tested-by: Mark Brown <broonie@kernel.org> Signed-off-by: Mark Brown <broonie@kernel.org> Acked-by: Brian Masney <bmasney@redhat.com> # Active contributor to clk Reviewed-by: Stefan Wahren <wahrenst@gmx.net> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
9 daysMerge tag 'fbdev-for-7.1-rc1-2' of ↵Linus Torvalds6-19/+22
git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev Pull fbdev fixes from Helge Deller: - request memory region before use (cobalt_lcdfb, clps711x-fb, hgafb) - reference cleanups in failure path (offb, savage) - a spelling fix (atyfb) * tag 'fbdev-for-7.1-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev: fbdev: hgafb: Request memory region before ioremap fbdev: clps711x-fb: Request memory region for MMIO fbdev: cobalt_lcdfb: Request memory region fbdev: atyfb: Fix spelling mistake "enfore" -> "enforce" fbdev: savage: fix probe-path EDID cleanup leaks fbdev: offb: fix PCI device reference leak on probe failure
9 daysMerge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linuxLinus Torvalds6-92/+128
Pull ARM updates from Russell King: - fix a race condition handling PG_dcache_clean - further cleanups for the fault handling, allowing RT to be enabled - fixing nzones validation in adfs filesystem driver - fix for module unwinding * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux: ARM: 9463/1: Allow to enable RT ARM: 9472/1: fix race condition on PG_dcache_clean in __sync_icache_dcache() ARM: 9471/1: module: fix unwind section relocation out of range error fs/adfs: validate nzones in adfs_validate_bblk() ARM: provide individual is_translation_fault() and is_permission_fault() ARM: move FSR fault status definitions before fsr_fs() ARM: use BIT() and GENMASK() for fault status register fields ARM: move is_permission_fault() and is_translation_fault() to fault.h ARM: move vmalloc() lazy-page table population ARM: ensure interrupts are enabled in __do_user_fault()
9 dayssched_ext: Release cpus_read_lock on scx_link_sched() failure in root enableTejun Heo1-1/+3
scx_root_enable_workfn() takes cpus_read_lock() before scx_link_sched(sch), but the `if (ret) goto err_disable` on failure skips the matching cpus_read_unlock() - all other err_disable gotos along this path drop the lock first. scx_link_sched() only returns non-zero on the sub-sched path (parent != NULL), so the leak path is unreachable via the root caller today. Still, the unwind is out of line with the surrounding paths. Drop cpus_read_lock() before goto err_disable. v2: Correct Fixes: tag (Andrea Righi). Fixes: 25037af712eb ("sched_ext: Add rhashtable lookup for sub-schedulers") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org>
9 dayssched_ext: Reject NULL-sch callers in scx_bpf_task_set_slice/dsq_vtimeTejun Heo1-2/+2
scx_prog_sched(aux) returns NULL for TRACING / SYSCALL BPF progs that have no struct_ops association when the root scheduler has sub_attach set. scx_bpf_task_set_slice() and scx_bpf_task_set_dsq_vtime() pass that NULL into scx_task_on_sched(sch, p), which under CONFIG_EXT_SUB_SCHED is rcu_access_pointer(p->scx.sched) == sch. For any non-scx task p->scx.sched is NULL, so NULL == NULL returns true and the authority gate is bypassed - a privileged but non-struct_ops-associated prog can poke p->scx.slice / p->scx.dsq_vtime on arbitrary tasks. Reject !sch up front so the gate only admits callers with a resolved scheduler. Fixes: 245d09c594ea ("sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Refuse cross-task select_cpu_from_kfunc callsTejun Heo1-2/+17
select_cpu_from_kfunc() skipped pi_lock for @p when called from ops.select_cpu() or another rq-locked SCX op, assuming the held lock protects @p. scx_bpf_select_cpu_dfl() / __scx_bpf_select_cpu_and() accept an arbitrary KF_RCU task_struct, so a caller in e.g. ops.select_cpu(p1) or ops.enqueue(p1) can pass some other p2 - the held pi_lock / rq lock is p1's, not p2's - and reading p2->cpus_ptr / nr_cpus_allowed races with set_cpus_allowed_ptr() and migrate_disable_switch() on another CPU. Abort the scheduler on cross-task calls in both branches: for ops.select_cpu() use scx_kf_arg_task_ok() to verify @p is the wake-up task recorded in current->scx.kf_tasks[] by SCX_CALL_OP_TASK_RET(); for other rq-locked SCX ops compare task_rq(p) against scx_locked_rq(). v2: Switch the in_select_cpu cross-task check from direct_dispatch_task comparison to scx_kf_arg_task_ok(). The former spuriously rejects when ops.select_cpu() calls scx_bpf_dsq_insert() first, then calls scx_bpf_select_cpu_*() on the same task. (Andrea Righi) Fixes: 0022b328504d ("sched_ext: Decouple kfunc unlocked-context check from kf_mask") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Align cgroup #ifdef guards with SUB_SCHED vs GROUP_SCHEDTejun Heo1-19/+22
Two EXT_GROUP_SCHED/SUB_SCHED guards are misclassified: - scx_root_enable_workfn()'s cgroup_get(cgrp) and the err_put_cgrp unwind in scx_alloc_and_add_sched() are under `#if GROUP || SUB`, but the matching cgroup_put() in scx_sched_free_rcu_work() is inside `#ifdef SUB` only (via sch->cgrp, stored only under SUB). GROUP-only would leak a reference on every root-sched enable. - sch_cgroup() / set_cgroup_sched() live under `#if GROUP || SUB` but touch SUB-only fields (sch->cgrp, cgroup->scx_sched). GROUP-only wouldn't compile. GROUP needs CGROUP_SCHED; SUB needs only CGROUPS. CGROUPS=y/CGROUP_SCHED=n gives the reachable GROUP=n, SUB=y combination; GROUP=y, SUB=n isn't reachable today (SUB is def_bool y under CGROUPS). Neither miscategorization triggers a real bug in any reachable config, but keep the guards honest: - Narrow cgroup_get and err_put_cgrp to `#ifdef SUB` (matches the free-side put). - Move sch_cgroup() and set_cgroup_sched() to a separate `#ifdef SUB` block with no-op stubs for the !SUB case; keep root_cgroup() and scx_cgroup_{ lock,unlock}() under `#if GROUP || SUB` since those only need cgroup core. Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Make bypass LB cpumasks per-schedulerTejun Heo2-14/+21
scx_bypass_lb_{donee,resched}_cpumask were file-scope statics shared by all scheduler instances. With CONFIG_EXT_SUB_SCHED, multiple sched instances each arm their own bypass_lb_timer; concurrent bypass_lb_node() calls RMW the global cpumasks with no lock, corrupting donee/resched decisions. Move the cpumasks into struct scx_sched, allocate them alongside the timer in scx_alloc_and_add_sched(), free them in scx_sched_free_rcu_work(). Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode") Cc: stable@vger.kernel.org # v6.19+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Pass held rq to SCX_CALL_OP() for core_sched_beforeTejun Heo1-1/+1
scx_prio_less() runs from core-sched's pick_next_task() path with rq locked but invokes ops.core_sched_before() with NULL locked_rq, leaving scx_locked_rq_state NULL. If the BPF callback calls a kfunc that re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu) - it re-acquires the already-held rq. Pass task_rq(a). Fixes: 7b0888b7cc19 ("sched_ext: Implement core-sched support") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Pass held rq to SCX_CALL_OP() for dump_cpu/dump_taskTejun Heo1-8/+6
scx_dump_state() walks CPUs with rq_lock_irqsave() held and invokes ops.dump_cpu / ops.dump_task with NULL locked_rq, leaving scx_locked_rq_state NULL. If the BPF callback calls a kfunc that re-acquires rq based on scx_locked_rq() - e.g. scx_bpf_cpuperf_set(cpu) - it re-acquires the already-held rq. Pass the held rq to SCX_CALL_OP(). Thread it into scx_dump_task() too. The pre-loop ops.dump call runs before rq_lock_irqsave() so keeps rq=NULL. Fixes: 07814a9439a3 ("sched_ext: Print debug dump after an error exit") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Save and restore scx_locked_rq across SCX_CALL_OPTejun Heo1-19/+30
SCX_CALL_OP{,_RET}() unconditionally clears scx_locked_rq_state to NULL on exit. Correct at the top level, but ops can recurse via scx_bpf_sub_dispatch(): a parent's ops.dispatch calls the helper, which invokes the child's ops.dispatch under another SCX_CALL_OP. When the inner call returns, the NULL clobbers the outer's state. The parent's BPF then calls kfuncs like scx_bpf_cpuperf_set() which read scx_locked_rq()==NULL and re-acquire the already-held rq. Snapshot scx_locked_rq_state on entry and restore on exit. Rename the rq parameter to locked_rq across all SCX_CALL_OP* macros so the snapshot local can be typed as 'struct rq *' without colliding with the parameter token in the expansion. SCX_CALL_OP_TASK{,_RET}() and SCX_CALL_OP_2TASKS_RET() funnel through the two base macros and inherit the fix. Fixes: 4f8b122848db ("sched_ext: Add basic building blocks for nested sub-scheduler dispatching") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Use dsq->first_task instead of list_empty() in dispatch_enqueue() ↵Tejun Heo1-4/+6
FIFO-tail dispatch_enqueue()'s FIFO-tail path used list_empty(&dsq->list) to decide whether to set dsq->first_task on enqueue. dsq->list can contain parked BPF iterator cursors (SCX_DSQ_LNODE_ITER_CURSOR), so list_empty() is not a reliable "no real task" check. If the last real task is unlinked while a cursor is parked, first_task becomes NULL; the next FIFO-tail enqueue then sees list_empty() == false and skips the first_task update, leaving scx_bpf_dsq_peek() returning NULL for a non-empty DSQ. Test dsq->first_task directly, which already tracks only real tasks and is maintained under dsq->lock. Fixes: 44f5c8ec5b9a ("sched_ext: Add lockless peek operation for DSQs") Cc: stable@vger.kernel.org # v6.19+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com> Cc: Ryan Newton <newton@meta.com>
9 dayssched_ext: Resolve caller's scheduler in scx_bpf_destroy_dsq() / ↵Tejun Heo1-8/+9
scx_bpf_dsq_nr_queued() scx_bpf_create_dsq() resolves the calling scheduler via scx_prog_sched(aux) and inserts the new DSQ into that scheduler's dsq_hash. Its inverse scx_bpf_destroy_dsq() and the query helper scx_bpf_dsq_nr_queued() were hard-coded to rcu_dereference(scx_root), so a sub-scheduler could only destroy or query DSQs in the root scheduler's hash - never its own. If the root had a DSQ with the same id, the sub-sched silently destroyed it and the root aborted on the next dispatch ("invalid DSQ ID 0x0.."). Take a const struct bpf_prog_aux *aux via KF_IMPLICIT_ARGS and resolve the scheduler with scx_prog_sched(aux), matching scx_bpf_create_dsq(). Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Read scx_root under scx_cgroup_ops_rwsem in cgroup settersTejun Heo1-3/+6
scx_group_set_{weight,idle,bandwidth}() cache scx_root before acquiring scx_cgroup_ops_rwsem, so the pointer can be stale by the time the op runs. If the loaded scheduler is disabled and freed (via RCU work) and another is enabled between the naked load and the rwsem acquire, the reader sees scx_cgroup_enabled=true (the new scheduler's) but dereferences the freed one - UAF on SCX_HAS_OP(sch, ...) / SCX_CALL_OP(sch, ...). scx_cgroup_enabled is toggled only under scx_cgroup_ops_rwsem write (scx_cgroup_{init,exit}), so reading scx_root inside the rwsem read section correlates @sch with the enabled snapshot. Fixes: a5bd6ba30b33 ("sched_ext: Use cgroup_lock/unlock() to synchronize against cgroup operations") Cc: stable@vger.kernel.org # v6.18+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Don't disable tasks in scx_sub_enable_workfn() abort pathTejun Heo1-6/+30
scx_sub_enable_workfn()'s prep loop calls __scx_init_task(sch, p, false) without transitioning task state, then sets SCX_TASK_SUB_INIT. If prep fails partway, the abort path runs __scx_disable_and_exit_task(sch, p) on the marked tasks. Task state is still the parent's ENABLED, so that dispatches to the SCX_TASK_ENABLED arm and calls scx_disable_task(sch, p) - i.e. child->ops.disable() - for tasks on which child->ops.enable() never ran. A BPF sub-scheduler allocating per-task state in enable/freeing in disable would operate on uninitialized state. The dying-task branch in scx_disable_and_exit_task() has the same problem, and scx_enabling_sub_sched was cleared before the abort cleanup loop - a task exiting during cleanup tripped the WARN and skipped both ops.exit_task and the SCX_TASK_SUB_INIT clear, leaking per-task resources and leaving the task stuck. Introduce scx_sub_init_cancel_task() that calls ops.exit_task with cancelled=true - matching what the top-level init path does when init_task itself returns -errno. Use it in the abort loop and in the dying-task branch. scx_enabling_sub_sched now stays set until the abort loop finishes clearing SUB_INIT, so concurrent exits hitting the dying-task branch can still find @sch. That branch also clears SCX_TASK_SUB_INIT unconditionally when seen, leaving the task unmarked even if the WARN fires. Fixes: 337ec00b1d9c ("sched_ext: Implement cgroup sub-sched enabling and disabling") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Skip tasks with stale task_rq in bypass_lb_cpu()Tejun Heo1-0/+9
bypass_lb_cpu() transfers tasks between per-CPU bypass DSQs without migrating them - task_cpu() only updates when the donee later consumes the task via move_remote_task_to_local_dsq(). If the LB timer fires again before consumption and the new DSQ becomes a donor, @p is still on the previous CPU and task_rq(@p) != donor_rq. @p can't be moved without its own rq locked. Skip such tasks. Fixes: 95d1df610cdc ("sched_ext: Implement load balancer for bypass mode") Cc: stable@vger.kernel.org # v6.19+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Guard scx_dsq_move() against NULL kit->dsq after failed iter_newTejun Heo1-1/+11
bpf_iter_scx_dsq_new() clears kit->dsq on failure and bpf_iter_scx_dsq_{next,destroy}() guard against that. scx_dsq_move() doesn't - it dereferences kit->dsq immediately, so a BPF program that calls scx_bpf_dsq_move[_vtime]() after a failed iter_new oopses the kernel. Return false if kit->dsq is NULL. Fixes: 4c30f5ce4f7a ("sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()") Cc: stable@vger.kernel.org # v6.12+ Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Unregister sub_kset on scheduler disableTejun Heo1-0/+6
When ops.sub_attach is set, scx_alloc_and_add_sched() creates sub_kset as a child of &sch->kobj, which pins the parent with its own reference. The disable paths never call kset_unregister(), so the final kobject_put() in bpf_scx_unreg() leaves a stale reference and scx_kobj_release() never runs, leaking the whole struct scx_sched on every load/unload cycle. Unregister sub_kset in scx_root_disable() and scx_sub_disable() before kobject_del(&sch->kobj). Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
9 dayssched_ext: Defer scx_hardlockup() out of NMITejun Heo1-6/+27
scx_hardlockup() runs from NMI and eventually calls scx_claim_exit(), which takes scx_sched_lock. scx_sched_lock isn't NMI-safe and grabbing it from NMI context can lead to deadlocks. The hardlockup handler is best-effort recovery and the disable path it triggers runs off of irq_work anyway. Move the handle_lockup() call into an irq_work so it runs in IRQ context. Fixes: ebeca1f930ea ("sched_ext: Introduce cgroup sub-sched support") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>