path: root/include/linux
Age | Commit message | Author | Files | Lines
2026-01-21 | bpf: Verifier support for KF_IMPLICIT_ARGS | Ihor Solodrai | 1 | -0/+1
A kernel function bpf_foo marked with the KF_IMPLICIT_ARGS flag is expected to have two associated types in BTF:
* `bpf_foo` with a function prototype that omits implicit arguments
* `bpf_foo_impl` with a function prototype that matches the kernel declaration of `bpf_foo`, but doesn't have a ksym associated with its name

In order to support kfuncs with implicit arguments, the verifier has to know how to resolve a call of `bpf_foo` to the correct BTF function prototype and address.

To implement this, in add_kfunc_call() kfunc flags are checked for KF_IMPLICIT_ARGS. For such kfuncs the BTF func prototype is adjusted to the one found for the `bpf_foo_impl` (func_name + "_impl" suffix, by convention) function in BTF.

This effectively changes the signature of the `bpf_foo` kfunc in the context of verification: from one without implicit args to the one with the full argument list.

The values of implicit arguments by design are provided by the verifier, and so they can only be of particular types. In this patch the only allowed implicit arg type is a pointer to struct bpf_prog_aux.

In order for the verifier to correctly set an implicit bpf_prog_aux arg value at runtime, is_kfunc_arg_prog() is extended to check for the arg type. At the point when the prog arg is determined in check_kfunc_args(), the kfunc with implicit args already has a prototype with the full argument list, so the existing value patch mechanism just works.

If a new kfunc with KF_IMPLICIT_ARGS is declared for an existing kfunc that uses a __prog argument (a legacy case), the prototype substitution works in exactly the same way, assuming the kfunc follows the _impl naming convention. The difference is only in how the _impl prototype is added to the BTF, which is not the verifier's concern. See a subsequent resolve_btfids patch for details.

The __prog suffix is still supported at this point, but will be removed in a subsequent patch, after current users are moved to KF_IMPLICIT_ARGS.
Introduction of KF_IMPLICIT_ARGS revealed an issue with zero-extension tracking, because an explicit rX = 0 in place of the verifier-supplied argument is now absent if the arg is implicit (the BPF prog doesn't pass a dummy NULL anymore). To mitigate this, reset the subreg_def of all caller saved registers in check_kfunc_call() [1].

[1] https://lore.kernel.org/bpf/b4a760ef828d40dac7ea6074d39452bb0dc82caa.camel@gmail.com/

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-4-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
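The naming convention described in this commit (func_name + "_impl") can be illustrated with a small user-space sketch. The helper name `kfunc_impl_name()` and its buffer handling are assumptions made for illustration only; this is not the verifier's actual code, which looks the resulting name up in BTF.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not kernel code): build the name of the BTF
 * function that carries the full argument list, following the
 * func_name + "_impl" convention from the commit message.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int kfunc_impl_name(const char *func_name, char *buf, size_t buf_len)
{
	int n = snprintf(buf, buf_len, "%s_impl", func_name);

	/* snprintf() returns the length it needed, excluding the NUL. */
	if (n < 0 || (size_t)n >= buf_len)
		return -1;
	return 0;
}
```

Calling `kfunc_impl_name("bpf_foo", buf, sizeof(buf))` with a large enough buffer leaves "bpf_foo_impl" in `buf`; a real implementation would then search BTF for a function with that name.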
2026-01-21 | bpf: Refactor btf_kfunc_id_set_contains | Ihor Solodrai | 1 | -2/+2
btf_kfunc_id_set_contains() is called by fetch_kfunc_meta() in the BPF verifier to get the kfunc flags stored in the .BTF_ids ELF section. If it returns NULL instead of a valid pointer, it's interpreted as an illegal kfunc usage failing the verification.

There are two potential reasons for btf_kfunc_id_set_contains() to return NULL:
1. Provided kfunc BTF id is not present in relevant kfunc id sets.
2. The kfunc is not allowed, as determined by the program type specific filter [1].

The filter functions accept a pointer to `struct bpf_prog`, so they might implicitly depend on earlier stages of verification, when bpf_prog members are set. For example, bpf_qdisc_kfunc_filter() in linux/net/sched/bpf_qdisc.c inspects prog->aux->st_ops [2], which is initialized in:

    check_attach_btf_id() -> check_struct_ops_btf_id()

So far this hasn't been an issue, because fetch_kfunc_meta() is the only caller of btf_kfunc_id_set_contains(). However in subsequent patches of this series it is necessary to inspect kfunc flags earlier in BPF verifier, in the add_kfunc_call().

To resolve this, refactor btf_kfunc_id_set_contains() into two interface functions:
* btf_kfunc_flags() that simply returns pointer to kfunc_flags without applying the filters
* btf_kfunc_is_allowed() that both checks for kfunc_flags existence (which is a requirement for a kfunc to be allowed) and applies the prog filters

See [3] for the previous version of this patch.

[1] https://lore.kernel.org/all/20230519225157.760788-7-aditi.ghag@isovalent.com/
[2] https://lore.kernel.org/all/20250409214606.2000194-4-ameryhung@gmail.com/
[3] https://lore.kernel.org/bpf/20251029190113.3323406-3-ihor.solodrai@linux.dev/

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
Link: https://lore.kernel.org/r/20260120222638.3976562-2-ihor.solodrai@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
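The split into a pure lookup and a lookup-plus-filter check can be sketched as follows. Every type and function name here (`struct kfunc_set`, `kfunc_flags()`, `kfunc_is_allowed()`, `reject_type_zero()`) is a hypothetical stand-in chosen for the example; the kernel's real data structures and signatures differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the kernel's types. */
struct prog { int type; };

struct kfunc_entry {
	uint32_t btf_id;
	uint32_t flags;
};

/* A filter returns nonzero to reject the kfunc for this program. */
typedef int (*kfunc_filter_t)(const struct prog *prog, uint32_t btf_id);

struct kfunc_set {
	struct kfunc_entry *entries;
	size_t cnt;
	kfunc_filter_t filter;	/* may be NULL */
};

/* Pure lookup: never touches the program, so it can run early. */
static uint32_t *kfunc_flags(const struct kfunc_set *set, uint32_t btf_id)
{
	for (size_t i = 0; i < set->cnt; i++)
		if (set->entries[i].btf_id == btf_id)
			return &set->entries[i].flags;
	return NULL;
}

/* Lookup plus filter: flags must exist and the filter must not reject. */
static int kfunc_is_allowed(const struct kfunc_set *set, uint32_t btf_id,
			    const struct prog *prog)
{
	if (!kfunc_flags(set, btf_id))
		return 0;
	if (set->filter && set->filter(prog, btf_id))
		return 0;
	return 1;
}

/* Example filter for the sketch: reject programs of type 0. */
static int reject_type_zero(const struct prog *prog, uint32_t btf_id)
{
	(void)btf_id;
	return prog->type == 0;
}
```

The point of the split is that the first function depends only on the id set, while the second may depend on program state that is initialized later in verification.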
2026-01-20 | bpf/verifier: Optimize ID mapping reset in states_equal | Qiliang Yuan | 1 | -0/+1
Currently, reset_idmap_scratch() performs a 4.7KB memset() in every states_equal() call. Optimize this by using a counter to track used ID mappings, replacing the O(N) memset() with an O(1) reset and bounding the search loop in check_ids(). Signed-off-by: Qiliang Yuan <realwujing@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/bpf/20260120023234.77673-1-realwujing@gmail.com
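The counter-based reset can be sketched in a few lines. The structure layout, the capacity `IDMAP_MAX`, and the function names below are assumptions for illustration; they mirror the idea in the commit message (track how many slots are used, reset by zeroing the counter, bound the search by the counter) rather than the verifier's actual code.

```c
#include <stdint.h>

#define IDMAP_MAX 8	/* hypothetical capacity for the sketch */

struct id_pair {
	uint32_t old_id;
	uint32_t cur_id;
};

struct idmap_scratch {
	struct id_pair map[IDMAP_MAX];
	uint32_t cnt;	/* number of slots currently in use */
};

/* O(1) reset: slots at index >= cnt are never read, so no memset(). */
static void idmap_reset(struct idmap_scratch *s)
{
	s->cnt = 0;
}

/*
 * Returns 1 if mapping old_id -> cur_id is consistent with the mappings
 * recorded so far (recording it if new), 0 otherwise.
 */
static int idmap_check(struct idmap_scratch *s, uint32_t old_id, uint32_t cur_id)
{
	/* The loop is bounded by the number of used slots, not the capacity. */
	for (uint32_t i = 0; i < s->cnt; i++) {
		if (s->map[i].old_id == old_id)
			return s->map[i].cur_id == cur_id;
		if (s->map[i].cur_id == cur_id)
			return 0;
	}
	if (s->cnt == IDMAP_MAX)
		return 0;	/* out of slots */
	s->map[s->cnt].old_id = old_id;
	s->map[s->cnt].cur_id = cur_id;
	s->cnt++;
	return 1;
}
```

Because stale entries beyond `cnt` are simply overwritten on the next use, the cost of a reset no longer depends on the size of the scratch array.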
2026-01-16 | arm64/ftrace,bpf: Fix partial regs after bpf_prog_run | Jiri Olsa | 1 | -0/+25
Mahe reported an issue with the bpf_override_return helper not working when executed from a kprobe.multi bpf program on arm. The problem is that on arm we use alternate storage for the pt_regs object that is passed to bpf_prog_run, and if any register is changed (as bpf_override_return does) the change is not propagated back to the actual pt_regs object.

Fix this by introducing and calling the ftrace_partial_regs_update function to propagate the values of the changed registers (ip and stack).

Reported-by: Mahe Tardy <mahe.tardy@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/bpf/20260112121157.854473-1-jolsa@kernel.org
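The copy-back step can be pictured with a minimal sketch. The two structures and the function `partial_regs_update()` below are hypothetical simplifications; the real ftrace_partial_regs_update() operates on the kernel's ftrace_regs and pt_regs types, whose layout is not shown in this log.

```c
/* Hypothetical reduced register sets, not the kernel's structures. */
struct scratch_regs {		/* temporary copy handed to the BPF program */
	unsigned long pc;
	unsigned long sp;
	unsigned long gpr[4];
};

struct real_regs {		/* the state actually restored on return */
	unsigned long pc;
	unsigned long sp;
};

/*
 * After the program ran on the scratch copy, propagate the registers
 * it may have changed (instruction pointer and stack) back.
 */
static void partial_regs_update(struct real_regs *real,
				const struct scratch_regs *scratch)
{
	real->pc = scratch->pc;
	real->sp = scratch->sp;
}
```

Without such a step, a helper that rewrites the instruction pointer in the scratch copy has no effect on where execution resumes.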
2026-01-15 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc5 | Alexei Starovoitov | 37 | -68/+130
Cross-merge BPF and other fixes after downstream PR. No conflicts. Adjacent: Auto-merging MAINTAINERS Auto-merging Makefile Auto-merging kernel/bpf/verifier.c Auto-merging kernel/sched/ext.c Auto-merging mm/memcontrol.c Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-14 | btf: Optimize type lookup with binary search | Donglin Peng | 1 | -0/+1
Improve btf_find_by_name_kind() performance by adding binary search support for sorted types. Falls back to linear search for compatibility. Signed-off-by: Donglin Peng <pengdonglin@xiaomi.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20260109130003.3313716-7-dolinux.peng@gmail.com
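A binary search with a linear fallback can be sketched as below. The `struct name_table` and `find_by_name()` are hypothetical and deliberately simplified: they search a flat array of names only, whereas btf_find_by_name_kind() also matches on the type kind and works on BTF type records.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flat name table standing in for BTF types. */
struct name_table {
	const char **names;
	size_t cnt;
	int sorted;	/* nonzero if names[] is sorted with strcmp() order */
};

/* Returns the index of name, or -1 if it is not present. */
static long find_by_name(const struct name_table *t, const char *name)
{
	if (t->sorted) {
		size_t lo = 0, hi = t->cnt;

		/* Half-open interval [lo, hi) binary search. */
		while (lo < hi) {
			size_t mid = lo + (hi - lo) / 2;
			int c = strcmp(t->names[mid], name);

			if (c == 0)
				return (long)mid;
			if (c < 0)
				lo = mid + 1;
			else
				hi = mid;
		}
		return -1;
	}

	/* Fallback for unsorted tables: plain linear scan. */
	for (size_t i = 0; i < t->cnt; i++)
		if (strcmp(t->names[i], name) == 0)
			return (long)i;
	return -1;
}
```

Keeping the linear path means tables produced by older tooling, which are not sorted, still resolve correctly.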
2026-01-12 | Merge tag 'cgroup-for-6.19-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup | Linus Torvalds | 1 | -11/+14
Pull cgroup fix from Tejun Heo:
- Fix -Wflex-array-member-not-at-end warnings in cgroup_root

* tag 'cgroup-for-6.19-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Eliminate cgrp_ancestor_storage in cgroup_root
2026-01-11 | treewide: Update email address | Thomas Gleixner | 12 | -12/+12
In a vain attempt to consolidate the email zoo switch everything to the kernel.org account. Signed-off-by: Thomas Gleixner <tglx@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-01-09 | Merge tag 'vfs-6.19-rc5.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs | Linus Torvalds | 2 | -2/+7
Pull vfs fixes from Christian Brauner:
- Remove incorrect __user annotation from struct xattr_args::value
- Documentation fix: Add missing kernel-doc description for the @isnew parameter in ilookup5_nowait() to silence Sphinx warnings
- Documentation fix: Fix kernel-doc comment for __start_dirop() - the function name in the comment was wrong and the @state parameter was undocumented
- Replace dynamic folio_batch allocation with stack allocation in iomap_zero_range(). The dynamic allocation was problematic for ext4-on-iomap work (didn't handle allocation failure properly) and triggered lockdep complaints. Uses a flag instead to control batch usage
- Re-add #ifdef guards around PIDFD_GET_<ns-type>_NAMESPACE ioctls. When a namespace type is disabled, ns->ops is NULL, which causes crashes during inode eviction when closing the fd. The ifdefs were removed in a recent simplification but are still needed
- Fix a race where a folio could be unlocked before the trailing zeros (for EOF within the page) were written
- Split out a dedicated lease_dispose_list() helper since lease code paths always know they're disposing of leases. Removes unnecessary runtime flag checks and prepares for upcoming lease_manager enhancements
- Fix userland delegation requests succeeding despite conflicting opens. Previously, FL_LAYOUT and FL_DELEG leases bypassed conflict checks (a hack for nfsd). Adds new ->lm_open_conflict() lease_manager operation so userland delegations get proper conflict checking while nfsd can continue its own conflict handling
- Fix LOOKUP_CACHED path lookups incorrectly falling through to the slow path. After legitimize_links() calls were conditionally elided, the routine would always fail with LOOKUP_CACHED regardless of whether there were any links. Now the flag is checked at the two callsites before calling legitimize_links()
- Fix bug in media fd allocation in media_request_alloc()
- Fix mismatched API calls in ecryptfs_mknod(): was calling end_removing() instead of end_creating() after ecryptfs_start_creating_dentry()
- Fix dentry reference count leak in ecryptfs_mkdir(): a dget() of the lower parent dir was added but never dput()'d, causing BUG during lower filesystem unmount due to the still-in-use dentry

* tag 'vfs-6.19-rc5.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
  pidfs: protect PIDFD_GET_* ioctls() via ifdef
  ecryptfs: Release lower parent dentry after creating dir
  ecryptfs: Fix improper mknod pairing of start_creating()/end_removing()
  get rid of bogus __user in struct xattr_args::value
  VFS: fix __start_dirop() kernel-doc warnings
  fs: Describe @isnew parameter in ilookup5_nowait()
  fs: make sure to fail try_to_unlazy() and try_to_unlazy() for LOOKUP_CACHED
  netfs: Fix early read unlock of page with EOF in middle
  filelock: allow lease_managers to dictate what qualifies as a conflict
  filelock: add lease_dispose_list() helper
  iomap: replace folio_batch allocation with stack allocation
  media: mc: fix potential use-after-free in media_request_alloc()
2026-01-08 | Merge tag 'trace-v6.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace | Linus Torvalds | 2 | -1/+10
Pull tracing fixes from Steven Rostedt:

- Remove useless assignment of soft_mode variable

  The function __ftrace_event_enable_disable() sets "soft_mode" in one of the branch paths but doesn't use it after that. Remove the setting of that variable.

- Add a cond_resched() in ring_buffer_resize()

  The resize function that allocates all the pages for the ring buffer was causing a soft lockup on PREEMPT_NONE configs when allocating large buffers on machines with many CPUs. Hopefully this is the last cond_resched() needed to be added as PREEMPT_LAZY becomes the norm in the future.

- Make ftrace_graph_ent depth field signed

  The "depth" field of struct ftrace_graph_ent was converted from "int" to "unsigned long" for alignment reasons to work with being embedded in other structures. The conversion from signed to unsigned caused integrity checks to always pass as they were comparing "depth" to less than zero. Make the field signed long.

- Add recursion protection to stack trace events

  An infinite recursion was triggered by a stack trace event calling RCU which internally called rcu_read_unlock_special(), which triggered an event that was also doing stacktraces, which caused it to trigger the same RCU lock that called rcu_read_unlock_special() again. Update the trace_test_and_set_recursion() to add a set of context checks for events to use, and have the stack trace event use that for recursion protection.

- Make the variable ftrace_dump_on_oops static

  The cleanup of sysctl that moved all the updates to the files that use them moved the reference of ftrace_dump_on_oops to where it is used. It is no longer used outside of the trace.c file. Make it static.

* tag 'trace-v6.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  trace: ftrace_dump_on_oops[] is not exported, make it static
  tracing: Add recursion protection in kernel stack trace recording
  ftrace: Make ftrace_graph_ent depth field signed
  ring-buffer: Avoid softlockup in ring_buffer_resize() during memory free
  tracing: Drop unneeded assignment to soft_mode
2026-01-08 | Merge tag 'net-6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds | 2 | -5/+6
Pull networking fixes from Jakub Kicinski:
"Including fixes from netfilter and wireless.

Current release - fix to a fix:
- net: do not write to msg_get_inq in callee
- arp: do not assume dev_hard_header() does not change skb->head

Current release - regressions:
- wifi: mac80211: don't iterate not running interfaces
- eth: mlx5: fix NULL pointer dereference in ioctl module EEPROM

Current release - new code bugs:
- eth: bnge: add AUXILIARY_BUS to Kconfig dependencies

Previous releases - regressions:
- eth: mlx5: dealloc forgotten PSP RX modify header

Previous releases - always broken:
- ping: fix ICMP out SNMP stats double-counting with ICMP sockets
- bonding: preserve NETIF_F_ALL_FOR_ALL across TSO updates
- bridge: fix C-VLAN preservation in 802.1ad vlan_tunnel egress
- eth: bnxt: fix potential data corruption with HW GRO/LRO"

* tag 'net-6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (70 commits)
  arp: do not assume dev_hard_header() does not change skb->head
  net: enetc: fix build warning when PAGE_SIZE is greater than 128K
  atm: Fix dma_free_coherent() size
  tools: ynl: don't install tests
  net: do not write to msg_get_inq in callee
  bnxt_en: Fix NULL pointer crash in bnxt_ptp_enable during error cleanup
  net: usb: pegasus: fix memory leak in update_eth_regs_async()
  net: 3com: 3c59x: fix possible null dereference in vortex_probe1()
  net/sched: sch_qfq: Fix NULL deref when deactivating inactive aggregate in qfq_reset
  wifi: mac80211: collect station statistics earlier when disconnect
  wifi: mac80211: restore non-chanctx injection behaviour
  wifi: mac80211_hwsim: disable BHs for hwsim_radio_lock
  wifi: mac80211: don't iterate not running interfaces
  wifi: mac80211_hwsim: fix typo in frequency notification
  wifi: avoid kernel-infoleak from struct iw_point
  net: airoha: Fix schedule while atomic in airoha_ppe_deinit()
  selftests: netdevsim: add carrier state consistency test
  net: netdevsim: fix inconsistent carrier state after link/unlink
  selftests: drv-net: Bring back tool() to driver __init__s
  net/sched: act_api: avoid dereferencing ERR_PTR in tcf_idrinfo_destroy
  ...
2026-01-08 | cgroup: Eliminate cgrp_ancestor_storage in cgroup_root | Michal Koutný | 1 | -11/+14
The cgrp_ancestor_storage has two drawbacks: - it's not guaranteed that the member immediately follows struct cgrp in cgroup_root (root cgroup's ancestors[0] might thus point to a padding and not in cgrp_ancestor_storage proper), - this idiom raises warnings with -Wflex-array-member-not-at-end. Instead of relying on the auxiliary member in cgroup_root, define the 0-th level ancestor inside struct cgroup (needed for static allocation of cgrp_dfl_root), deeper cgroups would allocate flexible _low_ancestors[]. Unionized alias through ancestors[] will transparently join the two ranges. The above change would still leave the flexible array at the end of struct cgroup inside cgroup_root, so move cgrp also towards the end of cgroup_root to resolve the -Wflex-array-member-not-at-end. Link: https://lore.kernel.org/r/5fb74444-2fbb-476e-b1bf-3f3e279d0ced@embeddedor.com/ Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Closes: https://lore.kernel.org/r/b3eb050d-9451-4b60-b06c-ace7dab57497@embeddedor.com/ Cc: David Laight <david.laight.linux@gmail.com> Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-01-07 | tracing: Add recursion protection in kernel stack trace recording | Steven Rostedt | 1 | -0/+9
A bug was reported about an infinite recursion caused by tracing the rcu events with the kernel stack trace trigger enabled. The stack trace code called back into RCU which then called the stack trace again. Expand the ftrace recursion protection to add a set of bits to protect events from recursion. Each bit represents the context that the event is in (normal, softirq, interrupt and NMI). Have the stack trace code use the interrupt context to protect against recursion. Note, the bug showed an issue in both the RCU code as well as the tracing stacktrace code. This only handles the tracing stack trace side of the bug. The RCU fix will be handled separately. Link: https://lore.kernel.org/all/20260102122807.7025fc87@gandalf.local.home/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Link: https://patch.msgid.link/20260105203141.515cd49f@gandalf.local.home Reported-by: Yao Kai <yaokai34@huawei.com> Tested-by: Yao Kai <yaokai34@huawei.com> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
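The "one bit per context" idea can be sketched as a small test-and-set guard. The enum, the function names, and the use of a plain variable for the mask are all assumptions for illustration; the kernel keeps this state per task and derives the context from the preempt count, which this sketch does not model.

```c
/* Hypothetical context identifiers for the sketch. */
enum trace_ctx {
	CTX_NMI,
	CTX_IRQ,
	CTX_SOFTIRQ,
	CTX_NORMAL,
};

/*
 * Try to enter the protected section for context c.
 * Returns the bit taken on success, or -1 if this context is already
 * inside the section (i.e. we are recursing).
 */
static int recursion_enter(unsigned long *mask, enum trace_ctx c)
{
	unsigned long bit = 1UL << c;

	if (*mask & bit)
		return -1;
	*mask |= bit;
	return (int)c;
}

/* Leave the protected section, releasing the bit taken on entry. */
static void recursion_exit(unsigned long *mask, int bit)
{
	*mask &= ~(1UL << bit);
}
```

Using a separate bit per context lets an interrupt that arrives while normal-context code is inside the section still record its own event, while a true re-entry from the same context is rejected.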
2026-01-07 | ftrace: Make ftrace_graph_ent depth field signed | Steven Rostedt | 1 | -1/+1
The code has integrity checks to make sure that depth never goes below zero. But the depth field has recently been converted to unsigned long from "int" (for alignment reasons). As unsigned long can never be less than zero, the integrity checks no longer work. Convert depth to long from unsigned long to allow the integrity checks to work again. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: pengdonglin <pengdonglin@xiaomi.com> Link: https://patch.msgid.link/20260102143148.251c2e16@gandalf.local.home Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/all/aS6kGi0maWBl-MjZ@stanley.mountain/ Fixes: f83ac7544fbf7 ("function_graph: Enable funcgraph-args and funcgraph-retaddr to work simultaneously") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
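Why the check stopped working is easy to reproduce in isolation. The two structures below are hypothetical reductions of the before and after layouts, with only the depth field kept; a compiler may warn that the unsigned comparison is always false, which is exactly the problem being shown.

```c
/* Hypothetical reduced layouts: only the depth field is kept. */
struct ent_unsigned {
	unsigned long depth;	/* layout before the fix */
};

struct ent_signed {
	long depth;		/* layout after the fix */
};

/* Always returns 0: an unsigned value can never be below zero. */
static int depth_underflowed_unsigned(const struct ent_unsigned *e)
{
	return e->depth < 0;
}

/* Works as intended once the field is signed. */
static int depth_underflowed_signed(const struct ent_signed *e)
{
	return e->depth < 0;
}
```

Storing -1 in the unsigned field yields a very large positive value, so the underflow goes unnoticed; the same store into the signed field is caught by the check.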
2026-01-07 | bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_cgroup_storage maps | Leon Hwang | 2 | -2/+3
Introduce BPF_F_ALL_CPUS flag support for percpu_cgroup_storage maps to allow updating values for all CPUs with a single value for update_elem API.

Introduce BPF_F_CPU flag support for percpu_cgroup_storage maps to allow:
* update value for specified CPU for update_elem API.
* lookup value for specified CPU for lookup_elem API.

The BPF_F_CPU flag is passed via map_flags along with embedded cpu info.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-6-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-07 | bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_hash and lru_percpu_hash maps | Leon Hwang | 1 | -1/+3
Introduce BPF_F_ALL_CPUS flag support for percpu_hash and lru_percpu_hash maps to allow updating values for all CPUs with a single value for both update_elem and update_batch APIs.

Introduce BPF_F_CPU flag support for percpu_hash and lru_percpu_hash maps to allow:
* update value for specified CPU for both update_elem and update_batch APIs.
* lookup value for specified CPU for both lookup_elem and lookup_batch APIs.

The BPF_F_CPU flag is passed via:
* map_flags along with embedded cpu info.
* elem_flags along with embedded cpu info.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-4-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-07 | bpf: Add BPF_F_CPU and BPF_F_ALL_CPUS flags support for percpu_array maps | Leon Hwang | 1 | -2/+7
Introduce support for the BPF_F_ALL_CPUS flag in percpu_array maps to allow updating values for all CPUs with a single value for both update_elem and update_batch APIs.

Introduce support for the BPF_F_CPU flag in percpu_array maps to allow:
* update value for specified CPU for both update_elem and update_batch APIs.
* lookup value for specified CPU for both lookup_elem and lookup_batch APIs.

The BPF_F_CPU flag is passed via:
* map_flags of lookup_elem and update_elem APIs along with embedded cpu info.
* elem_flags of lookup_batch and update_batch APIs along with embedded cpu info.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-3-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-07 | bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags | Leon Hwang | 1 | -1/+22
Introduce the BPF_F_CPU and BPF_F_ALL_CPUS flags and check them for the following APIs:
* 'map_lookup_elem()'
* 'map_update_elem()'
* 'generic_map_lookup_batch()'
* 'generic_map_update_batch()'

Also compute the correct value size for these APIs.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260107022022.12843-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
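The commits in this series say the CPU number is "embedded" in the flags word, without stating where. The sketch below assumes, purely for illustration, that the CPU lives in the upper 32 bits of a 64-bit flags value; that layout, the `EX_F_*` bit values, and all function names are assumptions and should be checked against the UAPI header before being relied on.

```c
#include <stdint.h>

/* Hypothetical bit values; the real ones are defined in the UAPI header. */
#define EX_F_CPU	(1ULL << 3)
#define EX_F_ALL_CPUS	(1ULL << 4)

/* Assumed layout: CPU number in the upper 32 bits of the flags word. */
static uint64_t flags_for_cpu(uint32_t cpu)
{
	return ((uint64_t)cpu << 32) | EX_F_CPU;
}

static uint32_t flags_cpu(uint64_t flags)
{
	return (uint32_t)(flags >> 32);
}

/*
 * Validation along the lines the commit describes: in this sketch the
 * two flags are mutually exclusive, a CPU number is only meaningful
 * together with EX_F_CPU, and it must name an existing CPU.
 */
static int flags_valid(uint64_t flags, uint32_t nr_cpus)
{
	if ((flags & EX_F_CPU) && (flags & EX_F_ALL_CPUS))
		return 0;
	if ((flags & EX_F_CPU) && flags_cpu(flags) >= nr_cpus)
		return 0;
	if (!(flags & EX_F_CPU) && flags_cpu(flags) != 0)
		return 0;
	return 1;
}
```

With a single-CPU flag the value buffer only needs to hold one element, whereas a plain per-CPU lookup returns one element per possible CPU, which is why the commit also adjusts the value size computed for these APIs.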
2026-01-06 | mm: drop mem_cgroup_usage() declaration from memcontrol.h | Roman Gushchin | 1 | -1/+0
mem_cgroup_usage() is not used outside of memcg-v1 code; the declaration was added by mistake.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://lore.kernel.org/r/20260106042313.140256-1-roman.gushchin@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-06 | net: airoha: Fix npu rx DMA definitions | Lorenzo Bianconi | 1 | -4/+4
Fix typos in npu rx DMA descriptor definitions. Fixes: b3ef7bdec66fb ("net: airoha: Add airoha_offload.h header") Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Link: https://patch.msgid.link/20260102-airoha-npu-dma-rx-def-fixes-v1-1-205fc6bf7d94@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-04 | netdev: preserve NETIF_F_ALL_FOR_ALL across TSO updates | Di Zhu | 1 | -1/+2
Directly incrementing the TSO features incurs a side effect: it also clears the flags in NETIF_F_ALL_FOR_ALL on the master device, which can cause issues such as the inability to enable the nocache copy feature on the bonding driver.

The fix is to include NETIF_F_ALL_FOR_ALL in the update mask, thereby preventing it from being cleared.

Fixes: b0ce3508b25e ("bonding: allow TSO being set on bonding master")
Signed-off-by: Di Zhu <zhud@hygon.cn>
Link: https://patch.msgid.link/20251224012224.56185-1-zhud@hygon.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-04 | Merge tag 'core_urgent_for_v6.19_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -2/+2
Pull core entry fix from Borislav Petkov:
- Make sure clang inlines trivial local_irq_* helpers

* tag 'core_urgent_for_v6.19_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Always inline local_irq_{enable,disable}_exit_to_user()
2026-01-03 | bpf: Replace __opt annotation with __nullable for kfuncs | Puranjay Mohan | 1 | -1/+1
The __opt annotation was originally introduced specifically for buffer/size argument pairs in bpf_dynptr_slice() and bpf_dynptr_slice_rdwr(), allowing the buffer pointer to be NULL while still validating the size as a constant. The __nullable annotation serves the same purpose but is more general and is already used throughout the BPF subsystem for raw tracepoints, struct_ops, and other kfuncs. This patch unifies the two annotations by replacing __opt with __nullable. The key change is in the verifier's get_kfunc_ptr_arg_type() function, where mem/size pair detection is now performed before the nullable check. This ensures that buffer/size pairs are correctly classified as KF_ARG_PTR_TO_MEM_SIZE even when the buffer is nullable, while adding an !arg_mem_size condition to the nullable check prevents interference with mem/size pair handling. When processing KF_ARG_PTR_TO_MEM_SIZE arguments, the verifier now uses is_kfunc_arg_nullable() instead of the removed is_kfunc_arg_optional() to determine whether to skip size validation for NULL buffers. This is the first documentation added for the __nullable annotation, which has been in use since it was introduced but was previously undocumented. No functional changes to verifier behavior - nullable buffer/size pairs continue to work exactly as before. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102221513.1961781-1-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-03 | bpf: syscall: Introduce memcg enter/exit helpers | Puranjay Mohan | 1 | -0/+15
Introduce bpf_map_memcg_enter() and bpf_map_memcg_exit() helpers to reduce code duplication in memcg context management. bpf_map_memcg_enter() gets the memcg from the map, sets it as active, and returns both the previous and the now active memcg. bpf_map_memcg_exit() restores the previous active memcg and releases the reference obtained during enter. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102200230.25168-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
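The enter/exit pairing described here is a save-and-restore pattern, sketched below in user space. The structures, the global `active_memcg`, and the explicit reference counter are hypothetical stand-ins; the kernel helpers operate on struct bpf_map and struct mem_cgroup and use the kernel's own reference and active-memcg primitives.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects. */
struct memcg {
	int refs;
};

struct map {
	struct memcg *memcg;
};

/* Stands in for the per-task "active memcg" state. */
static struct memcg *active_memcg;

/*
 * Take a reference on the map's memcg, make it active, and report both
 * the previously active memcg and the newly active one.
 */
static void map_memcg_enter(struct map *map, struct memcg **old_memcg,
			    struct memcg **new_memcg)
{
	*new_memcg = map->memcg;
	if (*new_memcg)
		(*new_memcg)->refs++;
	*old_memcg = active_memcg;
	active_memcg = *new_memcg;
}

/* Restore the previous memcg and drop the reference taken on enter. */
static void map_memcg_exit(struct memcg *old_memcg, struct memcg *new_memcg)
{
	active_memcg = old_memcg;
	if (new_memcg)
		new_memcg->refs--;
}
```

A caller brackets its allocation with the two helpers, passing the pointers returned by enter to exit, so the get/set/restore/put sequence is written once instead of at every allocation site.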
2026-01-02 | Merge tag 'io_uring-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux | Linus Torvalds | 1 | -1/+7
Pull io_uring fixes from Jens Axboe:
- Removed dead argument length for io_uring_validate_mmap_request()
- Use GFP_NOWAIT for overflow CQEs on legacy ring setups rather than GFP_ATOMIC, which makes it play nicer with memcg limits
- Fix a potential circular locking issue with tctx node removal and exec based cancelations

* tag 'io_uring-6.19-20260102' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
  io_uring/memmap: drop unused sz param in io_uring_validate_mmap_request()
  io_uring/tctx: add separate lock for list of tctx's in ctx
  io_uring: use GFP_NOWAIT for overflow CQEs on legacy rings
2026-01-02 | bpf: Remove redundant KF_TRUSTED_ARGS flag from all kfuncs | Puranjay Mohan | 2 | -3/+2
Now that KF_TRUSTED_ARGS is the default for all kfuncs, remove the explicit KF_TRUSTED_ARGS flag from all kfunc definitions and remove the flag itself. Acked-by: Eduard Zingerman <eddyz87@gmail.com> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20260102180038.2708325-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-01 | io_uring/tctx: add separate lock for list of tctx's in ctx | Jens Axboe | 1 | -1/+7
ctx->tctx_list holds the tasks using this ring, and it's currently protected by the normal ctx->uring_lock. However, this can cause a circular locking issue, as reported by syzbot, where cancelations off exec end up needing to remove an entry from this list: ====================================================== WARNING: possible circular locking dependency detected syzkaller #0 Tainted: G L ------------------------------------------------------ syz.0.9999/12287 is trying to acquire lock: ffff88805851c0a8 (&ctx->uring_lock){+.+.}-{4:4}, at: io_uring_del_tctx_node+0xf0/0x2c0 io_uring/tctx.c:179 but task is already holding lock: ffff88802db5a2e0 (&sig->cred_guard_mutex){+.+.}-{4:4}, at: prepare_bprm_creds fs/exec.c:1360 [inline] ffff88802db5a2e0 (&sig->cred_guard_mutex){+.+.}-{4:4}, at: bprm_execve+0xb9/0x1400 fs/exec.c:1733 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&sig->cred_guard_mutex){+.+.}-{4:4}: __mutex_lock_common kernel/locking/mutex.c:614 [inline] __mutex_lock+0x187/0x1350 kernel/locking/mutex.c:776 proc_pid_attr_write+0x547/0x630 fs/proc/base.c:2837 vfs_write+0x27e/0xb30 fs/read_write.c:684 ksys_write+0x145/0x250 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #1 (sb_writers#3){.+.+}-{0:0}: percpu_down_read_internal include/linux/percpu-rwsem.h:53 [inline] percpu_down_read_freezable include/linux/percpu-rwsem.h:83 [inline] __sb_start_write include/linux/fs/super.h:19 [inline] sb_start_write+0x4d/0x1c0 include/linux/fs/super.h:125 mnt_want_write+0x41/0x90 fs/namespace.c:499 open_last_lookups fs/namei.c:4529 [inline] path_openat+0xadd/0x3dd0 fs/namei.c:4784 do_filp_open+0x1fa/0x410 fs/namei.c:4814 io_openat2+0x3e0/0x5c0 io_uring/openclose.c:143 __io_issue_sqe+0x181/0x4b0 io_uring/io_uring.c:1792 io_issue_sqe+0x165/0x1060 io_uring/io_uring.c:1815 io_queue_sqe 
io_uring/io_uring.c:2042 [inline] io_submit_sqe io_uring/io_uring.c:2320 [inline] io_submit_sqes+0xbf4/0x2140 io_uring/io_uring.c:2434 __do_sys_io_uring_enter io_uring/io_uring.c:3280 [inline] __se_sys_io_uring_enter+0x2e0/0x2b60 io_uring/io_uring.c:3219 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #0 (&ctx->uring_lock){+.+.}-{4:4}: check_prev_add kernel/locking/lockdep.c:3165 [inline] check_prevs_add kernel/locking/lockdep.c:3284 [inline] validate_chain kernel/locking/lockdep.c:3908 [inline] __lock_acquire+0x15a6/0x2cf0 kernel/locking/lockdep.c:5237 lock_acquire+0x107/0x340 kernel/locking/lockdep.c:5868 __mutex_lock_common kernel/locking/mutex.c:614 [inline] __mutex_lock+0x187/0x1350 kernel/locking/mutex.c:776 io_uring_del_tctx_node+0xf0/0x2c0 io_uring/tctx.c:179 io_uring_clean_tctx+0xd4/0x1a0 io_uring/tctx.c:195 io_uring_cancel_generic+0x6ca/0x7d0 io_uring/cancel.c:646 io_uring_task_cancel include/linux/io_uring.h:24 [inline] begin_new_exec+0x10ed/0x2440 fs/exec.c:1131 load_elf_binary+0x9f8/0x2d70 fs/binfmt_elf.c:1010 search_binary_handler fs/exec.c:1669 [inline] exec_binprm fs/exec.c:1701 [inline] bprm_execve+0x92e/0x1400 fs/exec.c:1753 do_execveat_common+0x510/0x6a0 fs/exec.c:1859 do_execve fs/exec.c:1933 [inline] __do_sys_execve fs/exec.c:2009 [inline] __se_sys_execve fs/exec.c:2004 [inline] __x64_sys_execve+0x94/0xb0 fs/exec.c:2004 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f other info that might help us debug this: Chain exists of: &ctx->uring_lock --> sb_writers#3 --> &sig->cred_guard_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&sig->cred_guard_mutex); lock(sb_writers#3); lock(&sig->cred_guard_mutex); lock(&ctx->uring_lock); *** DEADLOCK *** 1 lock held by syz.0.9999/12287: #0: ffff88802db5a2e0 
(&sig->cred_guard_mutex){+.+.}-{4:4}, at: prepare_bprm_creds fs/exec.c:1360 [inline] #0: ffff88802db5a2e0 (&sig->cred_guard_mutex){+.+.}-{4:4}, at: bprm_execve+0xb9/0x1400 fs/exec.c:1733 stack backtrace: CPU: 0 UID: 0 PID: 12287 Comm: syz.0.9999 Tainted: G L syzkaller #0 PREEMPT(full) Tainted: [L]=SOFTLOCKUP Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 print_circular_bug+0x2e2/0x300 kernel/locking/lockdep.c:2043 check_noncircular+0x12e/0x150 kernel/locking/lockdep.c:2175 check_prev_add kernel/locking/lockdep.c:3165 [inline] check_prevs_add kernel/locking/lockdep.c:3284 [inline] validate_chain kernel/locking/lockdep.c:3908 [inline] __lock_acquire+0x15a6/0x2cf0 kernel/locking/lockdep.c:5237 lock_acquire+0x107/0x340 kernel/locking/lockdep.c:5868 __mutex_lock_common kernel/locking/mutex.c:614 [inline] __mutex_lock+0x187/0x1350 kernel/locking/mutex.c:776 io_uring_del_tctx_node+0xf0/0x2c0 io_uring/tctx.c:179 io_uring_clean_tctx+0xd4/0x1a0 io_uring/tctx.c:195 io_uring_cancel_generic+0x6ca/0x7d0 io_uring/cancel.c:646 io_uring_task_cancel include/linux/io_uring.h:24 [inline] begin_new_exec+0x10ed/0x2440 fs/exec.c:1131 load_elf_binary+0x9f8/0x2d70 fs/binfmt_elf.c:1010 search_binary_handler fs/exec.c:1669 [inline] exec_binprm fs/exec.c:1701 [inline] bprm_execve+0x92e/0x1400 fs/exec.c:1753 do_execveat_common+0x510/0x6a0 fs/exec.c:1859 do_execve fs/exec.c:1933 [inline] __do_sys_execve fs/exec.c:2009 [inline] __se_sys_execve fs/exec.c:2004 [inline] __x64_sys_execve+0x94/0xb0 fs/exec.c:2004 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ff3a8b8f749 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 
002b:00007ff3a9a97038 EFLAGS: 00000246 ORIG_RAX: 000000000000003b RAX: ffffffffffffffda RBX: 00007ff3a8de5fa0 RCX: 00007ff3a8b8f749 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000200000000400 RBP: 00007ff3a8c13f91 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ff3a8de6038 R14: 00007ff3a8de5fa0 R15: 00007ff3a8f0fa28 </TASK> Add a separate lock just for the tctx_list, tctx_lock. This can nest under ->uring_lock, where necessary, and be used separately for list manipulation. For the cancelation off exec side, this removes the need to grab ->uring_lock, hence fixing the circular locking dependency. Reported-by: syzbot+b0e3b77ffaa8a4067ce5@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
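The report above boils down to a cycle in a lock-order graph. The following user-space sketch is not lockdep and is not taken from the patch. It is a toy model, with hypothetical helpers (add_dependency, reachable), that records "held, then acquired" edges and rejects an edge that would close a cycle. The lock names come from the report; everything else is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock classes named in the report, plus the new tctx_lock. */
enum lock_class { URING_LOCK, SB_WRITERS, CRED_GUARD_MUTEX, TCTX_LOCK, NR_LOCKS };

/* edge[a][b] means "b was acquired while a was held". */
struct lock_graph {
	bool edge[NR_LOCKS][NR_LOCKS];
};

/* Depth-first search: can 'to' be reached from 'from' along recorded edges? */
static bool reachable(const struct lock_graph *g, int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < NR_LOCKS; next++) {
		if (g->edge[from][next] && !seen[next] &&
		    reachable(g, next, to, seen))
			return true;
	}
	return false;
}

/*
 * Record "wanted is taken while held is held". Returns false, and records
 * nothing, if wanted already leads back to held (a circular dependency).
 */
static bool add_dependency(struct lock_graph *g, int held, int wanted)
{
	bool seen[NR_LOCKS] = { false };

	if (reachable(g, wanted, held, seen))
		return false;
	g->edge[held][wanted] = true;
	return true;
}

/* Replays the chain from the report, then the exec path before and after the fix. */
static void lock_order_demo(void)
{
	struct lock_graph g = { 0 };

	assert(add_dependency(&g, URING_LOCK, SB_WRITERS));
	assert(add_dependency(&g, SB_WRITERS, CRED_GUARD_MUTEX));
	/* Before: exec holds cred_guard_mutex and wants uring_lock, closing the cycle. */
	assert(!add_dependency(&g, CRED_GUARD_MUTEX, URING_LOCK));
	/* After: exec only needs tctx_lock, which may also nest under uring_lock. */
	assert(add_dependency(&g, CRED_GUARD_MUTEX, TCTX_LOCK));
	assert(add_dependency(&g, URING_LOCK, TCTX_LOCK));
}
```

Calling lock_order_demo() replays both scenarios. The rejected edge corresponds to the "DEADLOCK" scenario printed in the report.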
2025-12-31 Merge tag 'platform-drivers-x86-v6.19-2' of ↵ (Linus Torvalds; files: 1; lines: -8/+9)
git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform driver fixes from Ilpo Järvinen: - alienware-wmi-wmax: Area-51, x16, and 16X Aurora laptops support - asus-armoury: - Fix FA507R PPT data - Add TDP data for more laptop models - asus-nb-wmi: Asus Zenbook 14 display toggle key support - dell-lis3lv02d: Dell Latitude 5400 support - hp-bioscfg: Fix out-of-bounds array access in ACPI package parsing - ibm_rtl: Fix EBDA signature search pointer arithmetic - ideapad-laptop: Reassign KEY_CUT to KEY_SELECTIVE_SCREENSHOT - intel/pmt: - Fix kobject memory leak on init failure - Use valid pointers on error handling path - intel/vsec: Correct kernel doc comments - mellanox: mlxbf-pmc: Fix event names - msi-laptop: Add sysfs_remove_group() - samsung-galaxybook: Do not cast pointer to a shorter type - think-lmi: WMI certificate thumbprint support for ThinkCenter - uniwill: Tuxedo Book BA15 Gen10 support * tag 'platform-drivers-x86-v6.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (22 commits) platform/x86: asus-armoury: add support for G835LW platform/x86: asus-armoury: fix ppt data for FA507R platform/x86/intel/pmt/discovery: use valid device pointer in dev_err_probe platform/x86: hp-bioscfg: Fix out-of-bounds array access in ACPI package parsing platform/x86: asus-armoury: add support for G615LR platform/x86: asus-armoury: add support for FA608UM platform/x86: asus-armoury: add support for GA403WR platform/x86: asus-armoury: add support for GU605CR platform/x86: ideapad-laptop: Reassign KEY_CUT to KEY_SELECTIVE_SCREENSHOT platform/x86: samsung-galaxybook: Fix problematic pointer cast platform/x86/intel/pmt: Fix kobject memory leak on init failure platform/x86/intel/vsec: correct kernel-doc comments platform/x86: ibm_rtl: fix EBDA signature search pointer arithmetic platform/x86: msi-laptop: add missing sysfs_remove_group() platform/x86: think-lmi: Add WMI certificate thumbprint support for ThinkCenter 
platform/x86: dell-lis3lv02d: Add Latitude 5400 platform/mellanox: mlxbf-pmc: Remove trailing whitespaces from event names platform/x86: asus-nb-wmi: Add keymap for display toggle platform/x86/uniwill: Add TUXEDO Book BA15 Gen10 platform/x86: alienware-wmi-wmax: Add support for Alienware 16X Aurora ...
2025-12-31 Merge tag 'vfio-v6.19-rc4' of https://github.com/awilliam/linux-vfio (Linus Torvalds; files: 1; lines: -1/+9)
Pull VFIO fixes from Alex Williamson: - Restrict ROM access to dword to resolve a regression introduced with qword access seen on some Intel NICs. Update VGA region access to the same given lack of precedent for 64-bit users (Kevin Tian) - Fix missing .get_region_info_caps callback in the xe-vfio-pci variant driver due to integration through the DRM tree (Michal Wajdeczko) - Add aligned 64-bit access macros to tools/include/linux/types.h, allowing removal of uapi/linux/types.h includes from various vfio selftests, resolving redefinition warnings for integration with KVM selftests (David Matlack) - Fix error path memory leak in pds-vfio-pci variant driver (Zilin Guan) - Fix error path use-after-free in xe-vfio-pci variant driver (Alper Ak) * tag 'vfio-v6.19-rc4' of https://github.com/awilliam/linux-vfio: vfio/xe: Fix use-after-free in xe_vfio_pci_alloc_file() vfio/pds: Fix memory leak in pds_vfio_dirty_enable() vfio: selftests: Drop <uapi/linux/types.h> includes tools include: Add definitions for __aligned_{l,b}e64 vfio/xe: Add default handler for .get_region_info_caps vfio/pci: Disable qword access to the VGA region vfio/pci: Disable qword access to the PCI ROM bar
2025-12-29 Merge tag 'mm-hotfixes-stable-2025-12-28-21-50' of ↵ (Linus Torvalds; files: 6; lines: -8/+27)
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "27 hotfixes. 12 are cc:stable, 18 are MM. There's a patch series from Jiayuan Chen which fixes some issues with KASAN and vmalloc. Apart from that it's the usual shower of singletons - please see the respective changelogs for details" * tag 'mm-hotfixes-stable-2025-12-28-21-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (27 commits) mm/ksm: fix pte_unmap_unlock of wrong address in break_ksm_pmd_entry mm/page_owner: fix memory leak in page_owner_stack_fops->release() mm/memremap: fix spurious large folio warning for FS-DAX MAINTAINERS: notify the "Device Memory" community of memory hotplug changes sparse: update MAINTAINERS info mm/page_alloc: report 1 as zone_batchsize for !CONFIG_MMU mm: consider non-anon swap cache folios in folio_expected_ref_count() rust: maple_tree: rcu_read_lock() in destructor to silence lockdep mm: memcg: fix unit conversion for K() macro in OOM log mm: fixup pfnmap memory failure handling to use pgoff tools/mm/page_owner_sort: fix timestamp comparison for stable sorting selftests/mm: fix thread state check in uffd-unit-tests kernel/kexec: fix IMA when allocation happens in CMA area kernel/kexec: change the prototype of kimage_map_segment() MAINTAINERS: add ABI headers to KHO and LIVE UPDATE .mailmap: remove one of the entries for WangYuli mm/damon/vaddr: fix missing pte_unmap_unlock in damos_va_migrate_pmd_entry() MAINTAINERS: update one straggling entry for Bartosz Golaszewski mm/page_alloc: change all pageblocks migrate type on coalescing mm: leafops.h: correct kernel-doc function param. names ...
2025-12-27 Merge tag 'driver-core-6.19-rc3' of ↵ (Linus Torvalds; files: 1; lines: -0/+1)
git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core Pull driver core fixes from Danilo Krummrich: - Introduce DMA Rust helpers to avoid build errors when !CONFIG_HAS_DMA - Remove unnecessary (and hence incorrect) endian conversion in the Rust PCI driver sample code - Fix memory leak in the unwind path of debugfs_change_name() - Support non-const struct software_node pointers in SOFTWARE_NODE_REFERENCE(), after introducing _Generic() - Avoid NULL pointer dereference in the unwind path of simple_xattrs_free() * tag 'driver-core-6.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core: fs/kernfs: null-ptr deref in simple_xattrs_free() software node: Also support referencing non-constant software nodes debugfs: Fix memleak in debugfs_change_name(). samples: rust: fix endianness issue in rust_driver_pci rust: dma: add helpers for architectures without CONFIG_HAS_DMA
2025-12-24 virtio_features: make it self-contained (Michael S. Tsirkin; files: 1; lines: -0/+2)
virtio_features.h uses WARN_ON_ONCE and memset so it must include linux/bug.h and linux/string.h Message-ID: <579986aa9b8d023844990d2a0e267382f8ad85d5.1764873799.git.mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-12-24 virtio: make it self-contained (Michael S. Tsirkin; files: 1; lines: -0/+2)
virtio.h uses struct module, add a forward declaration to make the header self-contained. Message-ID: <9171b5cac60793eb59ab044c96ee038bf1363bee.1764873799.git.mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
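A forward declaration is enough whenever a header only stores or passes a pointer to a type. A minimal sketch of that idea, where struct demo_driver and its helper are hypothetical stand-ins and not the contents of virtio.h:

```c
#include <assert.h>
#include <stddef.h>

/* Forward declaration: the full definition is not needed for pointers. */
struct module;

/* Hypothetical structure that only holds a pointer to the incomplete type. */
struct demo_driver {
	const char *name;
	struct module *owner;
};

/* Compiles without ever seeing the definition of struct module. */
static int demo_driver_has_owner(const struct demo_driver *drv)
{
	return drv->owner != NULL;
}

static void forward_decl_demo(void)
{
	struct demo_driver drv = { .name = "demo", .owner = NULL };

	assert(!demo_driver_has_owner(&drv));
}
```

Dereferencing owner or taking sizeof(struct module) would still require the real definition, which is why a forward declaration suits headers that only pass the pointer around.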
2025-12-24 vfio/pci: Disable qword access to the PCI ROM bar (Kevin Tian; files: 1; lines: -1/+9)
Commit 2b938e3db335 ("vfio/pci: Enable iowrite64 and ioread64 for vfio pci") enables qword access to the PCI bar resources. However, certain devices (e.g. Intel X710) have been observed to misbehave on qword accesses to the ROM bar, e.g. triggering PCI AER errors. This is triggered by QEMU, which caches the ROM content by simply doing a pread() of the remaining size until it gets the full contents. The other bars would only perform operations at the same access width as their guest drivers. Instead of trying to identify all broken devices, universally disable qword access to the ROM bar, i.e. go back to the old way, which worked reliably for years. Reported-by: Farrah Chen <farrah.chen@intel.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220740 Fixes: 2b938e3db335 ("vfio/pci: Enable iowrite64 and ioread64 for vfio pci") Cc: stable@vger.kernel.org Signed-off-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Link: https://lore.kernel.org/r/20251218081650.555015-2-kevin.tian@intel.com Signed-off-by: Alex Williamson <alex@shazbot.org>
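The effect of disabling qword access can be pictured with a small width-selection routine. This is a sketch under assumptions, not the vfio code: pick_access_width() and count_accesses() are hypothetical names, and the policy (widest naturally aligned access that fits) is assumed for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: choose the widest naturally aligned access for the
 * next chunk. Qword (8-byte) accesses are used only when allow_qword is set.
 */
static size_t pick_access_width(uint64_t offset, size_t remaining, bool allow_qword)
{
	if (allow_qword && remaining >= 8 && (offset % 8) == 0)
		return 8;
	if (remaining >= 4 && (offset % 4) == 0)
		return 4;
	if (remaining >= 2 && (offset % 2) == 0)
		return 2;
	return 1;
}

/* Count the accesses needed to read 'len' bytes starting at 'offset'. */
static size_t count_accesses(uint64_t offset, size_t len, bool allow_qword)
{
	size_t n = 0;

	while (len > 0) {
		size_t width = pick_access_width(offset, len, allow_qword);

		offset += width;
		len -= width;
		n++;
	}
	return n;
}

static void access_width_demo(void)
{
	/* A 64-byte aligned read: 8 qword accesses, or 16 dword accesses. */
	assert(count_accesses(0, 64, true) == 8);
	assert(count_accesses(0, 64, false) == 16);
}
```

With allow_qword cleared for the ROM bar, a large pread() is simply served with dword accesses, which is the behaviour the commit message describes as having worked reliably.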
2025-12-23 bpf: arena: make arena kfuncs any context safe (Puranjay Mohan; files: 1; lines: -0/+16)
Make arena related kfuncs any context safe by the following changes: bpf_arena_alloc_pages() and bpf_arena_reserve_pages(): Replace the usage of the mutex with a rqspinlock for range tree and use kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages from any context. apply_range_set/clear_cb() with apply_to_page_range() has already made populating the vm_area in bpf_arena_alloc_pages() any context safe. bpf_arena_free_pages(): defer the main logic to a workqueue if it is called from a non-sleepable context. specialize_kfunc() is used to replace the sleepable arena_free_pages() with bpf_arena_free_pages_non_sleepable() when the verifier detects the call is from a non-sleepable context. In the non-sleepable case, arena_free_pages() queues the address and the page count to be freed to a lock-less list of struct arena_free_spans and raises an irq_work. The irq_work handler calls schedule_work() as it is safe to be called from irq context. arena_free_worker() (the work queue handler) iterates these spans and clears ptes, flushes tlb, zaps pages, and calls __free_page(). Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251222195022.431211-4-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
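The deferral described above relies on a list that producers can push to without taking a lock and that the worker detaches in one step. The following user-space sketch shows that pattern with C11 atomics. It is not the kernel's llist implementation; struct free_span, span_push() and span_take_all() are hypothetical names, and the irq_work and workqueue stages are reduced to a direct call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical record of one deferred free request. */
struct free_span {
	struct free_span *next;
	unsigned long addr;
	unsigned long page_cnt;
};

struct span_list {
	_Atomic(struct free_span *) first;
};

/* Producer side: push without taking a lock (safe for concurrent pushers). */
static void span_push(struct span_list *list, struct free_span *span)
{
	struct free_span *old = atomic_load_explicit(&list->first,
						     memory_order_relaxed);

	do {
		span->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&list->first, &old, span,
							memory_order_release,
							memory_order_relaxed));
}

/* Consumer side: detach the whole list in one atomic exchange. */
static struct free_span *span_take_all(struct span_list *list)
{
	return atomic_exchange_explicit(&list->first, (struct free_span *)NULL,
					memory_order_acquire);
}

/* Stand-in for the worker: walk the detached spans and total the pages. */
static unsigned long span_drain(struct span_list *list)
{
	unsigned long pages = 0;

	for (struct free_span *s = span_take_all(list); s; s = s->next)
		pages += s->page_cnt;
	return pages;
}

static void span_list_demo(void)
{
	struct span_list list;
	struct free_span a = { NULL, 0x1000, 1 }, b = { NULL, 0x4000, 3 };

	atomic_init(&list.first, NULL);
	span_push(&list, &a);
	span_push(&list, &b);
	assert(span_drain(&list) == 4);  /* both spans seen once */
	assert(span_drain(&list) == 0);  /* list is empty afterwards */
}
```

Detaching the entire list at once keeps the consumer simple: it owns every span it took, so the real work (clearing ptes, flushing, freeing) can run later in a context that is allowed to sleep.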
2025-12-23 mm: consider non-anon swap cache folios in folio_expected_ref_count() (Bijan Tabatabai; files: 1; lines: -4/+4)
Currently, folio_expected_ref_count() only adds references for the swap cache if the folio is anonymous. However, according to the comment above the definition of PG_swapcache in enum pageflags, shmem folios can also have PG_swapcache set. This patch makes sure references for the swap cache are added if folio_test_swapcache(folio) is true. This issue was found when trying to hot-unplug memory in a QEMU/KVM virtual machine. When initiating hot-unplug when most of the guest memory is allocated, hot-unplug hangs partway through removal due to migration failures. The following message would be printed several times, and would be printed again about every five seconds: [ 49.641309] migrating pfn b12f25 failed ret:7 [ 49.641310] page: refcount:2 mapcount:0 mapping:0000000033bd8fe2 index:0x7f404d925 pfn:0xb12f25 [ 49.641311] aops:swap_aops [ 49.641313] flags: 0x300000000030508(uptodate|active|owner_priv_1|reclaim|swapbacked|node=0|zone=3) [ 49.641314] raw: 0300000000030508 ffffed312c4bc908 ffffed312c4bc9c8 0000000000000000 [ 49.641315] raw: 00000007f404d925 00000000000c823b 00000002ffffffff 0000000000000000 [ 49.641315] page dumped because: migration failure When debugging this, I found that these migration failures were due to __migrate_folio() returning -EAGAIN for a small set of folios because the expected reference count it calculates via folio_expected_ref_count() is one less than the actual reference count of the folios. Furthermore, all of the affected folios were not anonymous, but had the PG_swapcache flag set, inspiring this patch. After applying this patch, the memory hot-unplug behaves as expected. I tested this on a machine running Ubuntu 24.04 with kernel version 6.8.0-90-generic and 64GB of memory. The guest VM is managed by libvirt and runs Ubuntu 24.04 with kernel version 6.18 (though the head of the mm-unstable branch as of Dec 16, 2025 was also tested and behaves the same) and 48GB of memory. The libvirt XML definition for the VM can be found at [1]. 
CONFIG_MHP_DEFAULT_ONLINE_TYPE_ONLINE_MOVABLE is set in the guest kernel so the hot-pluggable memory is automatically onlined. Below are the steps to reproduce this behavior: 1) Define and start a virtual machine host$ virsh -c qemu:///system define ./test_vm.xml # test_vm.xml from [1] host$ virsh -c qemu:///system start test_vm 2) Set up swap in the guest guest$ sudo fallocate -l 32G /swapfile guest$ sudo chmod 0600 /swapfile guest$ sudo mkswap /swapfile guest$ sudo swapon /swapfile 3) Use alloc_data [2] to allocate most of the remaining guest memory guest$ ./alloc_data 45 4) In a separate guest terminal, monitor the amount of used memory guest$ watch -n1 free -h 5) When alloc_data has finished allocating, initiate the memory hot-unplug using the provided xml file [3] host$ virsh -c qemu:///system detach-device test_vm ./remove.xml --live After initiating the memory hot-unplug, you should see the amount of available memory in the guest decrease, and the amount of used swap data increase. If everything works as expected, when all of the memory is unplugged, there should be around 8.5-9GB of data in swap. If the unplugging is unsuccessful, the amount of used swap data will settle below that. If that happens, you should be able to see log messages in dmesg similar to the one posted above. 
Link: https://lkml.kernel.org/r/20251216200727.2360228-1-bijan311@gmail.com Link: https://github.com/BijanT/linux_patch_files/blob/main/test_vm.xml [1] Link: https://github.com/BijanT/linux_patch_files/blob/main/alloc_data.c [2] Link: https://github.com/BijanT/linux_patch_files/blob/main/remove.xml [3] Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation") Signed-off-by: Bijan Tabatabai <bijan311@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shivank Garg <shivankg@amd.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kairui Song <ryncsn@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
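The accounting change in this commit can be reduced to one condition. The sketch below is a toy model, not folio_expected_ref_count(): struct toy_folio and both helpers are hypothetical, only the swap cache contribution is modelled, and "one reference per page" is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a folio: only the fields this sketch needs. */
struct toy_folio {
	bool anon;       /* anonymous memory */
	bool swapcache;  /* PG_swapcache is set */
	int nr_pages;    /* pages in the folio */
};

/* Behaviour described as "currently": counted only for anonymous folios. */
static int swapcache_refs_before(const struct toy_folio *f)
{
	return (f->anon && f->swapcache) ? f->nr_pages : 0;
}

/* Behaviour after the patch: counted whenever the flag is set. */
static int swapcache_refs_after(const struct toy_folio *f)
{
	return f->swapcache ? f->nr_pages : 0;
}

static void swapcache_refs_demo(void)
{
	/* A single-page shmem folio sitting in the swap cache. */
	struct toy_folio shmem = { .anon = false, .swapcache = true, .nr_pages = 1 };
	/* Anonymous folios are unaffected by the change. */
	struct toy_folio anon = { .anon = true, .swapcache = true, .nr_pages = 1 };

	assert(swapcache_refs_before(&shmem) == 0);
	assert(swapcache_refs_after(&shmem) == 1);
	assert(swapcache_refs_before(&anon) == swapcache_refs_after(&anon));
}
```

For the shmem folio the old rule comes out one reference short, which matches the "one less than the actual reference count" observation in the commit message.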
2025-12-23 mm: fixup pfnmap memory failure handling to use pgoff (Ankit Agrawal; files: 1; lines: -0/+2)
The memory failure handling implementation for the PFNMAP memory with no struct pages is faulty. The VA of the mapping is determined based on the PFN. It should instead be based on the file mapping offset. At the occurrence of poison, the memory_failure_pfn is triggered on the poisoned PFN. Introduce a callback function that allows mm to translate the PFN to the corresponding file page offset. The kernel module using the registration API must implement the callback function and provide the translation. The translated value is then used to determine the VA information and send the SIGBUS to the usermode process mapped to the poisoned PFN. The callback is also useful for the driver to be notified of the poisoned PFN, which it may then track. Link: https://lkml.kernel.org/r/20251211070603.338701-2-ankita@nvidia.com Fixes: 2ec41967189c ("mm: handle poisoning of pfn without struct pages") Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Kevin Tian <kevin.tian@intel.com> Cc: Matthew R. Ochs <mochs@nvidia.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Neo Jia <cjia@nvidia.com> Cc: Vikram Sethi <vsethi@nvidia.com> Cc: Yishai Hadas <yishaih@nvidia.com> Cc: Zhi Wang <zhiw@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
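The two-step translation (PFN to file page offset in the driver, file page offset to user VA in mm) can be sketched as follows. All types and function names here are hypothetical and simplified; the patch's actual callback signature is not shown in this log. A 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Hypothetical driver state: a PFN range exposed through a file mapping. */
struct toy_region {
	uint64_t base_pfn;    /* first PFN of the device memory */
	uint64_t nr_pages;    /* size of the range in pages */
	uint64_t base_pgoff;  /* file page offset where the range starts */
};

/* Minimal stand-in for the VMA fields used below. */
struct toy_vma {
	uint64_t vm_start;
	uint64_t vm_end;
	uint64_t vm_pgoff;
};

/* Driver-side callback: translate a poisoned PFN to a file page offset. */
static int toy_pfn_to_pgoff(const struct toy_region *r, uint64_t pfn,
			    uint64_t *pgoff)
{
	if (pfn < r->base_pfn || pfn >= r->base_pfn + r->nr_pages)
		return -1;  /* PFN does not belong to this region */
	*pgoff = r->base_pgoff + (pfn - r->base_pfn);
	return 0;
}

/* mm side: derive the user VA from the file page offset, not from the PFN. */
static int toy_pgoff_to_va(const struct toy_vma *vma, uint64_t pgoff,
			   uint64_t *va)
{
	uint64_t addr;

	if (pgoff < vma->vm_pgoff)
		return -1;
	addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << TOY_PAGE_SHIFT);
	if (addr >= vma->vm_end)
		return -1;  /* this VMA does not map that page */
	*va = addr;
	return 0;
}

static void pfn_to_va_demo(void)
{
	struct toy_region region = { 0x1000, 16, 0x20 };
	struct toy_vma vma = { 0x7f0000000000ULL, 0x7f0000010000ULL, 0x20 };
	uint64_t pgoff, va;

	assert(toy_pfn_to_pgoff(&region, 0x1003, &pgoff) == 0);
	assert(pgoff == 0x23);
	assert(toy_pgoff_to_va(&vma, pgoff, &va) == 0);
	assert(va == 0x7f0000003000ULL);
}
```

Because only the driver knows how its PFNs are laid out in the file, the first step has to live in the driver, while the second step is the usual VMA arithmetic.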
2025-12-23 kernel/kexec: change the prototype of kimage_map_segment() (Pingfan Liu; files: 1; lines: -2/+2)
The kexec segment index will be required to extract the corresponding information for that segment in kimage_map_segment(). Additionally, kexec_segment already holds the kexec relocation destination address and size. Therefore, the prototype of kimage_map_segment() can be changed. Link: https://lkml.kernel.org/r/20251216014852.8737-1-piliu@redhat.com Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") Signed-off-by: Pingfan Liu <piliu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Mimi Zohar <zohar@linux.ibm.com> Cc: Roberto Sassu <roberto.sassu@huawei.com> Cc: Alexander Graf <graf@amazon.com> Cc: Steven Chen <chenste@linux.microsoft.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23 mm: leafops.h: correct kernel-doc function param. names (Randy Dunlap; files: 1; lines: -2/+2)
Modify the kernel-doc function parameter names to prevent kernel-doc warnings: Warning: include/linux/leafops.h:135 function parameter 'entry' not described in 'leafent_type' Warning: include/linux/leafops.h:540 function parameter 'pte' not described in 'pte_is_uffd_marker' Link: https://lkml.kernel.org/r/20251214201517.2187051-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23 kasan: refactor pcpu kasan vmalloc unpoison (Maciej Wieczor-Retman; files: 1; lines: -0/+15)
A KASAN tag mismatch, possibly causing a kernel panic, can be observed on systems with a tag-based KASAN enabled and with multiple NUMA nodes. It was reported on arm64 and reproduced on x86. It can be explained in the following points: 1. There can be more than one virtual memory chunk. 2. Chunk's base address has a tag. 3. The base address points at the first chunk and thus inherits the tag of the first chunk. 4. The subsequent chunks will be accessed with the tag from the first chunk. 5. Thus, the subsequent chunks need to have their tag set to match that of the first chunk. Refactor code by reusing __kasan_unpoison_vmalloc in a new helper in preparation for the actual fix. Link: https://lkml.kernel.org/r/eb61d93b907e262eefcaa130261a08bcb6c5ce51.1764874575.git.m.wieczorretman@pm.me Fixes: 1d96320f8d53 ("kasan, vmalloc: add vmalloc tagging for SW_TAGS") Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Cc: Kees Cook <kees@kernel.org> Cc: Marco Elver <elver@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> [6.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
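The five points above can be traced with a toy model of tagged pointers. This sketch is not KASAN code: it assumes the tag lives in the top byte of a 64-bit address, and struct toy_chunk, derive() and access_ok() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT 56  /* assumption: tag kept in the top byte of the pointer */

static uint64_t set_tag(uint64_t addr, uint8_t tag)
{
	return (addr & ~(0xffULL << TAG_SHIFT)) | ((uint64_t)tag << TAG_SHIFT);
}

static uint8_t get_tag(uint64_t addr)
{
	return (uint8_t)(addr >> TAG_SHIFT);
}

/* Hypothetical chunk: an untagged start address plus the tag of its memory. */
struct toy_chunk {
	uint64_t start;
	uint8_t mem_tag;
};

/* Derive a pointer as in points 3 and 4: tagged base address plus offset. */
static uint64_t derive(uint64_t tagged_base, uint64_t offset)
{
	return tagged_base + offset;
}

/* An access passes the check when the pointer tag equals the memory tag. */
static int access_ok(uint64_t ptr, const struct toy_chunk *c)
{
	return get_tag(ptr) == c->mem_tag;
}

static void tag_mismatch_demo(void)
{
	struct toy_chunk first = { 0x100000, 0xa1 };
	struct toy_chunk second = { 0x200000, 0xb2 };
	uint64_t base = set_tag(first.start, first.mem_tag);
	uint64_t ptr = derive(base, second.start - first.start);

	/* The derived pointer carries the first chunk's tag: mismatch. */
	assert(!access_ok(ptr, &second));
	/* Point 5: give the later chunk the same tag as the first one. */
	second.mem_tag = first.mem_tag;
	assert(access_ok(ptr, &second));
}
```

The offset arithmetic never touches the tag bits, so every pointer derived from the base inherits the first chunk's tag, which is why the later chunks have to be unpoisoned with that same tag.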
2025-12-23 mm/kasan: fix incorrect unpoisoning in vrealloc for KASAN (Jiayuan Chen; files: 1; lines: -0/+1)
Patch series "kasan: vmalloc: Fixes for the percpu allocator and vrealloc", v3. Patches fix two issues related to KASAN and vmalloc. The first one, a KASAN tag mismatch, possibly resulting in a kernel panic, can be observed on systems with a tag-based KASAN enabled and with multiple NUMA nodes. Initially it was only noticed on x86 [1] but later a similar issue was also reported on arm64 [2]. Specifically the problem is related to how vm_structs interact with pcpu_chunks - both when they are allocated, assigned and when pcpu_chunk addresses are derived. When vm_structs are allocated they are unpoisoned, each with a different random tag, if vmalloc support is enabled along the KASAN mode. Later when first pcpu chunk is allocated it gets its 'base_addr' field set to the first allocated vm_struct. With that it inherits that vm_struct's tag. When pcpu_chunk addresses are later derived (by pcpu_chunk_addr(), for example in pcpu_alloc_noprof()) the base_addr field is used and offsets are added to it. If the initial conditions are satisfied then some of the offsets will point into memory allocated with a different vm_struct. So while the lower bits will get accurately derived the tag bits in the top of the pointer won't match the shadow memory contents. The solution (proposed at v2 of the x86 KASAN series [3]) is to unpoison the vm_structs with the same tag when allocating them for the per cpu allocator (in pcpu_get_vm_areas()). The second one reported by syzkaller [4] is related to vrealloc and happens because of random tag generation when unpoisoning memory without allocating new pages. This breaks shadow memory tracking and needs to reuse the existing tag instead of generating a new one. At the same time an inconsistency in used flags is corrected. This patch (of 3): Syzkaller reported a memory out-of-bounds bug [4]. This patch fixes two issues: 1. In vrealloc the KASAN_VMALLOC_VM_ALLOC flag is missing when unpoisoning the extended region. 
This flag is required to correctly associate the allocation with KASAN's vmalloc tracking. Note: In contrast, vzalloc (via __vmalloc_node_range_noprof) explicitly sets KASAN_VMALLOC_VM_ALLOC and calls kasan_unpoison_vmalloc() with it. vrealloc must behave consistently -- especially when reusing existing vmalloc regions -- to ensure KASAN can track allocations correctly. 2. When vrealloc reuses an existing vmalloc region (without allocating new pages) KASAN generates a new tag, which breaks tag-based memory access tracking. Introduce KASAN_VMALLOC_KEEP_TAG, a new KASAN flag that allows reusing the tag already attached to the pointer, ensuring consistent tag behavior during reallocation. Pass KASAN_VMALLOC_KEEP_TAG and KASAN_VMALLOC_VM_ALLOC to the kasan_unpoison_vmalloc inside vrealloc_node_align_noprof(). Link: https://lkml.kernel.org/r/cover.1765978969.git.m.wieczorretman@pm.me Link: https://lkml.kernel.org/r/38dece0a4074c43e48150d1e242f8242c73bf1a5.1764874575.git.m.wieczorretman@pm.me Link: https://lore.kernel.org/all/e7e04692866d02e6d3b32bb43b998e5d17092ba4.1738686764.git.maciej.wieczor-retman@intel.com/ [1] Link: https://lore.kernel.org/all/aMUrW1Znp1GEj7St@MiWiFi-R3L-srv/ [2] Link: https://lore.kernel.org/all/CAPAsAGxDRv_uFeMYu9TwhBVWHCCtkSxoWY4xmFB_vowMbi8raw@mail.gmail.com/ [3] Link: https://syzkaller.appspot.com/bug?extid=997752115a851cb0cf36 [4] Fixes: a0309faf1cb0 ("mm: vmalloc: support more granular vrealloc() sizing") Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Co-developed-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Reported-by: syzbot+997752115a851cb0cf36@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68e243a2.050a0220.1696c6.007d.GAE@google.com/T/ Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dmitriy 
Vyukov <dvyukov@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Marco Elver <elver@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
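The role of the keep-tag flag in the vrealloc fix can be illustrated with a toy unpoison routine. This is a sketch under assumptions and not the KASAN implementation: the TOY_ flag values, toy_unpoison() and the sequential "random" tag source are all hypothetical, and the tag is assumed to live in the top byte of the pointer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits; the names mirror the commit message, the values do not. */
#define TOY_VMALLOC_VM_ALLOC 0x1u
#define TOY_VMALLOC_KEEP_TAG 0x2u

static uint8_t toy_next_tag = 0x10;  /* stand-in for random tag generation */

/*
 * Unpoison a region: record the tag in *shadow_tag and return the pointer
 * carrying that tag. TOY_VMALLOC_VM_ALLOC is accepted but ignored here.
 */
static uint64_t toy_unpoison(uint64_t ptr, unsigned int flags, uint8_t *shadow_tag)
{
	uint8_t tag;

	if (flags & TOY_VMALLOC_KEEP_TAG)
		tag = (uint8_t)(ptr >> 56);  /* reuse the tag already in the pointer */
	else
		tag = toy_next_tag++;        /* fresh tag */
	*shadow_tag = tag;
	return (ptr & ~(0xffULL << 56)) | ((uint64_t)tag << 56);
}

static void keep_tag_demo(void)
{
	const uint8_t head_shadow = 0x42;  /* tag of the existing allocation */
	uint64_t p = ((uint64_t)head_shadow << 56) | 0x1000;
	uint8_t tail_shadow;

	/* Without the flag the extended part gets a different tag. */
	toy_unpoison(p, TOY_VMALLOC_VM_ALLOC, &tail_shadow);
	assert(tail_shadow != head_shadow);

	/* With the flag the extended part matches the existing pointer. */
	uint64_t q = toy_unpoison(p, TOY_VMALLOC_VM_ALLOC | TOY_VMALLOC_KEEP_TAG,
				  &tail_shadow);
	assert(tail_shadow == head_shadow);
	assert(q == p);
}
```

When a region grows in place, the caller keeps using the old pointer, so the newly unpoisoned tail has to carry the tag that pointer already has.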
2025-12-23 genalloc.h: fix htmldocs warning (Andrew Morton; files: 1; lines: -0/+1)
WARNING: include/linux/genalloc.h:52 function parameter 'start_addr' not described in 'genpool_algo_t' Fixes: 52fbf1134d47 ("lib/genalloc.c: fix allocation of aligned buffer from non-aligned chunk") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lkml.kernel.org/r/20251127130624.563597e3@canb.auug.org.au Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Alexey Skidanov <alexey.skidanov@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-12-23 mm: introduce BPF kfuncs to access memcg statistics and events (Roman Gushchin; files: 1; lines: -0/+14)
Introduce BPF kfuncs to conveniently access memcg data: - bpf_mem_cgroup_vm_events(), - bpf_mem_cgroup_memory_events(), - bpf_mem_cgroup_usage(), - bpf_mem_cgroup_page_state(), - bpf_mem_cgroup_flush_stats(). These functions are useful for implementing BPF OOM policies, but also can be used to accelerate access to the memcg data. Reading it through cgroupfs is much more expensive, roughly 5x, mostly because of the need to convert the data into the text and back. JP Kobryn: An experiment was setup to compare the performance of a program that uses the traditional method of reading memory.stat vs a program using the new kfuncs. The control program opens up the root memory.stat file and for 1M iterations reads, converts the string values to numeric data, then seeks back to the beginning. The experimental program sets up the requisite libbpf objects and for 1M iterations invokes a bpf program which uses the kfuncs to fetch all available stats for node_stat_item, memcg_stat_item, and vm_event_item types. The results showed a significant perf benefit on the experimental side, outperforming the control side by a margin of 93%. In kernel mode, elapsed time was reduced by 80%, while in user mode, over 99% of time was saved. control: elapsed time real 0m38.318s user 0m25.131s sys 0m13.070s experiment: elapsed time real 0m2.789s user 0m0.187s sys 0m2.512s control: perf data 33.43% a.out libc.so.6 [.] __vfscanf_internal 6.88% a.out [kernel.kallsyms] [k] vsnprintf 6.33% a.out libc.so.6 [.] _IO_fgets 5.51% a.out [kernel.kallsyms] [k] format_decode 4.31% a.out libc.so.6 [.] __GI_____strtoull_l_internal 3.78% a.out [kernel.kallsyms] [k] string 3.53% a.out [kernel.kallsyms] [k] number 2.71% a.out libc.so.6 [.] _IO_sputbackc 2.41% a.out [kernel.kallsyms] [k] strlen 1.98% a.out a.out [.] main 1.70% a.out libc.so.6 [.] _IO_getline_info 1.51% a.out libc.so.6 [.] 
__isoc99_sscanf 1.47% a.out [kernel.kallsyms] [k] memory_stat_format 1.47% a.out [kernel.kallsyms] [k] memcpy_orig 1.41% a.out [kernel.kallsyms] [k] seq_buf_printf experiment: perf data 10.55% memcgstat bpf_prog_..._query [k] bpf_prog_16aab2f19fa982a7_query 6.90% memcgstat [kernel.kallsyms] [k] memcg_page_state_output 3.55% memcgstat [kernel.kallsyms] [k] _raw_spin_lock 3.12% memcgstat [kernel.kallsyms] [k] memcg_events 2.87% memcgstat [kernel.kallsyms] [k] __memcg_slab_post_alloc_hook 2.73% memcgstat [kernel.kallsyms] [k] kmem_cache_free 2.70% memcgstat [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack 2.25% memcgstat [kernel.kallsyms] [k] __memcg_slab_free_hook 2.06% memcgstat [kernel.kallsyms] [k] get_page_from_freelist Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Co-developed-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Link: https://lore.kernel.org/r/20251223044156.208250-5-roman.gushchin@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
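The percentages quoted in the experiment can be re-derived from the listed timings. A small sketch of that arithmetic, using only the real/user/sys values given above (percent_saved() is a hypothetical helper):

```c
#include <assert.h>

/* Percentage of time saved when going from 'before' to 'after' seconds. */
static double percent_saved(double before, double after)
{
	return (before - after) / before * 100.0;
}

static void timing_demo(void)
{
	double real = percent_saved(38.318, 2.789);  /* elapsed time */
	double user = percent_saved(25.131, 0.187);  /* user mode */
	double sys = percent_saved(13.070, 2.512);   /* kernel mode */

	assert(real > 92.0 && real < 93.0);  /* quoted as a margin of 93% */
	assert(sys > 80.0 && sys < 81.0);    /* quoted as reduced by 80% */
	assert(user > 99.0);                 /* quoted as over 99% */
}
```

The user-mode share shrinks the most because the text parsing shown in the control profile (the scanf and strtoull entries) disappears entirely in the kfunc-based program.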
2025-12-23 mm: declare memcg_page_state_output() in memcontrol.h (Roman Gushchin; files: 1; lines: -0/+6)
To use memcg_page_state_output() in bpf_memcontrol.c move the declaration from v1-specific memcontrol-v1.h to memcontrol.h. Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Link: https://lore.kernel.org/r/20251223044156.208250-2-roman.gushchin@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-22 platform/x86/intel/vsec: correct kernel-doc comments (Randy Dunlap; files: 1; lines: -8/+9)
Fix kernel-doc warnings in intel_vsec.h to eliminate all kernel-doc warnings: Warning: include/linux/intel_vsec.h:92 struct member 'read_telem' not described in 'pmt_callbacks' Warning: include/linux/intel_vsec.h:146 expecting prototype for struct intel_sec_device. Prototype was for struct intel_vsec_device instead Warning: include/linux/intel_vsec.h:146 struct member 'priv_data_size' not described in 'intel_vsec_device' In struct pmt_callbacks, correct the kernel-doc for @read_telem. kernel-doc doesn't support documenting callback function parameters, so drop the '@' signs on those and use "* *" to make them somewhat readable in the produced documentation output. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20251216063801.2896495-1-rdunlap@infradead.org Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-12-22 software node: Also support referencing non-constant software nodes (Sakari Ailus; files: 1; lines: -0/+1)
Fwnode references are implemented differently if the referenced node is a software node. _Generic() is used to differentiate between the two cases, but only const software nodes were present in the selection. Also add non-const software nodes. Reported-by: Kenneth Crudup <kenny@panix.com> Closes: https://lore.kernel.org/all/af773b82-bef2-4209-baaf-526d4661b7fc@panix.com/ Fixes: d7cdbbc93c56 ("software node: allow referencing firmware nodes") Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Tested-By: Kenneth R. Crudup <kenny@panix.com> Tested-by: Mehdi Djait <mehdi.djait@linux.intel.com> # Dell XPS 9315 Reviewed-by: Mehdi Djait <mehdi.djait@linux.intel.com> Link: https://patch.msgid.link/20251219083638.2454138-1-sakari.ailus@linux.intel.com Signed-off-by: Danilo Krummrich <dakr@kernel.org>
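The const versus non-const distinction matters because _Generic() matches on the exact pointer type. A minimal sketch with hypothetical stand-in types (struct toy_swnode, struct toy_fwnode) and a hypothetical REF_KIND() macro, not the kernel's SOFTWARE_NODE_REFERENCE():

```c
#include <assert.h>

/* Hypothetical stand-ins for a software node and a firmware node handle. */
struct toy_swnode { const char *name; };
struct toy_fwnode { int id; };

enum ref_kind { REF_SWNODE = 1, REF_FWNODE = 2 };

/*
 * Select on the static type of the pointer. Listing only the const variant
 * would make REF_KIND(&non_const_node) fail to compile.
 */
#define REF_KIND(ptr) _Generic((ptr),			\
	const struct toy_swnode *: REF_SWNODE,		\
	struct toy_swnode *: REF_SWNODE,		\
	struct toy_fwnode *: REF_FWNODE)

static void ref_kind_demo(void)
{
	static const struct toy_swnode const_node = { "const" };
	struct toy_swnode node = { "mutable" };
	struct toy_fwnode fw = { 1 };

	assert(REF_KIND(&const_node) == REF_SWNODE);
	assert(REF_KIND(&node) == REF_SWNODE);
	assert(REF_KIND(&fw) == REF_FWNODE);
}
```

_Generic() does not apply qualifier conversions to the pointed-to type when matching, so each variant that callers may pass needs its own association.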
2025-12-22 Merge tag 'x86-urgent-2025-12-21' of ↵ (Linus Torvalds; files: 1; lines: -1/+0)
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Fix FPU core dumps on certain CPU models - Fix htmldocs build warning - Export TLB tracing event name via header - Remove unused constant from <linux/mm_types.h> - Fix comments - Fix whitespace noise in documentation - Fix variadic structure's definition to un-confuse UBSAN - Fix posted MSI interrupts irq_retrigger() bug - Fix asm build failure with older GCC builds * tag 'x86-urgent-2025-12-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/bug: Fix old GCC compile fails x86/msi: Make irq_retrigger() functional for posted MSI x86/platform/uv: Fix UBSAN array-index-out-of-bounds mm: Remove tlb_flush_reason::NR_TLB_FLUSH_REASONS from <linux/mm_types.h> x86/mm/tlb/trace: Export the TLB_REMOTE_WRONG_CPU enum in <trace/events/tlb.h> x86/sgx: Remove unmatched quote in __sgx_encl_extend function comment x86/boot/Documentation: Fix whitespace noise in boot.rst x86/fpu: Fix FPU state core dump truncation on CPUs with no extended xfeatures x86/boot/Documentation: Fix htmldocs build warning due to malformed table in boot.rst
2025-12-21 bpf: arm64: Optimize recursion detection by not using atomics (Puranjay Mohan; files: 1; lines: -3/+29)
BPF programs detect recursion using a per-CPU 'active' flag in struct bpf_prog. The trampoline currently sets/clears this flag with atomic operations. On some arm64 platforms (e.g., Neoverse V2 with LSE), per-CPU atomic operations are relatively slow. Unlike x86_64 - where per-CPU updates can avoid cross-core atomicity, arm64 LSE atomics are always atomic across all cores, which is unnecessary overhead for strictly per-CPU state. This patch removes atomics from the recursion detection path on arm64 by changing 'active' to a per-CPU array of four u8 counters, one per context: {NMI, hard-irq, soft-irq, normal}. The running context uses a non-atomic increment/decrement on its element. After increment, recursion is detected by reading the array as a u32 and verifying that only the expected element changed; any change in another element indicates inter-context recursion, and a value > 1 in the same element indicates same-context recursion. For example, starting from {0,0,0,0}, a normal-context trigger changes the array to {0,0,0,1}. If an NMI arrives on the same CPU and triggers the program, the array becomes {1,0,0,1}. When the NMI context checks the u32 against the expected mask for normal (0x00000001), it observes 0x01000001 and correctly reports recursion. Same-context recursion is detected analogously. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251219184422.2899902-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
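The counter scheme described above can be exercised in user space. This is a sketch, not the kernel helpers: the context numbering (normal in the lowest byte, NMI in the highest) is an assumption chosen so that the masks match the 0x00000001 and 0x01000001 values in the example, the names rec_enter() and rec_exit() are hypothetical, and the 32-bit value is assembled with shifts so the result does not depend on host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed numbering, chosen so the masks match the example in the text. */
enum rec_ctx { CTX_NORMAL, CTX_SOFTIRQ, CTX_HARDIRQ, CTX_NMI, NR_CTX };

/* Combine the four u8 counters into one 32-bit value. */
static uint32_t active_as_u32(const uint8_t active[NR_CTX])
{
	return (uint32_t)active[0] | (uint32_t)active[1] << 8 |
	       (uint32_t)active[2] << 16 | (uint32_t)active[3] << 24;
}

/* Returns true if the program may run, false if recursion was detected. */
static bool rec_enter(uint8_t active[NR_CTX], enum rec_ctx ctx)
{
	active[ctx]++;  /* plain, non-atomic increment of this context's slot */
	return active_as_u32(active) == (UINT32_C(1) << (ctx * 8));
}

static void rec_exit(uint8_t active[NR_CTX], enum rec_ctx ctx)
{
	active[ctx]--;
}

static void recursion_demo(void)
{
	uint8_t active[NR_CTX] = { 0 };

	assert(rec_enter(active, CTX_NORMAL));         /* 0x00000001 */
	assert(active_as_u32(active) == 0x00000001u);
	assert(!rec_enter(active, CTX_NMI));           /* 0x01000001: inter-context */
	assert(active_as_u32(active) == 0x01000001u);
	rec_exit(active, CTX_NMI);
	assert(!rec_enter(active, CTX_NORMAL));        /* 0x00000002: same context */
	rec_exit(active, CTX_NORMAL);
	rec_exit(active, CTX_NORMAL);
	assert(active_as_u32(active) == 0);
}
```

A single 32-bit comparison covers both failure modes: any other non-zero byte means another context is already inside the program, and a value above 1 in the caller's own byte means same-context recursion.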
2025-12-21 bpf: move recursion detection logic to helpers (Puranjay Mohan; files: 1; lines: -0/+10)
BPF programs detect recursion by doing atomic inc/dec on a per-cpu active counter from the trampoline. Create two helpers for operations on this active counter; this makes it easy to change the recursion detection logic in the future. This commit makes no functional changes. Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Link: https://lore.kernel.org/r/20251219184422.2899902-2-puranjay@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-12-21 clang: work around asm output constraint problems (Eric Dumazet; files: 2; lines: -1/+3)
Work around clang problems with the "=rm" asm constraint. clang seems to always choose the memory output, while it is almost always the worst choice. Add ASM_OUTPUT_RM so that we can replace the "=rm" constraint where it matters for clang, while not penalizing gcc. Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
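One way such a macro could look is sketched below. The definition is an assumption based on the description ("=r" for clang, "=rm" otherwise) and is not copied from the patch; copy_via_asm() is a hypothetical user, and the inline asm is x86-only with a plain C fallback for other architectures.

```c
#include <assert.h>

/* Assumed shape of the macro: register-only output for clang, "=rm" elsewhere. */
#if defined(__clang__)
#define ASM_OUTPUT_RM "=r"
#else
#define ASM_OUTPUT_RM "=rm"
#endif

/* Hypothetical user: copy a value through inline asm on x86. */
static unsigned int copy_via_asm(unsigned int in)
{
	unsigned int out;

#if defined(__x86_64__) || defined(__i386__)
	/* The compiler picks a register or memory for %0 as the constraint allows. */
	__asm__("movl %1, %0" : ASM_OUTPUT_RM (out) : "r" (in));
#else
	out = in;  /* fallback so the sketch builds on other architectures */
#endif
	return out;
}

static void asm_output_demo(void)
{
	assert(copy_via_asm(42u) == 42u);
}
```

Because the constraint string is an ordinary string literal, a macro can swap it per compiler without touching the asm statement itself.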