summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2026-02-27Merge tag 'slab-for-7.0-rc1' of ↵Linus Torvalds2-18/+37
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fixes from Vlastimil Babka: - Fix for spurious page allocation warnings on sheaf refill (Harry Yoo) - Fix for CONFIG_MEM_ALLOC_PROFILING_DEBUG warnings (Suren Baghdasaryan) - Fix for kernel-doc warning on ksize() (Sanjay Chitroda) - Fix to avoid setting slab->stride later than on slab allocation. Doesn't yet fix the reports from powerpc; debugging is making progress (Harry Yoo) * tag 'slab-for-7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: initialize slab->stride early to avoid memory ordering issues mm/slub: drop duplicate kernel-doc for ksize() mm/slab: mark alloc tags empty for sheaves allocated with __GFP_NO_OBJ_EXT mm/slab: pass __GFP_NOWARN to refill_sheaf() if fallback is available
2026-02-27mm/slab: initialize slab->stride early to avoid memory ordering issuesHarry Yoo1-2/+3
When alloc_slab_obj_exts() is called later (instead of during slab allocation and initialization), slab->stride and slab->obj_exts are updated after the slab is already accessible by multiple CPUs. The current implementation does not enforce memory ordering between slab->stride and slab->obj_exts. For correctness, slab->stride must be visible before slab->obj_exts. Otherwise, concurrent readers may observe slab->obj_exts as non-zero while stride is still stale. With stale slab->stride, slab_obj_ext() could return the wrong obj_ext. This could cause two problems: - obj_cgroup_put() is called on the wrong objcg, leading to a use-after-free due to incorrect reference counting [1] by decrementing the reference count more than it was incremented. - refill_obj_stock() is called on the wrong objcg, leading to a page_counter overflow [2] by uncharging more memory than charged. Fix this by unconditionally initializing slab->stride in alloc_slab_obj_exts_early(), before the need_slab_obj_exts() check. In the case of SLAB_OBJ_EXT_IN_OBJ, it is overridden in the function. This ensures updates to slab->stride become visible before the slab can be accessed by other CPUs via the per-node partial slab list (protected by spinlock with acquire/release semantics). Thanks to Shakeel Butt for pointing out this issue [3]. [vbabka@kernel.org: the bug reports [1] and [2] are not yet fully fixed, with investigation ongoing, but it is nevertheless a step in the right direction to only set stride once after allocating the slab and not change it later ] Fixes: 7a8e71bc619d ("mm/slab: use stride to access slabobj_ext") Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Link: https://lore.kernel.org/lkml/ca241daa-e7e7-4604-a48d-de91ec9184a5@linux.ibm.com [1] Link: https://lore.kernel.org/all/ddff7c7d-c0c3-4780-808f-9a83268bbf0c@linux.ibm.com [2] Link: https://lore.kernel.org/linux-mm/aZu9G9mVIVzSm6Ft@hyeyoo [3] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-02-26mm/slab: mark alloc tags empty for sheaves allocated with __GFP_NO_OBJ_EXTSuren Baghdasaryan2-12/+25
alloc_empty_sheaf() allocates sheaves from SLAB_KMALLOC caches using __GFP_NO_OBJ_EXT to avoid recursion, however it does not mark their allocation tags empty before freeing, which results in a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG is set. Fix this by marking allocation tags for such sheaves as empty. The problem was technically introduced in commit 4c0a17e28340 but only becomes possible to hit with commit 913ffd3a1bf5. Fixes: 4c0a17e28340 ("slab: prevent recursive kmalloc() in alloc_empty_sheaf()") Fixes: 913ffd3a1bf5 ("slab: handle kmalloc sheaves bootstrap") Reported-by: David Wang <00107082@163.com> Closes: https://lore.kernel.org/all/20260223155128.3849-1-00107082@163.com/ Analyzed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Tested-by: Harry Yoo <harry.yoo@oracle.com> Tested-by: David Wang <00107082@163.com> Link: https://patch.msgid.link/20260225163407.2218712-1-surenb@google.com Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-02-26mm/slab: pass __GFP_NOWARN to refill_sheaf() if fallback is availableHarry Yoo1-4/+9
When refill_sheaf() is called, failing to refill the sheaf doesn't necessarily mean the allocation will fail because a fallback path might be available and serve the allocation request. Suppress spurious warnings by passing __GFP_NOWARN along with __GFP_NOMEMALLOC whenever a fallback path is available. When the caller is alloc_full_sheaf() or __pcs_replace_empty_main(), the kernel always falls back to the slowpath (__slab_alloc_node()). For __prefill_sheaf_pfmemalloc(), the fallback path is available only when gfp_pfmemalloc_allowed() returns true. Reported-and-tested-by: Chris Bainbridge <chris.bainbridge@gmail.com> Closes: https://lore.kernel.org/linux-mm/aZt2-oS9lkmwT7Ch@debian.local Fixes: 1ce20c28eafd ("slab: handle pfmemalloc slabs properly with sheaves") Link: https://lore.kernel.org/linux-mm/aZwSreGj9-HHdD-j@hyeyoo Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260223133322.16705-1-harry.yoo@oracle.com Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
2026-02-24mm: fix NULL NODE_DATA dereference for memoryless nodes on bootMing Lei1-1/+5
Commit d49004c5f0c1 ("arch, mm: consolidate initialization of nodes, zones and memory map") moved free_area_init() from setup_arch() to mm_core_init_early(), which runs after setup_arch() returns. This changed the ordering relative to init_cpu_to_node() on x86. Before the commit, free_area_init() ran during paging_init() (called from setup_arch()) *before* init_cpu_to_node(). After the commit, it runs *after* init_cpu_to_node(). On machines with memoryless NUMA nodes (e.g., node 0 has CPUs but no memory), this causes a NULL pointer dereference: 1. numa_register_nodes() skips memoryless nodes: no alloc_node_data() and no node_set_online() for them. 2. init_cpu_to_node() sets memoryless nodes online (they have CPUs) but does not allocate NODE_DATA. 3. free_area_init() checks "if (!node_online(nid))" to decide whether to call alloc_offline_node_data(). Since the memoryless node is now online, the allocation is skipped, leaving NODE_DATA(nid) == NULL. 4. The immediate "pgdat = NODE_DATA(nid)" dereferences NULL. The crash happens before console_init(), so no output is visible without earlyprintk. With earlyprintk enabled, the following panic is observed: BUG: unable to handle page fault for address: 000000000002a1e0 Oops: Oops: 0000 [#1] SMP NOPTI RIP: 0010:free_area_init_node+0x3a/0x540 Call Trace: <TASK> free_area_init+0x331/0x4e0 start_kernel+0x69/0x4a0 x86_64_start_reservations+0x24/0x30 x86_64_start_kernel+0x125/0x130 common_startup_64+0x13e/0x148 </TASK> Kernel panic - not syncing: Attempted to kill the idle task! Fix this by checking "if (!NODE_DATA(nid))" instead of "if (!node_online(nid))". This directly tests whether the per-node data structure needs to be allocated, regardless of the node's online status. This change is also safe for non-x86 architectures as they all allocate NODE_DATA for every node including memoryless ones, so the check simply evaluates to false with no change in behavior. Link: https://lkml.kernel.org/r/20260222115702.3659-1-ming.lei@redhat.com Fixes: d49004c5f0c1 ("arch, mm: consolidate initialization of nodes, zones and memory map") Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-24mm/kfence: fix KASAN hardware tag faults during late enablementAlexander Potapenko2-7/+10
When KASAN hardware tags are enabled, re-enabling KFENCE late (via /sys/module/kfence/parameters/sample_interval) causes KASAN faults. This happens because the KFENCE pool and metadata are allocated via the page allocator, which tags the memory, while KFENCE continues to access it using untagged pointers during initialization. Use __GFP_SKIP_KASAN for late KFENCE pool and metadata allocations to ensure the memory remains untagged, consistent with early allocations from memblock. To support this, add __GFP_SKIP_KASAN to the allowlist in __alloc_contig_verify_gfp_mask(). Link: https://lkml.kernel.org/r/20260220144940.2779209-1-glider@google.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Ernesto Martinez Garcia <ernesto.martinezgarcia@tugraz.at> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Kees Cook <kees@kernel.org> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-24mm/damon/core: disallow non-power of two min_region_szSeongJae Park1-0/+3
DAMON core uses min_region_sz parameter value as the DAMON region alignment. The alignment is made using ALIGN() and ALIGN_DOWN(), which support only the power of two alignments. But DAMON core API callers can set min_region_sz to an arbitrary number. Users can also set it indirectly, using addr_unit. When the alignment is not properly set, DAMON behavior becomes difficult to expect and understand, makes it effectively broken. It doesn't cause a kernel crash-like significant issue, though. Fix the issue by disallowing min_region_sz input that is not a power of two. Add the check to damon_commit_ctx(), as all DAMON API callers who set min_region_sz uses the function. This can be a sort of behavioral change, but it does not break users, for the following reasons. As the symptom is making DAMON effectively broken, it is not reasonable to believe there are real use cases of non-power of two min_region_sz. There is no known use case or issue reports from the setup, either. In future, if we find real use cases of non-power of two alignments and we can support it with low enough overhead, we can consider moving the restriction. But, for now, simply disallowing the corner case should be good enough as a hot fix. Link: https://lkml.kernel.org/r/20260214214124.87689-1-sj@kernel.org Fixes: d8f867fa0825 ("mm/damon: add damon_ctx->min_sz_region") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Quanmin Yan <yanquanmin1@huawei.com> Cc: <stable@vger.kernel.org> [6.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-24liveupdate: luo_file: remember retrieve() statusPratyush Yadav (Google)1-1/+6
LUO keeps track of successful retrieve attempts on a LUO file. It does so to avoid multiple retrievals of the same file. Multiple retrievals cause problems because once the file is retrieved, the serialized data structures are likely freed and the file is likely in a very different state from what the code expects. The retrieve boolean in struct luo_file keeps track of this, and is passed to the finish callback so it knows what work was already done and what it has left to do. All this works well when retrieve succeeds. When it fails, luo_retrieve_file() returns the error immediately, without ever storing anywhere that a retrieve was attempted or what its error code was. This results in an errored LIVEUPDATE_SESSION_RETRIEVE_FD ioctl to userspace, but nothing prevents it from trying this again. The retry is problematic for much of the same reasons listed above. The file is likely in a very different state than what the retrieve logic normally expects, and it might even have freed some serialization data structures. Attempting to access them or free them again is going to break things. For example, if memfd managed to restore 8 of its 10 folios, but fails on the 9th, a subsequent retrieve attempt will try to call kho_restore_folio() on the first folio again, and that will fail with a warning since it is an invalid operation. Apart from the retry, finish() also breaks. Since on failure the retrieved bool in luo_file is never touched, the finish() call on session close will tell the file handler that retrieve was never attempted, and it will try to access or free the data structures that might not exist, much in the same way as the retry attempt. There is no sane way of attempting the retrieve again. Remember the error retrieve returned and directly return it on a retry. Also pass this status code to finish() so it can make the right decision on the work it needs to do. This is done by changing the bool to an integer. A value of 0 means retrieve was never attempted, a positive value means it succeeded, and a negative value means it failed and the error code is the value. Link: https://lkml.kernel.org/r/20260216132221.987987-1-pratyush@kernel.org Fixes: 7c722a7f44e0 ("liveupdate: luo_file: implement file systems callbacks") Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-24mm: thp: deny THP for files on anonymous inodesDeepanshu Kartikey1-0/+3
file_thp_enabled() incorrectly allows THP for files on anonymous inodes (e.g. guest_memfd and secretmem). These files are created via alloc_file_pseudo(), which does not call get_write_access() and leaves inode->i_writecount at 0. Combined with S_ISREG(inode->i_mode) being true, they appear as read-only regular files when CONFIG_READ_ONLY_THP_FOR_FS is enabled, making them eligible for THP collapse. Anonymous inodes can never pass the inode_is_open_for_write() check since their i_writecount is never incremented through the normal VFS open path. The right thing to do is to exclude them from THP eligibility altogether, since CONFIG_READ_ONLY_THP_FOR_FS was designed for real filesystem files (e.g. shared libraries), not for pseudo-filesystem inodes. For guest_memfd, this allows khugepaged and MADV_COLLAPSE to create large folios in the page cache via the collapse path, but the guest_memfd fault handler does not support large folios. This triggers WARN_ON_ONCE(folio_test_large(folio)) in kvm_gmem_fault_user_mapping(). For secretmem, collapse_file() tries to copy page contents through the direct map, but secretmem pages are removed from the direct map. This can result in a kernel crash: BUG: unable to handle page fault for address: ffff88810284d000 RIP: 0010:memcpy_orig+0x16/0x130 Call Trace: collapse_file hpage_collapse_scan_file madvise_collapse Secretmem is not affected by the crash on upstream as the memory failure recovery handles the failed copy gracefully, but it still triggers confusing false memory failure reports: Memory failure: 0x106d96f: recovery action for clean unevictable LRU page: Recovered Check IS_ANON_FILE(inode) in file_thp_enabled() to deny THP for all anonymous inode files. Link: https://syzkaller.appspot.com/bug?extid=33a04338019ac7e43a44 Link: https://lore.kernel.org/linux-mm/CAEvNRgHegcz3ro35ixkDw39ES8=U6rs6S7iP0gkR9enr7HoGtA@mail.gmail.com Link: https://lkml.kernel.org/r/20260214001535.435626-1-kartikey406@gmail.com Fixes: 7fbb5e188248 ("mm: remove VM_EXEC requirement for THP eligibility") Signed-off-by: Deepanshu Kartikey <Kartikey406@gmail.com> Reported-by: syzbot+33a04338019ac7e43a44@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=33a04338019ac7e43a44 Tested-by: syzbot+33a04338019ac7e43a44@syzkaller.appspotmail.com Tested-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Ackerley Tng <ackerleytng@google.com> Tested-by: Ackerley Tng <ackerleytng@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Fangrui Song <i@maskray.me> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-24mm/kfence: disable KFENCE upon KASAN HW tags enablementAlexander Potapenko1-0/+15
KFENCE does not currently support KASAN hardware tags. As a result, the two features are incompatible when enabled simultaneously. Given that MTE provides deterministic protection and KFENCE is a sampling-based debugging tool, prioritize the stronger hardware protections. Disable KFENCE initialization and free the pre-allocated pool if KASAN hardware tags are detected to ensure the system maintains the security guarantees provided by MTE. Link: https://lkml.kernel.org/r/20260213095410.1862978-1-glider@google.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ernesto Martinez Garcia <ernesto.martinezgarcia@tugraz.at> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-22Convert remaining multi-line kmalloc_obj/flex GFP_KERNEL usesKees Cook2-2/+2
Conversion performed via this Coccinelle script: // SPDX-License-Identifier: GPL-2.0-only // Options: --include-headers-for-types --all-includes --include-headers --keep-comments virtual patch @gfp depends on patch && !(file in "tools") && !(file in "samples")@ identifier ALLOC = {kmalloc_obj,kmalloc_objs,kmalloc_flex, kzalloc_obj,kzalloc_objs,kzalloc_flex, kvmalloc_obj,kvmalloc_objs,kvmalloc_flex, kvzalloc_obj,kvzalloc_objs,kvzalloc_flex}; @@ ALLOC(... - , GFP_KERNEL ) $ make coccicheck MODE=patch COCCI=gfp.cocci Build and boot tested x86_64 with Fedora 42's GCC and Clang: Linux version 6.19.0+ (user@host) (gcc (GCC) 15.2.1 20260123 (Red Hat 15.2.1-7), GNU ld version 2.44-12.fc42) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Linux version 6.19.0+ (user@host) (clang version 20.1.8 (Fedora 20.1.8-4.fc42), LLD 20.1.8) #1 SMP PREEMPT_DYNAMIC 1970-01-01 Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22Convert more 'alloc_obj' cases to default GFP_KERNEL argumentsLinus Torvalds9-34/+17
This converts some of the visually simpler cases that have been split over multiple lines. I only did the ones that are easy to verify the resulting diff by having just that final GFP_KERNEL argument on the next line. Somebody should probably do a proper coccinelle script for this, but for me the trivial script actually resulted in an assertion failure in the middle of the script. I probably had made it a bit _too_ trivial. So after fighting that far a while I decided to just do some of the syntactically simpler cases with variations of the previous 'sed' scripts. The more syntactically complex multi-line cases would mostly really want whitespace cleanup anyway. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22Convert 'alloc_flex' family to use the new default GFP_KERNEL argumentLinus Torvalds3-3/+3
This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the *alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs*'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-22Convert 'alloc_obj' family to use the new default GFP_KERNEL argumentLinus Torvalds29-77/+77
This was done entirely with mindless brute force, using git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' | xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21Merge tag 'kmalloc_obj-treewide-v7.0-rc1' of ↵Linus Torvalds44-168/+159
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull kmalloc_obj conversion from Kees Cook: "This does the tree-wide conversion to kmalloc_obj() and friends using coccinelle, with a subsequent small manual cleanup of whitespace alignment that coccinelle does not handle. This uncovered a clang bug in __builtin_counted_by_ref(), so the conversion is preceded by disabling that for current versions of clang. The imminent clang 22.1 release has the fix. I've done allmodconfig build tests for x86_64, arm64, i386, and arm. I did defconfig builds for alpha, m68k, mips, parisc, powerpc, riscv, s390, sparc, sh, arc, csky, xtensa, hexagon, and openrisc" * tag 'kmalloc_obj-treewide-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: kmalloc_obj: Clean up after treewide replacements treewide: Replace kmalloc with kmalloc_obj for non-scalar types compiler_types: Disable __builtin_counted_by_ref for Clang
2026-02-21Merge tag 'fixes-2026-02-21' of ↵Linus Torvalds1-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock fix from Mike Rapoport: "Fix detection of NUMA node for CXL windows phys_to_target_node() may assign a CXL Fixed Memory Window to the wrong NUMA node when a CXL node resides in the gap of discontinuous System RAM node. Fix this by checking both numa_meminfo and numa_reserved_meminfo, preferring the reserved NID when the address appears in both" * tag 'fixes-2026-02-21' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: mm: numa_memblks: Identify the accurate NUMA ID of CFMW
2026-02-21treewide: Replace kmalloc with kmalloc_obj for non-scalar typesKees Cook44-168/+159
This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...) (where TYPE may also be *VAR) The resulting allocations no longer return "void *", instead returning "TYPE *". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-19Merge tag 'mm-stable-2026-02-18-19-48' of ↵Linus Torvalds20-195/+380
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - "mm/vmscan: fix demotion targets checks in reclaim/demotion" fixes a couple of issues in the demotion code - pages were failed demotion and were finding themselves demoted into disallowed nodes (Bing Jiao) - "Remove XA_ZERO from error recovery of dup_mmap()" fixes a rare mapledtree race and performs a number of cleanups (Liam Howlett) - "mm: add bitmap VMA flag helpers and convert all mmap_prepare to use them" implements a lot of cleanups following on from the conversion of the VMA flags into a bitmap (Lorenzo Stoakes) - "support batch checking of references and unmapping for large folios" implements batching to greatly improve the performance of reclaiming clean file-backed large folios (Baolin Wang) - "selftests/mm: add memory failure selftests" does as claimed (Miaohe Lin) * tag 'mm-stable-2026-02-18-19-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (36 commits) mm/page_alloc: clear page->private in free_pages_prepare() selftests/mm: add memory failure dirty pagecache test selftests/mm: add memory failure clean pagecache test selftests/mm: add memory failure anonymous page test mm: rmap: support batched unmapping for file large folios arm64: mm: implement the architecture-specific clear_flush_young_ptes() arm64: mm: support batch clearing of the young flag for large folios arm64: mm: factor out the address and ptep alignment into a new helper mm: rmap: support batched checks of the references for large folios tools/testing/vma: add VMA userland tests for VMA flag functions tools/testing/vma: separate out vma_internal.h into logical headers tools/testing/vma: separate VMA userland tests into separate files mm: make vm_area_desc utilise vma_flags_t only mm: update all remaining mmap_prepare users to use vma_flags_t mm: update shmem_[kernel]_file_*() functions to use vma_flags_t mm: update secretmem to use VMA flags on mmap_prepare mm: update hugetlbfs to use VMA flags on mmap_prepare mm: add basic VMA flag operation helper functions tools: bitmap: add missing bitmap_[subset(), andnot()] mm: add mk_vma_flags() bitmap flag macro helper ...
2026-02-17Merge tag 'slab-for-7.0-part2' of ↵Linus Torvalds2-29/+73
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull more slab updates from Vlastimil Babka: - Two stable fixes for kmalloc_nolock() usage from NMI context (Harry Yoo) - Allow kmalloc_nolock() allocations to be freed with kfree() and thus also kfree_rcu() and simplify slabobj_ext handling - we no longer need to track how it was allocated to use the matching freeing function (Harry Yoo) * tag 'slab-for-7.0-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab: drop the OBJEXTS_NOSPIN_ALLOC flag from enum objext_flags mm/slab: allow freeing kmalloc_nolock()'d objects using kfree[_rcu]() mm/slab: use prandom if !allow_spin mm/slab: do not access current->mems_allowed_seq if !allow_spin
2026-02-14Merge tag 'memblock-v7.0-rc1' of ↵Linus Torvalds4-6/+8
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock updates from Mike Rapoport: - update tools/include/linux/mm.h to fix memblock tests compilation - drop redundant struct page* parameter from memblock_free_pages() and get struct page from the pfn - add underflow detection for size calculation in memtest and warn about underflow when VM_DEBUG is enabled * tag 'memblock-v7.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: mm/memtest: add underflow detection for size calculation memblock: drop redundant 'struct page *' argument from memblock_free_pages() memblock test: include <linux/sizes.h> from tools mm.h stub
2026-02-14mm: numa_memblks: Identify the accurate NUMA ID of CFMWCui Chao1-4/+5
In some physical memory layout designs, the address space of CFMW (CXL Fixed Memory Window) resides between multiple segments of system memory belonging to the same NUMA node. In numa_cleanup_meminfo, these multiple segments of system memory are merged into a larger numa_memblk. When identifying which NUMA node the CFMW belongs to, it may be incorrectly assigned to the NUMA node of the merged system memory. When a CXL RAM region is created in userspace, the memory capacity of the newly created region is not added to the CFMW-dedicated NUMA node. Instead, it is accumulated into an existing NUMA node (e.g., NUMA0 containing RAM). This makes it impossible to clearly distinguish between the two types of memory, which may affect memory-tiering applications. Example memory layout: Physical address space: 0x00000000 - 0x1FFFFFFF System RAM (node0) 0x20000000 - 0x2FFFFFFF CXL CFMW (node2) 0x40000000 - 0x5FFFFFFF System RAM (node0) 0x60000000 - 0x7FFFFFFF System RAM (node1) After numa_cleanup_meminfo, the two node0 segments are merged into one: 0x00000000 - 0x5FFFFFFF System RAM (node0) // CFMW is inside the range 0x60000000 - 0x7FFFFFFF System RAM (node1) So the CFMW (0x20000000-0x2FFFFFFF) will be incorrectly assigned to node0. To address this scenario, accurately identifying the correct NUMA node can be achieved by checking whether the region belongs to both numa_meminfo and numa_reserved_meminfo. While this issue is only observed in a QEMU configuration, and no known end users are impacted by this problem, it is likely that some firmware implementation is leaving memory map holes in a CXL Fixed Memory Window. CXL hotplug depends on mapping free window capacity, and it seems to be only a coincidence to have not hit this problem yet. Fixes: 779dd20cfb56 ("cxl/region: Add region creation support") Signed-off-by: Cui Chao <cuichao1753@phytium.com.cn> Cc: stable@vger.kernel.org Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Gregory Price <gourry@gourry.net> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://patch.msgid.link/20260213060347.2389818-2-cuichao1753@phytium.com.cn Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
2026-02-13Merge tag 'mm-hotfixes-stable-2026-02-13-07-14' of ↵Linus Torvalds2-6/+20
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM fixes from Andrew Morton: "Three MM hotfixes, all three are cc:stable" * tag 'mm-hotfixes-stable-2026-02-13-07-14' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: procfs: fix possible double mmput() in do_procmap_query() mm/page_alloc: skip debug_check_no_{obj,locks}_freed with FPI_TRYLOCK mm/hugetlb: restore failed global reservations to subpool
2026-02-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-9/+0
Pull KVM updates from Paolo Bonzini: "Loongarch: - Add more CPUCFG mask bits - Improve feature detection - Add lazy load support for FPU and binary translation (LBT) register state - Fix return value for memory reads from and writes to in-kernel devices - Add support for detecting preemption from within a guest - Add KVM steal time test case to tools/selftests ARM: - Add support for FEAT_IDST, allowing ID registers that are not implemented to be reported as a normal trap rather than as an UNDEF exception - Add sanitisation of the VTCR_EL2 register, fixing a number of UXN/PXN/XN bugs in the process - Full handling of RESx bits, instead of only RES0, and resulting in SCTLR_EL2 being added to the list of sanitised registers - More pKVM fixes for features that are not supposed to be exposed to guests - Make sure that MTE being disabled on the pKVM host doesn't give it the ability to attack the hypervisor - Allow pKVM's host stage-2 mappings to use the Force Write Back version of the memory attributes by using the "pass-through' encoding - Fix trapping of ICC_DIR_EL1 on GICv5 hosts emulating GICv3 for the guest - Preliminary work for guest GICv5 support - A bunch of debugfs fixes, removing pointless custom iterators stored in guest data structures - A small set of FPSIMD cleanups - Selftest fixes addressing the incorrect alignment of page allocation - Other assorted low-impact fixes and spelling fixes RISC-V: - Fixes for issues discoverd by KVM API fuzzing in kvm_riscv_aia_imsic_has_attr(), kvm_riscv_aia_imsic_rw_attr(), and kvm_riscv_vcpu_aia_imsic_update() - Allow Zalasr, Zilsd and Zclsd extensions for Guest/VM - Transparent huge page support for hypervisor page tables - Adjust the number of available guest irq files based on MMIO register sizes found in the device tree or the ACPI tables - Add RISC-V specific paging modes to KVM selftests - Detect paging mode at runtime for selftests s390: - Performance improvement for vSIE (aka nested virtualization) - Completely new memory management. s390 was a special snowflake that enlisted help from the architecture's page table management to build hypervisor page tables, in particular enabling sharing the last level of page tables. This however was a lot of code (~3K lines) in order to support KVM, and also blocked several features. The biggest advantages is that the page size of userspace is completely independent of the page size used by the guest: userspace can mix normal pages, THPs and hugetlbfs as it sees fit, and in fact transparent hugepages were not possible before. It's also now possible to have nested guests and guests with huge pages running on the same host - Maintainership change for s390 vfio-pci - Small quality of life improvement for protected guests x86: - Add support for giving the guest full ownership of PMU hardware (contexted switched around the fastpath run loop) and allowing direct access to data MSRs and PMCs (restricted by the vPMU model). KVM still intercepts access to control registers, e.g. to enforce event filtering and to prevent the guest from profiling sensitive host state. This is more accurate, since it has no risk of contention and thus dropped events, and also has significantly less overhead. For more information, see the commit message for merge commit bf2c3138ae36 ("Merge tag 'kvm-x86-pmu-6.20' ...") - Disallow changing the virtual CPU model if L2 is active, for all the same reasons KVM disallows change the model after the first KVM_RUN - Fix a bug where KVM would incorrectly reject host accesses to PV MSRs when running with KVM_CAP_ENFORCE_PV_FEATURE_CPUID enabled, even if those were advertised as supported to userspace, - Fix a bug with protected guest state (SEV-ES/SNP and TDX) VMs, where KVM would attempt to read CR3 configuring an async #PF entry - Fail the build if EXPORT_SYMBOL_GPL or EXPORT_SYMBOL is used in KVM (for x86 only) to enforce usage of EXPORT_SYMBOL_FOR_KVM_INTERNAL. Only a few exports that are intended for external usage, and those are allowed explicitly - When checking nested events after a vCPU is unblocked, ignore -EBUSY instead of WARNing. Userspace can sometimes put the vCPU into what should be an impossible state, and spurious exit to userspace on -EBUSY does not really do anything to solve the issue - Also throw in the towel and drop the WARN on INIT/SIPI being blocked when vCPU is in Wait-For-SIPI, which also resulted in playing whack-a-mole with syzkaller stuffing architecturally impossible states into KVM - Add support for new Intel instructions that don't require anything beyond enumerating feature flags to userspace - Grab SRCU when reading PDPTRs in KVM_GET_SREGS2 - Add WARNs to guard against modifying KVM's CPU caps outside of the intended setup flow, as nested VMX in particular is sensitive to unexpected changes in KVM's golden configuration - Add a quirk to allow userspace to opt-in to actually suppress EOI broadcasts when the suppression feature is enabled by the guest (currently limited to split IRQCHIP, i.e. userspace I/O APIC). Sadly, simply fixing KVM to honor Suppress EOI Broadcasts isn't an option as some userspaces have come to rely on KVM's buggy behavior (KVM advertises Supress EOI Broadcast irrespective of whether or not userspace I/O APIC supports Directed EOIs) - Clean up KVM's handling of marking mapped vCPU pages dirty - Drop a pile of *ancient* sanity checks hidden behind in KVM's unused ASSERT() macro, most of which could be trivially triggered by the guest and/or user, and all of which were useless - Fold "struct dest_map" into its sole user, "struct rtc_status", to make it more obvious what the weird parameter is used for, and to allow fropping these RTC shenanigans if CONFIG_KVM_IOAPIC=n - Bury all of ioapic.h, i8254.h and related ioctls (including KVM_CREATE_IRQCHIP) behind CONFIG_KVM_IOAPIC=y - Add a regression test for recent APICv update fixes - Handle "hardware APIC ISR", a.k.a. SVI, updates in kvm_apic_update_apicv() to consolidate the updates, and to co-locate SVI updates with the updates for KVM's own cache of ISR information - Drop a dead function declaration - Minor cleanups x86 (Intel): - Rework KVM's handling of VMCS updates while L2 is active to temporarily switch to vmcs01 instead of deferring the update until the next nested VM-Exit. The deferred updates approach directly contributed to several bugs, was proving to be a maintenance burden due to the difficulty in auditing the correctness of deferred updates, and was polluting "struct nested_vmx" with a growing pile of booleans - Fix an SGX bug where KVM would incorrectly try to handle EPCM page faults, and instead always reflect them into the guest. Since KVM doesn't shadow EPCM entries, EPCM violations cannot be due to KVM interference and can't be resolved by KVM - Fix a bug where KVM would register its posted interrupt wakeup handler even if loading kvm-intel.ko ultimately failed - Disallow access to vmcb12 fields that aren't fully supported, mostly to avoid weirdness and complexity for FRED and other features, where KVM wants enable VMCS shadowing for fields that conditionally exist - Print out the "bad" offsets and values if kvm-intel.ko refuses to load (or refuses to online a CPU) due to a VMCS config mismatch x86 (AMD): - Drop a user-triggerable WARN on nested_svm_load_cr3() failure - Add support for virtualizing ERAPS. Note, correct virtualization of ERAPS relies on an upcoming, publicly announced change in the APM to reduce the set of conditions where hardware (i.e. KVM) *must* flush the RAP - Ignore nSVM intercepts for instructions that are not supported according to L1's virtual CPU model - Add support for expedited writes to the fast MMIO bus, a la VMX's fastpath for EPT Misconfig - Don't set GIF when clearing EFER.SVME, as GIF exists independently of SVM, and allow userspace to restore nested state with GIF=0 - Treat exit_code as an unsigned 64-bit value through all of KVM - Add support for fetching SNP certificates from userspace - Fix a bug where KVM would use vmcb02 instead of vmcb01 when emulating VMLOAD or VMSAVE on behalf of L2 - Misc fixes and cleanups x86 selftests: - Add a regression test for TPR<=>CR8 synchronization and IRQ masking - Overhaul selftest's MMU infrastructure to genericize stage-2 MMU support, and extend x86's infrastructure to support EPT and NPT (for L2 guests) - Extend several nested VMX tests to also cover nested SVM - Add a selftest for nested VMLOAD/VMSAVE - Rework the nested dirty log test, originally added as a regression test for PML where KVM logged L2 GPAs instead of L1 GPAs, to improve test coverage and to hopefully make the test easier to understand and maintain guest_memfd: - Remove kvm_gmem_populate()'s preparation tracking and half-baked hugepage handling. SEV/SNP was the only user of the tracking and it can do it via the RMP - Retroactively document and enforce (for SNP) that KVM_SEV_SNP_LAUNCH_UPDATE and KVM_TDX_INIT_MEM_REGION require the source page to be 4KiB aligned, to avoid non-trivial complexity for something that no known VMM seems to be doing and to avoid an API special case for in-place conversion, which simply can't support unaligned sources - When populating guest_memfd memory, GUP the source page in common code and pass the refcounted page to the vendor callback, instead of letting vendor code do the heavy lifting. Doing so avoids a looming deadlock bug with in-place due an AB-BA conflict betwee mmap_lock and guest_memfd's filemap invalidate lock Generic: - Fix a bug where KVM would ignore the vCPU's selected address space when creating a vCPU-specific mapping of guest memory. Actually this bug could not be hit even on x86, the only architecture with multiple address spaces, but it's a bug nevertheless" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (267 commits) KVM: s390: Increase permitted SE header size to 1 MiB MAINTAINERS: Replace backup for s390 vfio-pci KVM: s390: vsie: Fix race in acquire_gmap_shadow() KVM: s390: vsie: Fix race in walk_guest_tables() KVM: s390: Use guest address to mark guest page dirty irqchip/riscv-imsic: Adjust the number of available guest irq files RISC-V: KVM: Transparent huge page support RISC-V: KVM: selftests: Add Zalasr extensions to get-reg-list test RISC-V: KVM: Allow Zalasr extensions for Guest/VM KVM: riscv: selftests: Add riscv vm satp modes KVM: riscv: selftests: add Zilsd and Zclsd extension to get-reg-list test riscv: KVM: allow Zilsd and Zclsd extensions for Guest/VM RISC-V: KVM: Skip IMSIC update if vCPU IMSIC state is not initialized RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_rw_attr() RISC-V: KVM: Fix null pointer dereference in kvm_riscv_aia_imsic_has_attr() RISC-V: KVM: Remove unnecessary 'ret' assignment KVM: s390: Add explicit padding to struct kvm_s390_keyop KVM: LoongArch: selftests: Add steal time test case LoongArch: KVM: Add paravirt vcpu_is_preempted() support in guest side LoongArch: KVM: Add paravirt preempt feature in hypervisor side ...
2026-02-13mm/page_alloc: clear page->private in free_pages_prepare()Mikhail Gavrilov1-0/+1
Several subsystems (slub, shmem, ttm, etc.) use page->private but don't clear it before freeing pages. When these pages are later allocated as high-order pages and split via split_page(), tail pages retain stale page->private values. This causes a use-after-free in the swap subsystem. The swap code uses page->private to track swap count continuations, assuming freshly allocated pages have page->private == 0. When stale values are present, swap_count_continued() incorrectly assumes the continuation list is valid and iterates over uninitialized page->lru containing LIST_POISON values, causing a crash: KASAN: maybe wild-memory-access in range [0xdead000000000100-0xdead000000000107] RIP: 0010:__do_sys_swapoff+0x1151/0x1860 Fix this by clearing page->private in free_pages_prepare(), ensuring all freed pages have clean state regardless of previous use. Link: https://lkml.kernel.org/r/20260207173615.146159-1-mikhail.v.gavrilov@gmail.com Fixes: 3b8000ae185c ("mm/vmalloc: huge vmalloc backing pages should be split rather than compound") Signed-off-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Suggested-by: Zi Yan <ziy@nvidia.com> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Brendan Jackman <jackmanb@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: rmap: support batched unmapping for file large foliosBaolin Wang1-3/+7
Similar to folio_referenced_one(), we can apply batched unmapping for file large folios to optimize the performance of file folios reclamation. Barry previously implemented batched unmapping for lazyfree anonymous large folios[1] and did not further optimize anonymous large folios or file-backed large folios at that stage. As for file-backed large folios, the batched unmapping support is relatively straightforward, as we only need to clear the consecutive (present) PTE entries for file-backed large folios. Note that it's not ready to support batched unmapping for uffd case, so let's still fallback to per-page unmapping for the uffd case. Performance testing: Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to reclaim 8G file-backed folios via the memory.reclaim interface. I can observe 75% performance improvement on my Arm64 32-core server (and 50%+ improvement on my X86 machine) with this patch. W/o patch: real 0m1.018s user 0m0.000s sys 0m1.018s W/ patch: real 0m0.249s user 0m0.000s sys 0m0.249s [1] https://lore.kernel.org/all/20250214093015.51024-4-21cnbao@gmail.com/T/#u Link: https://lkml.kernel.org/r/b53a16f67c93a3fe65e78092069ad135edf00eff.1770645603.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: Barry Song <baohua@kernel.org> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: rmap: support batched checks of the references for large foliosBaolin Wang1-3/+25
Patch series "support batch checking of references and unmapping for large folios", v6. Currently, folio_referenced_one() always checks the young flag for each PTE sequentially, which is inefficient for large folios. This inefficiency is especially noticeable when reclaiming clean file-backed large folios, where folio_referenced() is observed as a significant performance hotspot. Moreover, on Arm architecture, which supports contiguous PTEs, there is already an optimization to clear the young flags for PTEs within a contiguous range. However, this is not sufficient. We can extend this to perform batched operations for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE). Similar to folio_referenced_one(), we can also apply batched unmapping for large file folios to optimize the performance of file folio reclamation. By supporting batched checking of the young flags, flushing TLB entries, and unmapping, I can observed a significant performance improvements in my performance tests for file folios reclamation. Please check the performance data in the commit message of each patch. This patch (of 5): Currently, folio_referenced_one() always checks the young flag for each PTE sequentially, which is inefficient for large folios. This inefficiency is especially noticeable when reclaiming clean file-backed large folios, where folio_referenced() is observed as a significant performance hotspot. Moreover, on Arm64 architecture, which supports contiguous PTEs, there is already an optimization to clear the young flags for PTEs within a contiguous range. However, this is not sufficient. We can extend this to perform batched operations for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE). Introduce a new API: clear_flush_young_ptes() to facilitate batched checking of the young flags and flushing TLB entries, thereby improving performance during large folio reclamation. And it will be overridden by the architecture that implements a more efficient batch operation in the following patches. While we are at it, rename ptep_clear_flush_young_notify() to clear_flush_young_ptes_notify() to indicate that this is a batch operation. Link: https://lkml.kernel.org/r/cover.1770645603.git.baolin.wang@linux.alibaba.com Link: https://lkml.kernel.org/r/12132694536834262062d1fb304f8f8a064b6750.1770645603.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Barry Song <baohua@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: make vm_area_desc utilise vma_flags_t onlyLorenzo Stoakes4-8/+10
Now we have eliminated all uses of vm_area_desc->vm_flags, eliminate this field, and have mmap_prepare users utilise the vma_flags_t vm_area_desc->vma_flags field only. As part of this change we alter is_shared_maywrite() to accept a vma_flags_t parameter, and introduce is_shared_maywrite_vm_flags() for use with legacy vm_flags_t flags. We also update struct mmap_state to add a union between vma_flags and vm_flags temporarily until the mmap logic is also converted to using vma_flags_t. Also update the VMA userland tests to reflect this change. Link: https://lkml.kernel.org/r/fd2a2938b246b4505321954062b1caba7acfc77a.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: update all remaining mmap_prepare users to use vma_flags_tLorenzo Stoakes1-9/+8
We will be shortly removing the vm_flags_t field from vm_area_desc so we need to update all mmap_prepare users to only use the dessc->vma_flags field. This patch achieves that and makes all ancillary changes required to make this possible. This lays the groundwork for future work to eliminate the use of vm_flags_t in vm_area_desc altogether and more broadly throughout the kernel. While we're here, we take the opportunity to replace VM_REMAP_FLAGS with VMA_REMAP_FLAGS, the vma_flags_t equivalent. No functional changes intended. Link: https://lkml.kernel.org/r/fb1f55323799f09fe6a36865b31550c9ec67c225.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Damien Le Moal <dlemoal@kernel.org> [zonefs] Acked-by: "Darrick J. Wong" <djwong@kernel.org> Acked-by: Pedro Falcato <pfalcato@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: update shmem_[kernel]_file_*() functions to use vma_flags_tLorenzo Stoakes2-28/+35
In order to be able to use only vma_flags_t in vm_area_desc we must adjust shmem file setup functions to operate in terms of vma_flags_t rather than vm_flags_t. This patch makes this change and updates all callers to use the new functions. No functional changes intended. [akpm@linux-foundation.org: comment fixes, per Baolin] Link: https://lkml.kernel.org/r/736febd280eb484d79cef5cf55b8a6f79ad832d2.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: update secretmem to use VMA flags on mmap_prepareLorenzo Stoakes5-12/+12
This patch updates secretmem to use the new vma_flags_t type which will soon supersede vm_flags_t altogether. In order to make this change we also have to update mlock_future_ok(), we replace the vm_flags_t parameter with a simple boolean is_vma_locked one, which also simplifies the invocation here. This is laying the groundwork for eliminating the vm_flags_t in vm_area_desc and more broadly throughout the kernel. No functional changes intended. [lorenzo.stoakes@oracle.com: fix check_brk_limits(), per Chris] Link: https://lkml.kernel.org/r/3aab9ab1-74b4-405e-9efb-08fc2500c06e@lucifer.local Link: https://lkml.kernel.org/r/a243a09b0a5d0581e963d696de1735f61f5b2075.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: update hugetlbfs to use VMA flags on mmap_prepareLorenzo Stoakes3-14/+14
In order to update all mmap_prepare users to utilising the new VMA flags type vma_flags_t and associated helper functions, we start by updating hugetlbfs which has a lot of additional logic that requires updating to make this change. This is laying the groundwork for eliminating the vm_flags_t from struct vm_area_desc and using vma_flags_t only, which further lays the ground for removing the deprecated vm_flags_t type altogether. No functional changes intended. Link: https://lkml.kernel.org/r/9226bec80c9aa3447cc2b83354f733841dba8a50.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: rename vma_flag_test/set_atomic() to vma_test/set_atomic_flag()Lorenzo Stoakes2-2/+2
In order to stay consistent between functions which manipulate a vm_flags_t argument of the form of vma_flags_...() and those which manipulate a VMA (in this case the flags field of a VMA), rename vma_flag_[test/set]_atomic() to vma_[test/set]_atomic_flag(). This lays the groundwork for adding VMA flag manipulation functions in a subsequent commit. Link: https://lkml.kernel.org/r/033dcf12e819dee5064582bced9b12ea346d1607.1769097829.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Yury Norov <ynorov@nvidia.com> Cc: Chris Mason <clm@fb.com> Cc: Pedro Falcato <pfalcato@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: use unmap_desc struct for freeing page tablesLiam R. Howlett5-31/+42
Pass through the unmap_desc to free_pgtables() because it almost has everything necessary and is already on the stack. Updates testing code as necessary. No functional changes intended. [Liam.Howlett@oracle.com: fix up unmap desc use on exit_mmap()] Link: https://lkml.kernel.org/r/20260210214214.364856-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20260121164946.2093480-12-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm/vma: use unmap_region() in vms_clear_ptes()Liam R. Howlett1-13/+1
There is no need to open code the vms_clear_ptes() now that unmap_desc struct is used. Link: https://lkml.kernel.org/r/20260121164946.2093480-11-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm/vma: use unmap_desc in exit_mmap() and vms_clear_ptes()Liam R. Howlett5-18/+50
Convert vms_clear_ptes() to use unmap_desc to call unmap_vmas() instead of the large argument list. The UNMAP_STATE() cannot be used because the vma iterator in the vms does not point to the correct maple state (mas_detach), and the tree_end will be set incorrectly. Setting up the arguments manually avoids setting the struct up incorrectly and doing extra work to get the correct pagetable range. exit_mmap() also calls unmap_vmas() with many arguments. Using the unmap_all_init() function to set the unmap descriptor for all vmas makes this a bit easier to read. Update to the vma test code is necessary to ensure testing continues to function. No functional changes intended. Link: https://lkml.kernel.org/r/20260121164946.2093480-10-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: introduce unmap_desc struct to reduce function argumentsLiam R. Howlett3-23/+51
The unmap_region code uses a number of arguments that could use better documentation. With the addition of a descriptor for unmap (called unmap_desc), the arguments can be more self-documenting and increase the descriptions within the declaration. No functional change intended Link: https://lkml.kernel.org/r/20260121164946.2093480-9-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: change dup_mmap() recoveryLiam R. Howlett2-16/+32
When the dup_mmap() fails during the vma duplication or setup, don't write the XA_ZERO entry in the vma tree. Instead, destroy the tree and free the new resources, leaving an empty vma tree. Using XA_ZERO introduced races where the vma could be found between dup_mmap() dropping all locks and exit_mmap() taking the locks. The race can occur because the mm can be reached through the other trees via successfully copied vmas and other methods such as the swapoff code. XA_ZERO was marking the location to stop vma removal and pagetable freeing. The newly created arguments to the unmap_vmas() and free_pgtables() serve this function. Replacing the XA_ZERO entry use with the new argument list also means the checks for xa_is_zero() are no longer necessary so these are also removed. Note that dup_mmap() now cleans up when ALL vmas are successfully copied, but the dup_mmap() fails to completely set up some other aspect of the duplication. Link: https://lkml.kernel.org/r/20260121164946.2093480-8-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm/vma: add page table limit to unmap_region()Liam R. Howlett2-4/+6
The unmap_region() calls need to pass through the page table limit for a future patch. No functional changes intended. Link: https://lkml.kernel.org/r/20260121164946.2093480-7-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm/memory: add tree limit to free_pgtables()Liam R. Howlett4-13/+40
The ceiling and tree search limit need to be different arguments for the future change in the failed fork attempt. The ceiling and floor variables are not very descriptive, so change them to pg_start/pg_end. Adding a new variable for the vma_end to the function as it will differ from the pg_end in the later patches in the series. Add a kernel doc about the free_pgtables() function. Test code also updated. No functional changes intended. Link: https://lkml.kernel.org/r/20260121164946.2093480-6-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm/vma: add limits to unmap_region() for vmasLiam R. Howlett2-2/+5
Add a limit to the vma search instead of using the start and end of the one passed in. No functional changes intended. Link: https://lkml.kernel.org/r/20260121164946.2093480-5-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm/mmap: abstract vma clean up from exit_mmap()Liam R. Howlett1-13/+24
Create the new function tear_down_vmas() to remove a range of vmas. exit_mmap() will be removing all the vmas. This is necessary for future patches. No functional changes intended. Link: https://lkml.kernel.org/r/20260121164946.2093480-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm/mmap: move exit_mmap() trace pointLiam R. Howlett1-1/+1
Move the trace point later in the function so that it is not skipped in the event of a failed fork. Link: https://lkml.kernel.org/r/20260121164946.2093480-3-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: relocate the page table ceiling and floor definitionsLiam R. Howlett1-0/+1
Patch series " Remove XA_ZERO from error recovery of dup_mmap()", v3. It is possible that the dup_mmap() call fails on allocating or setting up a vma after the maple tree of the oldmm is copied. Today, that failure point is marked by inserting an XA_ZERO entry over the failure point so that the exact location does not need to be communicated through to exit_mmap(). However, a race exists in the tear down process because the dup_mmap() drops the mmap lock before exit_mmap() can remove the partially set up vma tree. This means that other tasks may get to the mm tree and find the invalid vma pointer (since it's an XA_ZERO entry), even though the mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE. To remove the race fully, the tree must be cleaned up before dropping the lock. This is accomplished by extracting the vma cleanup in exit_mmap() and changing the required functions to pass through the vma search limit. Any other tree modifications would require extra cycles which should be spent on freeing memory. This does run the risk of increasing the possibility of finding no vmas (which is already possible!) in code that isn't careful. The final four patches are to address the excessive argument lists being passed between the functions. Using the struct unmap_desc also allows some special-case code to be removed in favour of the struct setup differences. This patch (of 11): pgtables.h defines a fallback for ceiling and floor of the page tables within the CONFIG_MMU section. Moving the definitions to outside the CONFIG_MMU allows for using them in generic code. [akpm@linux-foundation.org: remove stray newline, per SeongJae] Link: https://lkml.kernel.org/r/20260121164946.2093480-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20260121164946.2093480-2-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: SeongJae Park <sj@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm: folio_zero_user: open code range computation in folio_zero_user()Ankur Arora1-8/+7
riscv64-gcc-linux-gnu (v8.5) reports a compile time assert in: r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - radius, pg.start, pg.end), clamp_t(s64, fault_idx + radius, pg.start, pg.end)); where it decides that pg.start > pg.end in: clamp_t(s64, fault_idx + radius, pg.start, pg.end)); where pg comes from: const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1); That does not seem like it could be true. Even for pg.start == pg.end, we would need folio_test_large() to evaluate to false at compile time: static inline unsigned long folio_nr_pages(const struct folio *folio) { if (!folio_test_large(folio)) return 1; return folio_large_nr_pages(folio); } Workaround by open coding the range computation. Also, simplify the type declarations for the relevant variables. [ankur.a.arora@oracle.com: remove unneeded cast, per David] Link: https://lkml.kernel.org/r/20260206223801.2617497-1-ankur.a.arora@oracle.com Link: https://lkml.kernel.org/r/20260206223801.2617497-1-ankur.a.arora@oracle.com Link: https://lkml.kernel.org/r/20260128185943.2397128-1-ankur.a.arora@oracle.com Fixes: 93552c9a3350 ("mm: folio_zero_user: cache neighbouring pages") Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202601240453.QCjgGdJa-lkp@intel.com/ Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Konrad Rzessutek Wilk <konrad.wilk@oracle.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Li Zhe <lizhe.67@bytedance.com> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Raghavendra K T <raghavendra.kt@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm/vmscan: select the closest preferred node in demote_folio_list()Bing Jiao2-8/+18
The preferred demotion node (migration_target_control.nid) should be the one closest to the source node to minimize migration latency. Currently, a discrepancy exists where demote_folio_list() randomly selects an allowed node if the preferred node from next_demotion_node() is not set in mems_effective. To address it, update next_demotion_node() to select a preferred target against allowed nodes; and to return the closest demotion target if all preferred nodes are not in mems_effective via next_demotion_node(). It ensures that the preferred demotion target is consistently the closest available node to the source node. [akpm@linux-foundation.org: fix comment typo, per Shakeel] Link: https://lkml.kernel.org/r/20260114205305.2869796-3-bingjiao@google.com Signed-off-by: Bing Jiao <bingjiao@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Gregory Price <gourry@gourry.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm/vmscan: fix demotion targets checks in reclaim/demotionBing Jiao2-14/+36
Patch series "mm/vmscan: fix demotion targets checks in reclaim/demotion", v9. This patch series addresses two issues in demote_folio_list(), can_demote(), and next_demotion_node() in reclaim/demotion. 1. demote_folio_list() and can_demote() do not correctly check demotion target against cpuset.mems_effective, which will cause (a) pages to be demoted to not-allowed nodes and (b) pages fail demotion even if the system still has allowed demotion nodes. Patch 1 fixes this bug by updating cpuset_node_allowed() and mem_cgroup_node_allowed() to return effective_mems, allowing directly logic-and operation against demotion targets. 2. next_demotion_node() returns a preferred demotion target, but it does not check the node against allowed nodes. Patch 2 ensures that next_demotion_node() filters against the allowed node mask and selects the closest demotion target to the source node. This patch (of 2): Fix two bugs in demote_folio_list() and can_demote() due to incorrect demotion target checks against cpuset.mems_effective in reclaim/demotion. Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") introduces the cpuset.mems_effective check and applies it to can_demote(). However: 1. It does not apply this check in demote_folio_list(), which leads to situations where pages are demoted to nodes that are explicitly excluded from the task's cpuset.mems. 2. It checks only the nodes in the immediate next demotion hierarchy and does not check all allowed demotion targets in can_demote(). This can cause pages to never be demoted if the nodes in the next demotion hierarchy are not set in mems_effective. These bugs break resource isolation provided by cpuset.mems. This is visible from userspace because pages can either fail to be demoted entirely or are demoted to nodes that are not allowed in multi-tier memory systems. To address these bugs, update cpuset_node_allowed() and mem_cgroup_node_allowed() to return effective_mems, allowing directly logic-and operation against demotion targets. Also update can_demote() and demote_folio_list() accordingly. Bug 1 reproduction: Assume a system with 4 nodes, where nodes 0-1 are top-tier and nodes 2-3 are far-tier memory. All nodes have equal capacity. Test script: echo 1 > /sys/kernel/mm/numa/demotion_enabled mkdir /sys/fs/cgroup/test echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control echo "0-2" > /sys/fs/cgroup/test/cpuset.mems echo $$ > /sys/fs/cgroup/test/cgroup.procs swapoff -a # Expectation: Should respect node 0-2 limit. # Observation: Node 3 shows significant allocation (MemFree drops) stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1 Bug 2 reproduction: Assume a system with 6 nodes, where nodes 0-2 are top-tier, node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes. All nodes have equal capacity. Test script: echo 1 > /sys/kernel/mm/numa/demotion_enabled mkdir /sys/fs/cgroup/test echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems echo $$ > /sys/fs/cgroup/test/cgroup.procs swapoff -a # Expectation: Pages are demoted to Nodes 4-5 # Observation: No pages are demoted before oom. stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2 Link: https://lkml.kernel.org/r/20260114205305.2869796-1-bingjiao@google.com Link: https://lkml.kernel.org/r/20260114205305.2869796-2-bingjiao@google.com Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim") Signed-off-by: Bing Jiao <bingjiao@google.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Gregory Price <gourry@gourry.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm/page_alloc: skip debug_check_no_{obj,locks}_freed with FPI_TRYLOCKHarry Yoo1-6/+11
When CONFIG_DEBUG_OBJECTS_FREE is enabled, debug_check_no_{obj,locks}_freed() functions are called. Since both of them spin on a lock, they are not safe to be called if the FPI_TRYLOCK flag is specified. This leads to a lockdep splat: ================================ WARNING: inconsistent lock state 6.19.0-rc5-slab-for-next+ #326 Tainted: G N -------------------------------- inconsistent {INITIAL USE} -> {IN-NMI} usage. kunit_try_catch/9046 [HC2[2]:SC0[0]:HE0:SE1] takes: ffffffff84ed6bf8 (&obj_hash[i].lock){-.-.}-{2:2}, at: __debug_check_no_obj_freed+0xe0/0x300 {INITIAL USE} state was registered at: lock_acquire+0xd9/0x2f0 _raw_spin_lock_irqsave+0x4c/0x80 __debug_object_init+0x9d/0x1f0 debug_object_init+0x34/0x50 __init_work+0x28/0x40 init_cgroup_housekeeping+0x151/0x210 init_cgroup_root+0x3d/0x140 cgroup_init_early+0x30/0x240 start_kernel+0x3e/0xcd0 x86_64_start_reservations+0x18/0x30 x86_64_start_kernel+0xf3/0x140 common_startup_64+0x13e/0x148 irq event stamp: 2998 hardirqs last enabled at (2997): [<ffffffff8298b77a>] exc_nmi+0x11a/0x240 hardirqs last disabled at (2998): [<ffffffff8298b991>] sysvec_irq_work+0x11/0x110 softirqs last enabled at (1416): [<ffffffff813c1f72>] __irq_exit_rcu+0x132/0x1c0 softirqs last disabled at (1303): [<ffffffff813c1f72>] __irq_exit_rcu+0x132/0x1c0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&obj_hash[i].lock); <Interrupt> lock(&obj_hash[i].lock); *** DEADLOCK *** Rename free_pages_prepare() to __free_pages_prepare(), add an fpi_t parameter, and skip those checks if FPI_TRYLOCK is set. To keep the fpi_t definition in mm/page_alloc.c, add a wrapper function free_pages_prepare() that always passes FPI_NONE and use it in mm/compaction.c. Link: https://lkml.kernel.org/r/20260209062639.16577-1-harry.yoo@oracle.com Fixes: 8c57b687e833 ("mm, bpf: Introduce free_pages_nolock()") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-13mm/hugetlb: restore failed global reservations to subpoolJoshua Hahn1-0/+9
Commit a833a693a490 ("mm: hugetlb: fix incorrect fallback for subpool") fixed an underflow error for hstate->resv_huge_pages caused by incorrectly attributing globally requested pages to the subpool's reservation. Unfortunately, this fix also introduced the opposite problem, which would leave spool->used_hpages elevated if the globally requested pages could not be acquired. This is because while a subpool's reserve pages only accounts for what is requested and allocated from the subpool, its "used" counter keeps track of what is consumed in total, both from the subpool and globally. Thus, we need to adjust spool->used_hpages in the other direction, and make sure that globally requested pages are uncharged from the subpool's used counter. Each failed allocation attempt increments the used_hpages counter by how many pages were requested from the global pool. Ultimately, this renders the subpool unusable, as used_hpages approaches the max limit. The issue can be reproduced as follows: 1. Allocate 4 hugetlb pages 2. Create a hugetlb mount with max=4, min=2 3. Consume 2 pages globally 4. Request 3 pages from the subpool (2 from subpool + 1 from global) 4.1 hugepage_subpool_get_pages(spool, 3) succeeds. used_hpages += 3 4.2 hugetlb_acct_memory(h, 1) fails: no global pages left used_hpages -= 2 5. Subpool now has used_hpages = 1, despite not being able to successfully allocate any hugepages. It believes it can now only allocate 3 more hugepages, not 4. With each failed allocation attempt incrementing the used counter, the subpool eventually reaches a point where its used counter equals its max counter. At that point, any future allocations that try to allocate hugeTLB pages from the subpool will fail, despite the subpool not having any of its hugeTLB pages consumed by any user. Once this happens, there is no way to make the subpool usable again, since there is no way to decrement the used counter as no process is really consuming the hugeTLB pages. The underflow issue that the original commit fixes still remains fixed as well. Without this fix, used_hpages would keep on leaking if hugetlb_acct_memory() fails. Link: https://lkml.kernel.org/r/20260116204037.2270096-1-joshua.hahnjy@gmail.com Fixes: a833a693a490 ("mm: hugetlb: fix incorrect fallback for subpool") Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: Usama Arif <usama.arif@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Ma Wupeng <mawupeng1@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-02-12Merge tag 'mm-nonmm-stable-2026-02-12-10-48' of ↵Linus Torvalds6-10/+8
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves disk space by teaching ocfs2 to reclaim suballocator block group space (Heming Zhao) - "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the ARRAY_END() macro and uses it in various places (Alejandro Colomar) - "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the page size (Pnina Feder) - "kallsyms: Prevent invalid access when showing module buildid" cleans up kallsyms code related to module buildid and fixes an invalid access crash when printing backtraces (Petr Mladek) - "Address page fault in ima_restore_measurement_list()" fixes a kexec-related crash that can occur when booting the second-stage kernel on x86 (Harshit Mogalapalli) - "kho: ABI headers and Documentation updates" updates the kexec handover ABI documentation (Mike Rapoport) - "Align atomic storage" adds the __aligned attribute to atomic_t and atomic64_t definitions to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain) - "kho: clean up page initialization logic" simplifies the page initialization logic in kho_restore_page() (Pratyush Yadav) - "Unload linux/kernel.h" moves several things out of kernel.h and into more appropriate places (Yury Norov) - "don't abuse task_struct.group_leader" removes the usage of ->group_leader when it is "obviously unnecessary" (Oleg Nesterov) - "list private v2 & luo flb" adds some infrastructure improvements to the live update orchestrator (Pasha Tatashin) * tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits) watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency procfs: fix missing RCU protection when reading real_parent in do_task_stat() watchdog/softlockup: fix sample ring index wrap in need_counting_irqs() kcsan, compiler_types: avoid duplicate type issues in BPF Type Format kho: fix doc for kho_restore_pages() tests/liveupdate: add in-kernel liveupdate test liveupdate: luo_flb: introduce File-Lifecycle-Bound global state liveupdate: luo_file: Use private list list: add kunit test for private list primitives list: add primitives for private list manipulations delayacct: fix uapi timespec64 definition panic: add panic_force_cpu= parameter to redirect panic to a specific CPU netclassid: use thread_group_leader(p) in update_classid_task() RDMA/umem: don't abuse current->group_leader drm/pan*: don't abuse current->group_leader drm/amd: kill the outdated "Only the pthreads threading model is supported" checks drm/amdgpu: don't abuse current->group_leader android/binder: use same_thread_group(proc->tsk, current) in binder_mmap() android/binder: don't abuse current->group_leader kho: skip memoryless NUMA nodes when reserving scratch areas ...
2026-02-12Merge tag 'mm-stable-2026-02-11-19-22' of ↵Linus Torvalds87-2510/+3376
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - "powerpc/64s: do not re-activate batched TLB flush" makes arch_{enter|leave}_lazy_mmu_mode() nest properly (Alexander Gordeev) It adds a generic enter/leave layer and switches architectures to use it. Various hacks were removed in the process. - "zram: introduce compressed data writeback" implements data compression for zram writeback (Richard Chang and Sergey Senozhatsky) - "mm: folio_zero_user: clear page ranges" adds clearing of contiguous page ranges for hugepages. Large improvements during demand faulting are demonstrated (David Hildenbrand) - "memcg cleanups" tidies up some memcg code (Chen Ridong) - "mm/damon: introduce {,max_}nr_snapshots and tracepoint for damos stats" improves DAMOS stat's provided information, deterministic control, and readability (SeongJae Park) - "selftests/mm: hugetlb cgroup charging: robustness fixes" fixes a few issues in the hugetlb cgroup charging selftests (Li Wang) - "Fix va_high_addr_switch.sh test failure - again" addresses several issues in the va_high_addr_switch test (Chunyu Hu) - "mm/damon/tests/core-kunit: extend existing test scenarios" improves the KUnit test coverage for DAMON (Shu Anzai) - "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE" fixes a glitch in khugepaged which was causing madvise(MADV_COLLAPSE) to transiently return -EAGAIN (Shivank Garg) - "arch, mm: consolidate hugetlb early reservation" reworks and consolidates a pile of straggly code related to reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb (Mike Rapoport) - "mm: clean up anon_vma implementation" cleans up the anon_vma implementation in various ways (Lorenzo Stoakes) - "tweaks for __alloc_pages_slowpath()" does a little streamlining of the page allocator's slowpath code (Vlastimil Babka) - "memcg: separate private and public ID namespaces" cleans up the memcg ID code and prevents the internal-only private IDs from being exposed to userspace (Shakeel Butt) - "mm: hugetlb: allocate frozen gigantic folio" cleans up the allocation of frozen folios and avoids some atomic refcount operations (Kefeng Wang) - "mm/damon: advance DAMOS-based LRU sorting" improves DAMOS's movement of memory betewwn the active and inactive LRUs and adds auto-tuning of the ratio-based quotas and of monitoring intervals (SeongJae Park) - "Support page table check on PowerPC" makes CONFIG_PAGE_TABLE_CHECK_ENFORCED work on powerpc (Andrew Donnellan) - "nodemask: align nodes_and{,not} with underlying bitmap ops" makes nodes_and() and nodes_andnot() propagate the return values from the underlying bit operations, enabling some cleanup in calling code (Yury Norov) - "mm/damon: hide kdamond and kdamond_lock from API callers" cleans up some DAMON internal interfaces (SeongJae Park) - "mm/khugepaged: cleanups and scan limit fix" does some cleanup work in khupaged and fixes a scan limit accounting issue (Shivank Garg) - "mm: balloon infrastructure cleanups" goes to town on the balloon infrastructure and its page migration function. Mainly cleanups, also some locking simplification (David Hildenbrand) - "mm/vmscan: add tracepoint and reason for kswapd_failures reset" adds additional tracepoints to the page reclaim code (Jiayuan Chen) - "Replace wq users and add WQ_PERCPU to alloc_workqueue() users" is part of Marco's kernel-wide migration from the legacy workqueue APIs over to the preferred unbound workqueues (Marco Crivellari) - "Various mm kselftests improvements/fixes" provides various unrelated improvements/fixes for the mm kselftests (Kevin Brodsky) - "mm: accelerate gigantic folio allocation" greatly speeds up gigantic folio allocation, mainly by avoiding unnecessary work in pfn_range_valid_contig() (Kefeng Wang) - "selftests/damon: improve leak detection and wss estimation reliability" improves the reliability of two of the DAMON selftests (SeongJae Park) - "mm/damon: cleanup kdamond, damon_call(), damos filter and DAMON_MIN_REGION" does some cleanup work in the core DAMON code (SeongJae Park) - "Docs/mm/damon: update intro, modules, maintainer profile, and misc" performs maintenance work on the DAMON documentation (SeongJae Park) - "mm: add and use vma_assert_stabilised() helper" refactors and cleans up the core VMA code. The main aim here is to be able to use the mmap write lock's lockdep state to perform various assertions regarding the locking which the VMA code requires (Lorenzo Stoakes) - "mm, swap: swap table phase II: unify swapin use" removes some old swap code (swap cache bypassing and swap synchronization) which wasn't working very well. Various other cleanups and simplifications were made. The end result is a 20% speedup in one benchmark (Kairui Song) - "enable PT_RECLAIM on more 64-bit architectures" makes PT_RECLAIM available on 64-bit alpha, loongarch, mips, parisc, and um. Various cleanups were performed along the way (Qi Zheng) * tag 'mm-stable-2026-02-11-19-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (325 commits) mm/memory: handle non-split locks correctly in zap_empty_pte_table() mm: move pte table reclaim code to memory.c mm: make PT_RECLAIM depends on MMU_GATHER_RCU_TABLE_FREE mm: convert __HAVE_ARCH_TLB_REMOVE_TABLE to CONFIG_HAVE_ARCH_TLB_REMOVE_TABLE config um: mm: enable MMU_GATHER_RCU_TABLE_FREE parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE mips: mm: enable MMU_GATHER_RCU_TABLE_FREE LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h mm/damon/stat: remove __read_mostly from memory_idle_ms_percentiles zsmalloc: make common caches global mm: add SPDX id lines to some mm source files mm/zswap: use %pe to print error pointers mm/vmscan: use %pe to print error pointers mm/readahead: fix typo in comment mm: khugepaged: fix NR_FILE_PAGES and NR_SHMEM in collapse_file() mm: refactor vma_map_pages to use vm_insert_pages mm/damon: unify address range representation with damon_addr_range mm/cma: replace snprintf with strscpy in cma_new_area ...