path: root/mm
Age  Commit message  Author  Files  Lines
2026-03-04  mm/highmem: fix __kmap_to_page() build error  William Tambe  1  -1/+2
[ Upstream commit 94350fe6cad77b46c3dcb8c96543bef7647efbc0 ] This change fixes the following build error, which was missed by ef6e06b2ef87 ("highmem: fix kmap_to_page() for kmap_local_page() addresses"): mm/highmem.c:184:66: error: 'pteval' undeclared (first use in this function); did you mean 'pte_val'? 184 | idx = arch_kmap_local_map_idx(i, pte_pfn(pteval)); In __kmap_to_page(), pteval is used but does not exist in the function. (akpm: affects xtensa only) Link: https://lkml.kernel.org/r/SJ0PR07MB86317E00EC0C59DA60935FDCD18DA@SJ0PR07MB8631.namprd07.prod.outlook.com Fixes: ef6e06b2ef87 ("highmem: fix kmap_to_page() for kmap_local_page() addresses") Signed-off-by: William Tambe <williamt@cadence.com> Reviewed-by: Max Filippov <jcmvbkbc@gmail.com> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2026-03-04  mm, page_alloc, thp: prevent reclaim for __GFP_THISNODE THP allocations  Vlastimil Babka  1  -0/+14
[ Upstream commit 9c9828d3ead69416d731b1238802af31760c823e ] Since commit cc638f329ef6 ("mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations"), THP page fault allocations have settled on the following scheme (from the commit log): 1. local node only THP allocation with no reclaim, just compaction. 2. for madvised VMA's or when synchronous compaction is enabled always - THP allocation from any node with effort determined by global defrag setting and VMA madvise 3. fallback to base pages on any node Recent customer reports however revealed we have a gap in step 1 above. What we have seen is excessive reclaim due to THP page faults on a NUMA node that's close to its high watermark, while other nodes have plenty of free memory. The problem with step 1 is that it promises no reclaim after the compaction attempt, however reclaim is only avoided for certain compaction outcomes (deferred, or skipped due to insufficient free base pages), and not e.g. when compaction is actually performed but fails (we did see compact_fail vmstat counter increasing). THP page faults can therefore exhibit a zone_reclaim_mode-like behavior, which is not the intention. Thus add a check for __GFP_THISNODE that corresponds to this exact situation and prevents continuing with reclaim/compaction once the initial compaction attempt isn't successful in allocating the page. Note that commit cc638f329ef6 has not introduced this over-reclaim possibility; it appears to exist in some form since commit 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations"). Followup commits b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed") and cc638f329ef6 have moved in the right direction, but left the abovementioned gap. Link: https://lkml.kernel.org/r/20251219-costly-noretry-thisnode-fix-v1-1-e1085a4a0c34@suse.cz Fixes: 2f0799a0ffc0 ("mm, thp: restore node-local hugepage allocations") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Pedro Falcato <pfalcato@suse.de> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: "David Hildenbrand (Red Hat)" <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
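To make the described gap concrete, here is a small, runnable userspace C sketch of the decision being added (the function, the enum, and the flags are illustrative stand-ins, not the mm/page_alloc.c code): once a costly, local-node-only (__GFP_THISNODE) allocation has had its first compaction attempt fail to produce the page, the caller gives up on that node instead of falling through to reclaim.

    #include <stdbool.h>
    #include <stdio.h>

    enum compact_result { COMPACT_SKIPPED, COMPACT_DEFERRED, COMPACT_FAILED, COMPACT_SUCCESS };

    /* Illustrative stand-in for the new check described in the changelog. */
    static bool keep_trying(bool gfp_thisnode, bool costly_order,
                            enum compact_result first_attempt)
    {
            if (gfp_thisnode && costly_order && first_attempt != COMPACT_SUCCESS)
                    return false;   /* step 1 failed locally: fall back to steps 2/3 */
            return true;
    }

    int main(void)
    {
            /* Before the fix, only skipped/deferred outcomes avoided reclaim;
             * an actually-performed-but-failed compaction did not. */
            printf("THISNODE + compact_fail    -> keep trying? %d\n",
                   keep_trying(true, true, COMPACT_FAILED));
            printf("THISNODE + compact_success -> keep trying? %d\n",
                   keep_trying(true, true, COMPACT_SUCCESS));
            return 0;
    }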
2026-02-06  mm/kfence: randomize the freelist on initialization  Pimyn Girgis  1  -4/+19
[ Upstream commit 870ff19251bf3910dda7a7245da826924045fedd ] Randomize the KFENCE freelist during pool initialization to make allocation patterns less predictable. This is achieved by shuffling the order in which metadata objects are added to the freelist using get_random_u32_below(). Additionally, ensure the error path correctly calculates the address range to be reset if initialization fails, as the address increment logic has been moved to a separate loop. Link: https://lkml.kernel.org/r/20260120161510.3289089-1-pimyn@google.com Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure") Signed-off-by: Pimyn Girgis <pimyn@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Ernesto Martnez Garca <ernesto.martinezgarcia@tugraz.at> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Kees Cook <kees@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ replaced kfence_metadata_init with kfence_metadata ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
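As an illustration of the shuffle described above, the following runnable userspace C sketch fills an index array, shuffles it Fisher-Yates style (rand() stands in for the kernel's get_random_u32_below()), and then adds the metadata objects to the freelist in the shuffled order. It only mirrors the idea, not the actual KFENCE initialization code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NUM_OBJECTS 8

    int main(void)
    {
            unsigned int order[NUM_OBJECTS];
            unsigned int i;

            srand((unsigned int)time(NULL));

            for (i = 0; i < NUM_OBJECTS; i++)
                    order[i] = i;

            /* Fisher-Yates shuffle; rand() % n stands in for get_random_u32_below(n). */
            for (i = NUM_OBJECTS - 1; i > 0; i--) {
                    unsigned int j = (unsigned int)rand() % (i + 1);
                    unsigned int tmp = order[i];

                    order[i] = order[j];
                    order[j] = tmp;
            }

            /* Add metadata objects to the freelist in the shuffled order. */
            for (i = 0; i < NUM_OBJECTS; i++)
                    printf("freelist <- metadata[%u]\n", order[i]);

            return 0;
    }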
2026-02-06  Revert "mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()"  Harry Yoo  1  -43/+58
This reverts commit 91750c8a4be42d73b6810a1c35d73c8a3cd0b481 which is commit 670ddd8cdcbd1d07a4571266ae3517f821728c3a upstream. While the commit fixes a race condition between NUMA balancing and THP migration, it causes a NULL-pointer-deref when the pmd temporarily transitions from pmd_trans_huge() to pmd_none(). Verifying whether the pmd value has changed under page table lock does not prevent the crash, as it occurs when acquiring the lock. Since the original issue addressed by the commit is quite rare and non-fatal, revert the commit. A better backport solution that more closely matches the upstream semantics will be provided as a follow-up. Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06  mm: kmsan: fix poisoning of high-order non-compound pages  Ryan Roberts  1  -2/+1
[ Upstream commit 4795d205d78690a46b60164f44b8bb7b3e800865 ] kmsan_free_page() is called by the page allocator's free_pages_prepare() during page freeing. Its job is to poison all the memory covered by the page. It can be called with an order-0 page, a compound high-order page or a non-compound high-order page. But page_size() only works for order-0 and compound pages. For a non-compound high-order page it will incorrectly return PAGE_SIZE. The implication is that the tail pages of a high-order non-compound page do not get poisoned at free, so any invalid access while they are free could go unnoticed. It looks like the pages will be poisoned again at allocation time, so that would bookend the window. Fix this by using the order parameter to calculate the size. Link: https://lkml.kernel.org/r/20260104134348.3544298-1-ryan.roberts@arm.com Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Alexander Potapenko <glider@google.com> Tested-by: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
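A small runnable C illustration of the arithmetic the fix relies on (assuming a 4 KiB base page; the macros are local stand-ins, not kernel headers): for a non-compound high-order page only the order tells you how much memory to poison, whereas a page_size()-style lookup would report a single page.

    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    int main(void)
    {
            unsigned int order;

            for (order = 0; order <= 3; order++) {
                    /* What the fix computes: size derived from the order. */
                    unsigned long from_order = PAGE_SIZE << order;
                    /* What page_size() reports for a non-compound page: one page. */
                    unsigned long from_page_size = PAGE_SIZE;

                    printf("order %u: poison %lu bytes (a per-page lookup would say %lu)\n",
                           order, from_order, from_page_size);
            }
            return 0;
    }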
2026-02-06  mm/page_alloc: prevent pcp corruption with SMP=n  Vlastimil Babka  1  -4/+33
[ Upstream commit 038a102535eb49e10e93eafac54352fcc5d78847 ] The kernel test robot has reported:

  BUG: spinlock trylock failure on UP on CPU#0, kcompactd0/28
  lock: 0xffff888807e35ef0, .magic: dead4ead, .owner: kcompactd0/28, .owner_cpu: 0
  CPU: 0 UID: 0 PID: 28 Comm: kcompactd0 Not tainted 6.18.0-rc5-00127-ga06157804399 #1 PREEMPT 8cc09ef94dcec767faa911515ce9e609c45db470
  Call Trace:
  <IRQ>
  __dump_stack (lib/dump_stack.c:95)
  dump_stack_lvl (lib/dump_stack.c:123)
  dump_stack (lib/dump_stack.c:130)
  spin_dump (kernel/locking/spinlock_debug.c:71)
  do_raw_spin_trylock (kernel/locking/spinlock_debug.c:?)
  _raw_spin_trylock (include/linux/spinlock_api_smp.h:89 kernel/locking/spinlock.c:138)
  __free_frozen_pages (mm/page_alloc.c:2973)
  ___free_pages (mm/page_alloc.c:5295)
  __free_pages (mm/page_alloc.c:5334)
  tlb_remove_table_rcu (include/linux/mm.h:? include/linux/mm.h:3122 include/asm-generic/tlb.h:220 mm/mmu_gather.c:227 mm/mmu_gather.c:290)
  ? __cfi_tlb_remove_table_rcu (mm/mmu_gather.c:289)
  ? rcu_core (kernel/rcu/tree.c:?)
  rcu_core (include/linux/rcupdate.h:341 kernel/rcu/tree.c:2607 kernel/rcu/tree.c:2861)
  rcu_core_si (kernel/rcu/tree.c:2879)
  handle_softirqs (arch/x86/include/asm/jump_label.h:36 include/trace/events/irq.h:142 kernel/softirq.c:623)
  __irq_exit_rcu (arch/x86/include/asm/jump_label.h:36 kernel/softirq.c:725)
  irq_exit_rcu (kernel/softirq.c:741)
  sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1052)
  </IRQ>
  <TASK>
  RIP: 0010:_raw_spin_unlock_irqrestore (arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
  free_pcppages_bulk (mm/page_alloc.c:1494)
  drain_pages_zone (include/linux/spinlock.h:391 mm/page_alloc.c:2632)
  __drain_all_pages (mm/page_alloc.c:2731)
  drain_all_pages (mm/page_alloc.c:2747)
  kcompactd (mm/compaction.c:3115)
  kthread (kernel/kthread.c:465)
  ? __cfi_kcompactd (mm/compaction.c:3166)
  ? __cfi_kthread (kernel/kthread.c:412)
  ret_from_fork (arch/x86/kernel/process.c:164)
  ? __cfi_kthread (kernel/kthread.c:412)
  ret_from_fork_asm (arch/x86/entry/entry_64.S:255)
  </TASK>

Matthew has analyzed the report and identified that in drain_pages_zone() we are in a section protected by spin_lock(&pcp->lock) and then get an interrupt that attempts spin_trylock() on the same lock. The code is designed to work this way without disabling IRQs and occasionally fail the trylock with a fallback. However, the SMP=n spinlock implementation assumes spin_trylock() will always succeed, and thus it's normally a no-op. Here the enabled lock debugging catches the problem, but otherwise it could cause a corruption of the pcp structure. The problem has been introduced by commit 574907741599 ("mm/page_alloc: leave IRQs enabled for per-cpu page allocations"). The pcp locking scheme recognizes the need for disabling IRQs to prevent nesting spin_trylock() sections on SMP=n, but the need to prevent the nesting in spin_lock() has not been recognized. Fix it by introducing local wrappers that change the spin_lock() to spin_lock_irqsave() with SMP=n and use them in all places that do spin_lock(&pcp->lock).
[vbabka@suse.cz: add pcp_ prefix to the spin_lock_irqsave wrappers, per Steven] Link: https://lkml.kernel.org/r/20260105-fix-pcp-up-v1-1-5579662d2071@suse.cz Fixes: 574907741599 ("mm/page_alloc: leave IRQs enabled for per-cpu page allocations") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202512101320.e2f2dd6f-lkp@intel.com Analyzed-by: Matthew Wilcox <willy@infradead.org> Link: https://lore.kernel.org/all/aUW05pyc9nZkvY-1@casper.infradead.org/ Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ drop changes to decay_pcp_high() and zone_pcp_update_cacheinfo() ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
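The changelog names pcp_-prefixed spin_lock_irqsave wrappers; the fragment below is a hedged, kernel-style sketch of the shape such wrappers can take, not the actual diff. On SMP the plain spin_lock() can stay because an interrupting spin_trylock() may simply fail and take the fallback path, while on SMP=n (where trylock is assumed to always succeed) IRQs must be disabled around the critical section to prevent the nesting.

    /* Sketch only; the names follow the changelog, not necessarily the final code. */
    #ifdef CONFIG_SMP
    /* A trylock from IRQ context can fail gracefully, so IRQs may stay enabled. */
    #define pcp_spin_lock_irqsave(lock, flags) \
            do { (void)(flags); spin_lock(lock); } while (0)
    #define pcp_spin_unlock_irqrestore(lock, flags) \
            do { (void)(flags); spin_unlock(lock); } while (0)
    #else
    /* UP spin_trylock() "always succeeds", so nested sections must be prevented. */
    #define pcp_spin_lock_irqsave(lock, flags) \
            spin_lock_irqsave(lock, flags)
    #define pcp_spin_unlock_irqrestore(lock, flags) \
            spin_unlock_irqrestore(lock, flags)
    #endif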
2026-02-06  mm/rmap: fix two comments related to huge_pmd_unshare()  David Hildenbrand (Red Hat)  1  -16/+4
[ Upstream commit a8682d500f691b6dfaa16ae1502d990aeb86e8be ] PMD page table unsharing no longer touches the refcount of a PMD page table. Also, it is not about dropping the refcount of a "PMD page" but the "PMD page table". Let's just simplify by saying that the PMD page table was unmapped, consequently also unmapping the folio that was mapped into this page. This code should be deduplicated in the future. Link: https://lkml.kernel.org/r/20251223214037.580860-4-david@kernel.org Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count") Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Rik van Riel <riel@surriel.com> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Uschakow, Stanislav" <suschako@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06  mm/damon/sysfs-scheme: cleanup access_pattern subdirs on scheme dir setup failure  SeongJae Park  1  -2/+3
commit 392b3d9d595f34877dd745b470c711e8ebcd225c upstream. When a DAMOS-scheme DAMON sysfs directory setup fails after setup of access_pattern/ directory, subdirectories of access_pattern/ directory are not cleaned up. As a result, DAMON sysfs interface is nearly broken until the system reboots, and the memory for the unremoved directory is leaked. Cleanup the directories under such failures. Link: https://lkml.kernel.org/r/20251225023043.18579-5-sj@kernel.org Fixes: 9bbb820a5bd5 ("mm/damon/sysfs: support DAMOS quotas") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: chongjiapeng <jiapeng.chong@linux.alibaba.com> Cc: <stable@vger.kernel.org> # 5.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06  mm/damon/sysfs-scheme: cleanup quotas subdirs on scheme dir setup failure  SeongJae Park  1  -2/+3
commit dc7e1d75fd8c505096d0cddeca9e2efb2b55aaf9 upstream. When a DAMOS-scheme DAMON sysfs directory setup fails after setup of quotas/ directory, subdirectories of quotas/ directory are not cleaned up. As a result, DAMON sysfs interface is nearly broken until the system reboots, and the memory for the unremoved directory is leaked. Cleanup the directories under such failures. Link: https://lkml.kernel.org/r/20251225023043.18579-4-sj@kernel.org Fixes: 1b32234ab087 ("mm/damon/sysfs: support DAMOS watermarks") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: chongjiapeng <jiapeng.chong@linux.alibaba.com> Cc: <stable@vger.kernel.org> # 5.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06  migrate: correct lock ordering for hugetlb file folios  Matthew Wilcox (Oracle)  1  -6/+6
commit b7880cb166ab62c2409046b2347261abf701530e upstream. Syzbot has found a deadlock (analyzed by Lance Yang): 1) Task (5749): Holds folio_lock, then tries to acquire i_mmap_rwsem(read lock). 2) Task (5754): Holds i_mmap_rwsem(write lock), then tries to acquire folio_lock. migrate_pages() -> migrate_hugetlbs() -> unmap_and_move_huge_page() <- Takes folio_lock! -> remove_migration_ptes() -> __rmap_walk_file() -> i_mmap_lock_read() <- Waits for i_mmap_rwsem(read lock)! hugetlbfs_fallocate() -> hugetlbfs_punch_hole() <- Takes i_mmap_rwsem(write lock)! -> hugetlbfs_zero_partial_page() -> filemap_lock_hugetlb_folio() -> filemap_lock_folio() -> __filemap_get_folio <- Waits for folio_lock! The migration path is the one taking locks in the wrong order according to the documentation at the top of mm/rmap.c. So expand the scope of the existing i_mmap_lock to cover the calls to remove_migration_ptes() too. This is (mostly) how it used to be after commit c0d0381ade79. That was removed by 336bf30eb765 for both file & anon hugetlb pages when it should only have been removed for anon hugetlb pages. Link: https://lkml.kernel.org/r/20260109041345.3863089-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Fixes: 336bf30eb765 ("hugetlbfs: fix anon huge page migration race") Reported-by: syzbot+2d9c96466c978346b55f@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/68e9715a.050a0220.1186a4.000d.GAE@google.com Debugged-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Jann Horn <jannh@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Ying Huang <ying.huang@linux.alibaba.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
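The deadlock analyzed above is a classic AB-BA cycle between two locks. The runnable userspace C sketch below uses plain mutexes as stand-ins for the folio lock and i_mmap_rwsem and only illustrates the general principle the fix restores: both paths must agree on a single acquisition order, so neither can block the other mid-sequence. Reversing the order in one of the two threads is exactly the pattern that can hang.

    #include <pthread.h>
    #include <stdio.h>

    /* Stand-ins only: lock_a ~ i_mmap_rwsem, lock_b ~ folio lock. */
    static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

    static void *path_one(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&lock_a);    /* same order as path_two */
            pthread_mutex_lock(&lock_b);
            puts("path one: a then b");
            pthread_mutex_unlock(&lock_b);
            pthread_mutex_unlock(&lock_a);
            return NULL;
    }

    static void *path_two(void *arg)
    {
            (void)arg;
            pthread_mutex_lock(&lock_a);    /* taking lock_b first here can deadlock */
            pthread_mutex_lock(&lock_b);
            puts("path two: a then b");
            pthread_mutex_unlock(&lock_b);
            pthread_mutex_unlock(&lock_a);
            return NULL;
    }

    int main(void)
    {
            pthread_t t1, t2;

            pthread_create(&t1, NULL, path_one, NULL);
            pthread_create(&t2, NULL, path_two, NULL);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            return 0;   /* consistent ordering: no AB-BA cycle possible */
    }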
2026-02-06  mm/damon/sysfs: cleanup attrs subdirs on context dir setup failure  SeongJae Park  1  -2/+3
commit 9814cc832b88bd040fc2a1817c2b5469d0f7e862 upstream. When a context DAMON sysfs directory setup is failed after setup of attrs/ directory, subdirectories of attrs/ directory are not cleaned up. As a result, DAMON sysfs interface is nearly broken until the system reboots, and the memory for the unremoved directory is leaked. Cleanup the directories under such failures. Link: https://lkml.kernel.org/r/20251225023043.18579-3-sj@kernel.org Fixes: c951cd3b8901 ("mm/damon: implement a minimal stub for sysfs-based DAMON interface") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: chongjiapeng <jiapeng.chong@linux.alibaba.com> Cc: <stable@vger.kernel.org> # 5.18.x Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-02-06  mm/page_alloc: make percpu_pagelist_high_fraction reads lock-free  Aboorva Devarajan  1  -1/+9
commit b9efe36b5e3eb2e91aa3d706066428648af034fc upstream. When page isolation loops indefinitely during memory offline, reading /proc/sys/vm/percpu_pagelist_high_fraction blocks on pcp_batch_high_lock, causing hung task warnings. Make procfs reads lock-free since percpu_pagelist_high_fraction is a simple integer with naturally atomic reads, writers still serialize via the mutex. This prevents hung task warnings when reading the procfs file during long-running memory offline operations. [akpm@linux-foundation.org: add comment, per Michal] Link: https://lkml.kernel.org/r/aS_y9AuJQFydLEXo@tiehlicka Link: https://lkml.kernel.org/r/20251201060009.1420792-1-aboorvad@linux.ibm.com Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
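A hedged, kernel-style sketch of the handler shape described above (not the exact mm/page_alloc.c code; the handler name and the elided recompute step are placeholders): only writers take pcp_batch_high_lock, while a read simply copies the integer, whose aligned load is naturally atomic.

    /* Sketch only: illustrative handler shape, not the upstream diff. */
    static int high_fraction_sysctl_handler(struct ctl_table *table, int write,
                                            void *buffer, size_t *length, loff_t *ppos)
    {
            int ret;

            if (!write)
                    /* Reading one int needs no lock; the value is read atomically. */
                    return proc_dointvec_minmax(table, write, buffer, length, ppos);

            mutex_lock(&pcp_batch_high_lock);
            ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
            if (!ret)
                    ;       /* recompute per-zone pcp->high under the lock (elided) */
            mutex_unlock(&pcp_batch_high_lock);
            return ret;
    }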
2026-02-06  x86/kaslr: Recognize all ZONE_DEVICE users as physaddr consumers  Dan Williams  1  -4/+8
commit 269031b15c1433ff39e30fa7ea3ab8f0be9d6ae2 upstream. Commit 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems") is too narrow. The effect being mitigated in that commit is caused by ZONE_DEVICE which PCI_P2PDMA has a dependency. ZONE_DEVICE, in general, lets any physical address be added to the direct-map. I.e. not only ACPI hotplug ranges, CXL Memory Windows, or EFI Specific Purpose Memory, but also any PCI MMIO range for the DEVICE_PRIVATE and PCI_P2PDMA cases. Update the mitigation, limit KASLR entropy, to apply in all ZONE_DEVICE=y cases. Distro kernels typically have PCI_P2PDMA=y, so the practical exposure of this problem is limited to the PCI_P2PDMA=n case. A potential path to recover entropy would be to walk ACPI and determine the limits for hotplug and PCI MMIO before kernel_randomize_memory(). On smaller systems that could yield some KASLR address bits. This needs additional investigation to determine if some limited ACPI table scanning can happen this early without an open coded solution like arch/x86/boot/compressed/acpi.c needs to deploy. Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Fixes: 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems") Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Balbir Singh <balbirs@nvidia.com> Tested-by: Yasunori Goto <y-goto@fujitsu.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: http://patch.msgid.link/692e08b2516d4_261c1100a3@dwillia2-mobl4.notmuch Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-17  ksm: use range-walk function to jump over holes in scan_get_next_rmap_item  Pedro Demarchi Gomes  1  -13/+113
[ Upstream commit f5548c318d6520d4fa3c5ed6003eeb710763cbc5 ] Currently, scan_get_next_rmap_item() walks every page address in a VMA to locate mergeable pages. This becomes highly inefficient when scanning large virtual memory areas that contain mostly unmapped regions, causing ksmd to use a large amount of CPU without deduplicating many pages. This patch replaces the per-address lookup with a range walk using walk_page_range(). The range walker allows KSM to skip over entire unmapped holes in a VMA, avoiding unnecessary lookups. This problem was previously discussed in [1]. Consider the following test program which creates a 32 TiB mapping in the virtual address space but only populates a single page:

  /* 32 TiB */
  const size_t size = 32ul * 1024 * 1024 * 1024 * 1024;

  int main()
  {
          char *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_NORESERVE | MAP_PRIVATE | MAP_ANON, -1, 0);
          if (area == MAP_FAILED) {
                  perror("mmap() failed\n");
                  return -1;
          }

          /* Populate a single page such that we get an anon_vma. */
          *area = 0;

          /* Enable KSM. */
          madvise(area, size, MADV_MERGEABLE);
          pause();
          return 0;
  }

  $ ./ksm-sparse &
  $ echo 1 > /sys/kernel/mm/ksm/run

Without this patch ksmd uses 100% of the cpu for a long time (more than 1 hour in my test machine) scanning all the 32 TiB virtual address space that contains only one mapped page. This makes ksmd essentially deadlocked, not able to deduplicate anything of value. With this patch ksmd walks only the one mapped page and skips the rest of the 32 TiB virtual address space, making the scan fast while using little CPU. Link: https://lkml.kernel.org/r/20251023035841.41406-1-pedrodemargomes@gmail.com Link: https://lkml.kernel.org/r/20251022153059.22763-1-pedrodemargomes@gmail.com Link: https://lore.kernel.org/linux-mm/423de7a3-1c62-4e72-8e79-19a6413e420c@redhat.com/ [1] Fixes: 31dbd01f3143 ("ksm: Kernel SamePage Merging") Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: craftfever <craftfever@airmail.cc> Closes: https://lkml.kernel.org/r/020cf8de6e773bb78ba7614ef250129f11a63781@murena.io Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: xu xin <xu.xin16@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ replace pmdp_get_lockless with pmd_read_atomic and pmdp_get with READ_ONCE(*pmdp) ] Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-17  mm/pagewalk: add walk_page_range_vma()  David Hildenbrand  1  -0/+20
[ Upstream commit e07cda5f232fac4de0925d8a4c92e51e41fa2f6e ] Let's add walk_page_range_vma(), which is similar to walk_page_vma(), however, is only interested in a subset of the VMA range. To be used in KSM code to stop using follow_page() next. Link: https://lkml.kernel.org/r/20221021101141.84170-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: f5548c318d6 ("ksm: use range-walk function to jump over holes in scan_get_next_rmap_item") Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
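A hedged, kernel-style usage sketch of the new helper (the callback and its body are illustrative, not the eventual KSM user): the caller supplies mm_walk_ops and walks only the [start, end) subset of a single VMA; because this is a page-table walk, unmapped holes are skipped at the upper levels rather than being probed address by address.

    /* Sketch only: shows the intended calling convention of walk_page_range_vma(). */
    static int my_pmd_entry(pmd_t *pmd, unsigned long addr,
                            unsigned long next, struct mm_walk *walk)
    {
            /* Reached only for populated pmd ranges; holes are skipped. */
            return 0;
    }

    static const struct mm_walk_ops my_walk_ops = {
            .pmd_entry = my_pmd_entry,
    };

    static void walk_part_of_vma(struct vm_area_struct *vma,
                                 unsigned long start, unsigned long end)
    {
            /* mmap_lock must be held by the caller, as for the other walkers. */
            walk_page_range_vma(vma, start, end, &my_walk_ops, NULL);
    }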
2026-01-11  mm: (un)track_pfn_copy() fix + doc improvements  David Hildenbrand  1  -1/+1
commit 8c56c5dbcf52220cc9be7a36e7f21ebd5939e0b9 upstream. We got a late smatch warning and some additional review feedback. smatch warnings: mm/memory.c:1428 copy_page_range() error: uninitialized symbol 'pfn'. We actually use the pfn only when it is properly initialized; however, we may pass an uninitialized value to a function -- although it will not use it that likely still is UB in C. So let's just fix it by always initializing pfn in the caller of track_pfn_copy(), and improving the documentation of track_pfn_copy(). While at it, clarify the doc of untrack_pfn_copy(), that internal checks make sure if we actually have to untrack anything. Link: https://lkml.kernel.org/r/20250408085950.976103-1-david@redhat.com Fixes: dc84bc2aba85 ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/r/202503270941.IFILyNCX-lkp@intel.com/ Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Rik van Riel <riel@surriel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()  Hugh Dickins  1  -58/+43
commit 670ddd8cdcbd1d07a4571266ae3517f821728c3a upstream. change_pmd_range() had special pmd_none_or_clear_bad_unless_trans_huge(), required to avoid "bad" choices when setting automatic NUMA hinting under mmap_read_lock(); but most of that is already covered in pte_offset_map() now. change_pmd_range() just wants a pmd_none() check before wasting time on MMU notifiers, then checks on the read-once _pmd value to work out what's needed for huge cases. If change_pte_range() returns -EAGAIN to retry if pte_offset_map_lock() fails, nothing more special is needed. Link: https://lkml.kernel.org/r/725a42a9-91e9-c868-925-e3a5fd40bb4f@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: SeongJae Park <sj@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zack Rusin <zackr@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Background: It was reported that a bad pmd is seen when automatic NUMA balancing is marking page table entries as prot_numa: [2437548.196018] mm/pgtable-generic.c:50: bad pmd 00000000af22fc02(dffffffe71fbfe02) [2437548.235022] Call Trace: [2437548.238234] <TASK> [2437548.241060] dump_stack_lvl+0x46/0x61 [2437548.245689] panic+0x106/0x2e5 [2437548.249497] pmd_clear_bad+0x3c/0x3c [2437548.253967] change_pmd_range.isra.0+0x34d/0x3a7 [2437548.259537] change_p4d_range+0x156/0x20e [2437548.264392] change_protection_range+0x116/0x1a9 [2437548.269976] change_prot_numa+0x15/0x37 [2437548.274774] task_numa_work+0x1b8/0x302 [2437548.279512] task_work_run+0x62/0x95 [2437548.283882] exit_to_user_mode_loop+0x1a4/0x1a9 [2437548.289277] exit_to_user_mode_prepare+0xf4/0xfc [2437548.294751] ? sysvec_apic_timer_interrupt+0x34/0x81 [2437548.300677] irqentry_exit_to_user_mode+0x5/0x25 [2437548.306153] asm_sysvec_apic_timer_interrupt+0x16/0x1b This is due to a race condition between change_prot_numa() and THP migration because the kernel doesn't check is_swap_pmd() and pmd_trans_huge() atomically: change_prot_numa() THP migration ====================================================================== - change_pmd_range() -> is_swap_pmd() returns false, meaning it's not a PMD migration entry. - do_huge_pmd_numa_page() -> migrate_misplaced_page() sets migration entries for the THP. 
- change_pmd_range() -> pmd_none_or_clear_bad_unless_trans_huge() -> pmd_none() and pmd_trans_huge() returns false - pmd_none_or_clear_bad_unless_trans_huge() -> pmd_bad() returns true for the migration entry! The upstream commit 670ddd8cdcbd ("mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()") closes this race condition by checking is_swap_pmd() and pmd_trans_huge() atomically. Backporting note: Unlike the mainline, pte_offset_map_lock() does not check if the pmd entry is a migration entry or a hugepage; acquires PTL unconditionally instead of returning failure. Therefore, it is necessary to keep the !is_swap_pmd() && !pmd_trans_huge() && !pmd_devmap() check before acquiring the PTL. After acquiring the lock, open-code the semantics of pte_offset_map_lock() in the mainline kernel; change_pte_range() fails if the pmd value has changed. This requires adding pmd_old parameter (pmd_t value that is read before calling the function) to change_pte_range(). ] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/mprotect: use long for page accountings and retval  Peter Xu  3  -16/+16
commit a79390f5d6a78647fd70856bd42b22d994de0ba2 upstream. Switch to use type "long" for page accountings and retval across the whole procedure of change_protection(). The change should have shrinked the possible maximum page number to be half comparing to previous (ULONG_MAX / 2), but it shouldn't overflow on any system either because the maximum possible pages touched by change protection should be ULONG_MAX / PAGE_SIZE. Two reasons to switch from "unsigned long" to "long": 1. It suites better on count_vm_numa_events(), whose 2nd parameter takes a long type. 2. It paves way for returning negative (error) values in the future. Currently the only caller that consumes this retval is change_prot_numa(), where the unsigned long was converted to an int. Since at it, touching up the numa code to also take a long, so it'll avoid any possible overflow too during the int-size convertion. Link: https://lkml.kernel.org/r/20230104225207.1066932-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: James Houghton <jthoughton@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Adjust context ] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()  David Hildenbrand  1  -7/+4
[ Upstream commit dc84bc2aba85a1508f04a936f9f9a15f64ebfb31 ] If track_pfn_copy() fails, we already added the dst VMA to the maple tree. As fork() fails, we'll cleanup the maple tree, and stumble over the dst VMA for which we neither performed any reservation nor copied any page tables. Consequently untrack_pfn() will see VM_PAT and try obtaining the PAT information from the page table -- which fails because the page table was not copied. The easiest fix would be to simply clear the VM_PAT flag of the dst VMA if track_pfn_copy() fails. However, the whole thing is about "simply" clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy() and performed a reservation, but copying the page tables fails, we'll simply clear the VM_PAT flag, not properly undoing the reservation ... which is also wrong. So let's fix it properly: set the VM_PAT flag only if the reservation succeeded (leaving it clear initially), and undo the reservation if anything goes wrong while copying the page tables: clearing the VM_PAT flag after undoing the reservation. Note that any copied page table entries will get zapped when the VMA will get removed later, after copy_page_range() succeeded; as VM_PAT is not set then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be happy. Note that leaving these page tables in place without a reservation is not a problem, as we are aborting fork(); this process will never run. A reproducer can trigger this usually at the first try: https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110 Modules linked in: ... CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 RIP: 0010:get_pat_info+0xf6/0x110 ... Call Trace: <TASK> ... untrack_pfn+0x52/0x110 unmap_single_vma+0xa6/0xe0 unmap_vmas+0x105/0x1f0 exit_mmap+0xf6/0x460 __mmput+0x4b/0x120 copy_process+0x1bf6/0x2aa0 kernel_clone+0xab/0x440 __do_sys_clone+0x66/0x90 do_syscall_64+0x95/0x180 Likely this case was missed in: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed") ... and instead of undoing the reservation we simply cleared the VM_PAT flag. Keep the documentation of these functions in include/linux/pgtable.h, one place is more than sufficient -- we should clean that up for the other functions like track_pfn_remap/untrack_pfn separately. Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed") Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3") Reported-by: xingwei lee <xrivendell7@gmail.com> Reported-by: yuxin wang <wang1315768607@163.com> Reported-by: Marius Fleischer <fleischermarius@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: "H. 
Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20250321112323.153741-1-david@redhat.com Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/ Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/ Signed-off-by: Sasha Levin <sashal@kernel.org> Cc: stable@vger.kernel.org [ Ajay: Modified to apply on v6.1 ] Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  x86/mm/pat: clear VM_PAT if copy_p4d_range failed  Ma Wupeng  2  -1/+2
[ Upstream commit d155df53f31068c3340733d586eb9b3ddfd70fc5 ] Syzbot reports a warning in untrack_pfn(). Digging into the root we found that this is due to memory allocation failure in pmd_alloc_one. And this failure is produced due to failslab. In copy_page_range(), memory alloaction for pmd failed. During the error handling process in copy_page_range(), mmput() is called to remove all vmas. While untrack_pfn this empty pfn, warning happens. Here's a simplified flow: dup_mm dup_mmap copy_page_range copy_p4d_range copy_pud_range copy_pmd_range pmd_alloc __pmd_alloc pmd_alloc_one page = alloc_pages(gfp, 0); if (!page) return NULL; mmput exit_mmap unmap_vmas unmap_single_vma untrack_pfn follow_phys WARN_ON_ONCE(1); Since this vma is not generate successfully, we can clear flag VM_PAT. In this case, untrack_pfn() will not be called while cleaning this vma. Function untrack_pfn_moved() has also been renamed to fit the new logic. Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexander Ofitserov <oficerovas@altlinux.org> Cc: stable@vger.kernel.org [ Ajay: Modified to apply on v6.1 ] Signed-off-by: Ajay Kaher <ajay.kaher@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/balloon_compaction: convert balloon_page_delete() to balloon_page_finalize()  David Hildenbrand  1  -1/+2
[ Upstream commit 15504b1163007bbfbd9a63460d5c14737c16e96d ] Let's move the removal of the page from the balloon list into the single caller, to remove the dependency on the PG_isolated flag and clarify locking requirements. Note that for now, balloon_page_delete() was used on two paths: (1) Removing a page from the balloon for deflation through balloon_page_list_dequeue() (2) Removing an isolated page from the balloon for migration in the per-driver migration handlers. Isolated pages were already removed from the balloon list during isolation. So instead of relying on the flag, we can just distinguish both cases directly and handle it accordingly in the caller. We'll shuffle the operations a bit such that they logically make more sense (e.g., remove from the list before clearing flags). In balloon migration functions we can now move the balloon_page_finalize() out of the balloon lock and perform the finalization just before dropping the balloon reference. Document that the page lock is currently required when modifying the movability aspects of a page; hopefully we can soon decouple this from the page lock. Link: https://lkml.kernel.org/r/20250704102524.326966-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 0da2ba35c0d5 ("powerpc/pseries/cmm: adjust BALLOON_MIGRATE when migrating pages") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/balloon_compaction: we cannot have isolated pages in the balloon list  David Hildenbrand  1  -6/+0
[ Upstream commit fb05f992b6bbb4702307d96f00703ee637b24dbf ] Patch series "mm/migration: rework movable_ops page migration (part 1)", v2. In the future, as we decouple "struct page" from "struct folio", pages that support "non-lru page migration" -- movable_ops page migration such as memory balloons and zsmalloc -- will no longer be folios. They will not have ->mapping, ->lru, and likely no refcount and no page lock. But they will have a type and flags 🙂 This is the first part (other parts not written yet) of decoupling movable_ops page migration from folio migration. In this series, we get rid of the ->mapping usage, and start cleaning up the code + separating it from folio migration. Migration core will have to be further reworked to not treat movable_ops pages like folios. This is the first step into that direction. This patch (of 29): The core will set PG_isolated only after mops->isolate_page() was called. In case of the balloon, that is where we will remove it from the balloon list. So we cannot have isolated pages in the balloon list. Let's drop this unnecessary check. Link: https://lkml.kernel.org/r/20250704102524.326966-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Christian Brauner <brauner@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Eugenio Pé rez <eperezma@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jason Wang <jasowang@redhat.com> Cc: Jerrin Shaji George <jerrin.shaji-george@broadcom.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 0da2ba35c0d5 ("powerpc/pseries/cmm: adjust BALLOON_MIGRATE when migrating pages") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/damon/tests/core-kunit: handle alloc failures in damon_test_set_regions()  SeongJae Park  1  -2/+15
commit 74d5969995d129fd59dd93b9c7daa6669cb6810f upstream. damon_test_set_regions() is assuming all dynamic memory allocation in it will succeed. Those are indeed likely in the real use cases since those allocations are too small to fail, but theoretically those could fail. In the case, inappropriate memory access can happen. Fix it by appropriately cleanup pre-allocated memory and skip the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-11-sj@kernel.org Fixes: 62f409560eb2 ("mm/damon/core-test: test damon_set_regions") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [6.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
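The same defensive pattern recurs in the sibling DAMON kunit fixes below; the following is a hedged sketch of its shape (illustrative kernel-style test body, not any of the exact diffs): every small allocation is checked, anything already allocated is released, and the test case is skipped rather than dereferencing NULL.

    /* Sketch of the pattern these fixes apply (names simplified). */
    static void damon_test_something(struct kunit *test)
    {
            struct damon_target *t;
            struct damon_region *r;

            t = damon_new_target();
            if (!t)
                    kunit_skip(test, "target allocation failed");

            r = damon_new_region(0, 100);
            if (!r) {
                    damon_free_target(t);
                    kunit_skip(test, "region allocation failed");
            }
            damon_add_region(r, t);

            /* ... the actual assertions of the test case ... */

            /* tear down the target and its regions as the test already did on success */
    }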
2026-01-11  mm/damon/tests/core-kunit: handle memory alloc failure from damon_test_aggregate()  SeongJae Park  1  -0/+11
commit f79f2fc44ebd0ed655239046be3e80e8804b5545 upstream. damon_test_aggregate() is assuming all dynamic memory allocation in it will succeed. Those are indeed likely in the real use cases since those allocations are too small to fail, but theoretically those could fail. In the case, inappropriate memory access can happen. Fix it by appropriately cleanup pre-allocated memory and skip the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-5-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/damon/tests/core-kunit: handle alloc failures on damon_test_split_regions_of()  SeongJae Park  1  -0/+20
commit eded254cb69044bd4abde87394ea44909708d7c0 upstream. damon_test_split_regions_of() is assuming all dynamic memory allocation in it will succeed. Those are indeed likely in the real use cases since those allocations are too small to fail, but theoretically those could fail. In the case, inappropriate memory access can happen. Fix it by appropriately cleanup pre-allocated memory and skip the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-9-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/damon/tests/core-kunit: handle memory failure from damon_test_target()  SeongJae Park  1  -0/+7
commit fafe953de2c661907c94055a2497c6b8dbfd26f3 upstream. damon_test_target() is assuming all dynamic memory allocation in it will succeed. Those are indeed likely in the real use cases since those allocations are too small to fail, but theoretically those could fail. In the case, inappropriate memory access can happen. Fix it by appropriately cleanup pre-allocated memory and skip the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-4-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/damon/tests/core-kunit: handle alloc failures on damon_test_merge_two()  SeongJae Park  1  -0/+10
commit 3d443dd29a1db7efa587a4bb0c06a497e13ca9e4 upstream. damon_test_merge_two() is assuming all dynamic memory allocation in it will succeed. Those are indeed likely in the real use cases since those allocations are too small to fail, but theoretically those could fail. In the case, inappropriate memory access can happen. Fix it by appropriately cleanup pre-allocated memory and skip the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-7-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/damon/tests/core-kunit: handle alloc failures on damon_test_merge_regions_of()  SeongJae Park  1  -0/+6
commit 0998d2757218771c59d5ca59ccf13d1542a38f17 upstream. damon_test_merge_regions_of() is assuming all dynamic memory allocation in it will succeed. Those are indeed likely in the real use cases since those allocations are too small to fail, but theoretically those could fail. In the case, inappropriate memory access can happen. Fix it by appropriately cleanup pre-allocated memory and skip the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-8-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/damon/tests/core-kunit: handle alloc failures on damon_test_split_at()  SeongJae Park  1  -0/+11
commit 5e80d73f22043c59c8ad36452a3253937ed77955 upstream. damon_test_split_at() is assuming all dynamic memory allocation in it will succeed. Those are indeed likely in the real use cases since those allocations are too small to fail, but theoretically those could fail. In the case, inappropriate memory access can happen. Fix it by appropriately cleanup pre-allocated memory and skip the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-6-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/damon/tests/core-kunit: handle allocation failures in damon_test_regions()  SeongJae Park  1  -0/+6
commit e16fdd4f754048d6e23c56bd8d920b71e41e3777 upstream. damon_test_regions() is assuming all dynamic memory allocation in it will succeed. Those are indeed likely in the real use cases since those allocations are too small to fail, but theoretically those could fail. In the case, inappropriate memory access can happen. Fix it by appropriately cleanup pre-allocated memory and skip the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-3-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/damon/tests/vaddr-kunit: handle alloc failures on damon_test_split_evenly_succ()  SeongJae Park  1  -1/+8
commit 0a63a0e7570b9b2631dfb8d836dc572709dce39e upstream. damon_test_split_evenly_succ() is assuming all dynamic memory allocation in it will succeed. Those are indeed likely in the real use cases since those allocations are too small to fail, but theoretically those could fail. In the case, inappropriate memory access can happen. Fix it by appropriately cleanup pre-allocated memory and skip the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-20-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/damon/tests/vaddr-kunit: handle alloc failures on damon_do_test_apply_three_regions()  SeongJae Park  1  -0/+6
commit 2b22d0fcc6320ba29b2122434c1d2f0785fb0a25 upstream. damon_do_test_apply_three_regions() is assuming all dynamic memory allocation in it will succeed. Those are indeed likely in the real use cases since those allocations are too small to fail, but theoretically those could fail. In the case, inappropriate memory access can happen. Fix it by appropriately cleanup pre-allocated memory and skip the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-18-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11  mm/damon/tests/vaddr-kunit: handle alloc failures in damon_test_split_evenly_fail()  SeongJae Park  1  -1/+10
commit 7890e5b5bb6e386155c6e755fe70e0cdcc77f18e upstream. damon_test_split_evenly_fail() is assuming all dynamic memory allocation in it will succeed. Those are indeed likely in the real use cases since those allocations are too small to fail, but theoretically those could fail. In the case, inappropriate memory access can happen. Fix it by appropriately cleanup pre-allocated memory and skip the execution of the remaining tests in the failure cases. Link: https://lkml.kernel.org/r/20251101182021.74868-19-sj@kernel.org Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: <stable@vger.kernel.org> [5.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2026-01-11kasan: refactor pcpu kasan vmalloc unpoisonMaciej Wieczor-Retman2-3/+18
commit 6f13db031e27e88213381039032a9cc061578ea6 upstream. A KASAN tag mismatch, possibly causing a kernel panic, can be observed on systems with a tag-based KASAN enabled and with multiple NUMA nodes. It was reported on arm64 and reproduced on x86. It can be explained in the following points: 1. There can be more than one virtual memory chunk. 2. Chunk's base address has a tag. 3. The base address points at the first chunk and thus inherits the tag of the first chunk. 4. The subsequent chunks will be accessed with the tag from the first chunk. 5. Thus, the subsequent chunks need to have their tag set to match that of the first chunk. Refactor code by reusing __kasan_unpoison_vmalloc in a new helper in preparation for the actual fix. Link: https://lkml.kernel.org/r/eb61d93b907e262eefcaa130261a08bcb6c5ce51.1764874575.git.m.wieczorretman@pm.me Fixes: 1d96320f8d53 ("kasan, vmalloc: add vmalloc tagging for SW_TAGS") Signed-off-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Cc: Kees Cook <kees@kernel.org> Cc: Marco Elver <elver@google.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> [6.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
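In other words, the base address handed back to percpu code carries the tag of chunk 0, while later chunks were unpoisoned with their own random tags. A conceptual illustration of the mismatch (not the upstream diff; names are assumptions):

    void *base = vms[0]->addr;              /* pointer tag taken from chunk 0 */
    u8 *p = (u8 *)base + chunk_size;        /* arithmetic lands inside chunk 1 */

    *p = 0;                                 /* chunk 1 memory carries a different
                                               tag -> KASAN tag mismatch report */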
2025-12-07filemap: cap PTE range to be created to allowed zero fill in folio_map_range()Pankaj Raghav1-1/+6
[ Upstream commit 743a2753a02e805347969f6f89f38b736850d808 ] Usually the page cache does not extend beyond the size of the inode, therefore, no PTEs are created for folios that extend beyond the size. But with LBS support, we might extend page cache beyond the size of the inode as we need to guarantee folios of minimum order. While doing a read, do_fault_around() can create PTEs for pages that lie beyond the EOF leading to incorrect error return when accessing a page beyond the mapped file. Cap the PTE range to be created for the page cache up to the end of file(EOF) in filemap_map_pages() so that return error codes are consistent with POSIX[1] for LBS configurations. generic/749 has been created to trigger this edge case. This also fixes generic/749 for tmpfs with huge=always on systems with 4k base page size. [1](from mmap(2)) SIGBUS Attempted access to a page of the buffer that lies beyond the end of the mapped file. For an explanation of the treatment of the bytes in the page that corresponds to the end of a mapped file that is not a multiple of the page size, see NOTES. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20240822135018.1931258-6-kernel@pankajraghav.com Tested-by: David Howells <dhowells@redhat.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Daniel Gomez <da.gomez@samsung.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
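The clamp itself is small; a sketch of the kind of check added in filemap_map_pages(), with variable names assumed from the description above:

    /* Never create PTEs for pages beyond the last page containing i_size. */
    file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
    if (end_pgoff > file_end)
            end_pgoff = file_end;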
2025-12-07mm/mempool: fix poisoning order>0 pages with HIGHMEMVlastimil Babka1-6/+26
[ Upstream commit ec33b59542d96830e3c89845ff833cf7b25ef172 ] The kernel test has reported: BUG: unable to handle page fault for address: fffba000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page *pde = 03171067 *pte = 00000000 Oops: Oops: 0002 [#1] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G T 6.18.0-rc2-00031-gec7f31b2a2d3 #1 NONE a1d066dfe789f54bc7645c7989957d2bdee593ca Tainted: [T]=RANDSTRUCT Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 EIP: memset (arch/x86/include/asm/string_32.h:168 arch/x86/lib/memcpy_32.c:17) Code: a5 8b 4d f4 83 e1 03 74 02 f3 a4 83 c4 04 5e 5f 5d 2e e9 73 41 01 00 90 90 90 3e 8d 74 26 00 55 89 e5 57 56 89 c6 89 d0 89 f7 <f3> aa 89 f0 5e 5f 5d 2e e9 53 41 01 00 cc cc cc 55 89 e5 53 57 56 EAX: 0000006b EBX: 00000015 ECX: 001fefff EDX: 0000006b ESI: fffb9000 EDI: fffba000 EBP: c611fbf0 ESP: c611fbe8 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00010287 CR0: 80050033 CR2: fffba000 CR3: 0316e000 CR4: 00040690 Call Trace: poison_element (mm/mempool.c:83 mm/mempool.c:102) mempool_init_node (mm/mempool.c:142 mm/mempool.c:226) mempool_init_noprof (mm/mempool.c:250 (discriminator 1)) ? mempool_alloc_pages (mm/mempool.c:640) bio_integrity_initfn (block/bio-integrity.c:483 (discriminator 8)) ? mempool_alloc_pages (mm/mempool.c:640) do_one_initcall (init/main.c:1283) Christoph found out this is due to the poisoning code not dealing properly with CONFIG_HIGHMEM because only the first page is mapped but then the whole potentially high-order page is accessed. We could give up on HIGHMEM here, but it's straightforward to fix this with a loop that's mapping, poisoning or checking and unmapping individual pages. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202511111411.9ebfa1ba-lkp@intel.com Analyzed-by: Christoph Hellwig <hch@lst.de> Fixes: bdfedb76f4f5 ("mm, mempool: poison elements backed by slab allocator") Cc: stable@vger.kernel.org Tested-by: kernel test robot <oliver.sang@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://patch.msgid.link/20251113-mempool-poison-v1-1-233b3ef984c3@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
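The idea of the fix is to stop assuming that a single kmap covers the whole high-order element and instead work one page at a time. A simplified sketch of the poisoning side (the real code also preserves the POISON_END marker in the last byte and has a matching per-page check path):

    static void poison_element_pages(struct page *page, int order)
    {
            int i;

            /* Map, poison and unmap each page separately so HIGHMEM pages work. */
            for (i = 0; i < (1 << order); i++) {
                    void *addr = kmap_local_page(page + i);

                    memset(addr, POISON_FREE, PAGE_SIZE);
                    kunmap_local(addr);
            }
    }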
2025-12-07mm/mempool: replace kmap_atomic() with kmap_local_page()Fabio M. De Francesco1-4/+4
[ Upstream commit f2bcc99a5e901a13b754648d1dbab60f4adf9375 ] kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page(). kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. The code blocks between the mappings and un-mappings don't rely on the above-mentioned side effects of kmap_atomic(), so that mere replacements of the old API with the new one is all that they require (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231120142640.7077-1-fabio.maria.de.francesco@linux.intel.com Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: ec33b59542d9 ("mm/mempool: fix poisoning order>0 pages with HIGHMEM") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
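Where the code between the mapping and the unmapping does not depend on preemption or page faults being disabled, the conversion is mechanical, roughly:

    /* before */
    addr = kmap_atomic(page);
    memset(addr, POISON_FREE, PAGE_SIZE);
    kunmap_atomic(addr);

    /* after */
    addr = kmap_local_page(page);
    memset(addr, POISON_FREE, PAGE_SIZE);
    kunmap_local(addr);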
2025-12-07mm/truncate: unmap large folio on split failureKiryl Shutsemau1-1/+26
commit fa04f5b60fda62c98a53a60de3a1e763f11feb41 upstream. Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are supposed to generate SIGBUS. This behavior might not be respected on truncation. During truncation, the kernel splits a large folio in order to reclaim memory. As a side effect, it unmaps the folio and destroys PMD mappings of the folio. The folio will be refaulted as PTEs and SIGBUS semantics are preserved. However, if the split fails, PMD mappings are preserved and the user will not receive SIGBUS on any accesses within the PMD. Unmap the folio on split failure. It will lead to refault as PTEs and preserve SIGBUS semantics. Make an exception for shmem/tmpfs that for long time intentionally mapped with PMDs across i_size. Link: https://lkml.kernel.org/r/20251027115636.82382-3-kirill@shutemov.name Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios") Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Christian Brauner <brauner@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
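The gist of the fix, as a sketch rather than the literal hunk (the exact placement of the shmem check is an assumption):

    err = split_folio(folio);
    if (err) {
            /*
             * The split failed, so PMD mappings of the folio survive and
             * would hide the expected SIGBUS beyond i_size.  Unmap the
             * folio so later accesses refault it as PTEs.  shmem/tmpfs is
             * left alone: it has long intentionally kept PMD mappings
             * across i_size.
             */
            if (!shmem_mapping(folio->mapping))
                    unmap_mapping_folio(folio);
    }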
2025-12-07mm/mm_init: fix hash table order logging in alloc_large_system_hash()Isaac J. Manjarres1-1/+1
[ Upstream commit 0d6c356dd6547adac2b06b461528e3573f52d953 ] When emitting the order of the allocation for a hash table, alloc_large_system_hash() unconditionally subtracts PAGE_SHIFT from log base 2 of the allocation size. This is not correct if the allocation size is smaller than a page, and yields a negative value for the order as seen below: TCP established hash table entries: 32 (order: -4, 256 bytes, linear) TCP bind hash table entries: 32 (order: -2, 1024 bytes, linear) Use get_order() to compute the order when emitting the hash table information to correctly handle cases where the allocation size is smaller than a page: TCP established hash table entries: 32 (order: 0, 256 bytes, linear) TCP bind hash table entries: 32 (order: 0, 1024 bytes, linear) Link: https://lkml.kernel.org/r/20251028191020.413002-1-isaacmanjarres@google.com Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> (cherry picked from commit 0d6c356dd6547adac2b06b461528e3573f52d953) Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
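The reported order is derived from the allocation size; with get_order() a sub-page allocation is reported as order 0 instead of a negative number. Simplified before/after (the actual pr_info() in alloc_large_system_hash() prints a few more fields):

    /* before: goes negative when size < PAGE_SIZE */
    order = ilog2(size) - PAGE_SHIFT;

    /* after: get_order() rounds up and never returns less than 0 */
    order = get_order(size);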
2025-12-07mm/secretmem: fix use-after-free race in fault handlerLance Yang1-1/+1
[ Upstream commit 6f86d0534fddfbd08687fa0f01479d4226bc3c3d ] When a page fault occurs in a secret memory file created with `memfd_secret(2)`, the kernel will allocate a new page for it, mark the underlying page as not-present in the direct map, and add it to the file mapping. If two tasks cause a fault in the same page concurrently, both could end up allocating a page and removing the page from the direct map, but only one would succeed in adding the page to the file mapping. The task that failed undoes the effects of its attempt by (a) freeing the page again and (b) putting the page back into the direct map. However, by doing these two operations in this order, the page becomes available to the allocator again before it is placed back in the direct mapping. If another task attempts to allocate the page between (a) and (b), and the kernel tries to access it via the direct map, it would result in a supervisor not-present page fault. Fix the ordering to restore the direct map before the page is freed. Link: https://lkml.kernel.org/r/20251031120955.92116-1-lance.yang@linux.dev Fixes: 1507f51255c9 ("mm: introduce memfd_secret system call to create "secret" memory areas") Signed-off-by: Lance Yang <lance.yang@linux.dev> Reported-by: Google Big Sleep <big-sleep-vuln-reports@google.com> Closes: https://lore.kernel.org/linux-mm/CAEXGt5QeDpiHTu3K9tvjUTPqo+d-=wuCNYPa+6sWKrdQJ-ATdg@mail.gmail.com/ Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [rppt: replaced folio with page in the patch and in the changelog] Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
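A sketch of the corrected error path in the fault handler, simplified from the description above (the surrounding control flow is condensed):

    err = filemap_add_folio(mapping, folio, offset, gfp);
    if (unlikely(err)) {
            /*
             * Restore the direct map entry before the page can be freed,
             * so the allocator never hands out a page that is still
             * not-present in the direct map.
             */
            set_direct_map_default_noflush(page);
            folio_put(folio);
            if (err == -EEXIST)
                    goto retry;             /* another task won the race */
            return vmf_error(err);
    }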
2025-12-07mm, percpu: do not consider sleepable allocations atomicMichal Hocko1-1/+7
[ Upstream commit 9a5b183941b52f84c0f9e5f27ce44e99318c9e0f ] 28307d938fb2 ("percpu: make pcpu_alloc() aware of current gfp context") has fixed a reclaim recursion for scoped GFP_NOFS context. It has done that by avoiding taking pcpu_alloc_mutex. This is a correct solution as the worker context with full GFP_KERNEL allocation/reclaim power and which is using the same lock cannot block the NOFS pcpu_alloc caller. On the other hand this is a very conservative approach that could lead to failures because pcpu_alloc lockless implementation is quite limited. We have a bug report about premature failures when scsi array of 193 devices is scanned. Sometimes (not consistently) the scanning aborts because the iscsid daemon fails to create the queue for a random scsi device during the scan. iscsid itself is running with PR_SET_IO_FLUSHER set so all allocations from this process context are GFP_NOIO. This in turn makes any pcpu_alloc lockless (without pcpu_alloc_mutex) which leads to pre-mature failures. It has turned out that iscsid has worked around this by dropping PR_SET_IO_FLUSHER (https://github.com/open-iscsi/open-iscsi/pull/382) when scanning host. But we can do better in this case on the kernel side and use pcpu_alloc_mutex for NOIO resp. NOFS constrained allocation scopes too. We just need the WQ worker to never trigger IO/FS reclaim. Achieve that by enforcing scoped GFP_NOIO for the whole execution of pcpu_balance_workfn (this will imply NOFS constrain as well). This will remove the dependency chain and preserve the full allocation power of the pcpu_alloc call. While at it make is_atomic really test for blockable allocations. Link: https://lkml.kernel.org/r/20250206122633.167896-1-mhocko@kernel.org Fixes: 28307d938fb2 ("percpu: make pcpu_alloc() aware of current gfp context") Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dennis Zhou <dennis@kernel.org> Cc: Filipe David Manana <fdmanana@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: chenxin <chenxinxin@xiaomi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
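Concretely, the worker runs under a scoped NOIO constraint (which implies NOFS), and is_atomic in pcpu_alloc() becomes a check for !gfpflags_allow_blocking(gfp). A sketch of the worker side, assuming the usual memalloc scope API:

    static void pcpu_balance_workfn(struct work_struct *work)
    {
            unsigned int noio_flags;

            /*
             * The worker shares pcpu_alloc_mutex with pcpu_alloc() callers.
             * Running it under a NOIO scope guarantees it never recurses
             * into IO/FS reclaim, so NOIO/NOFS callers may safely take the
             * mutex instead of the limited lockless path.
             */
            noio_flags = memalloc_noio_save();
            /* ... existing chunk balancing work ... */
            memalloc_noio_restore(noio_flags);
    }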
2025-12-07fs: factor out a direct_write_fallback helperChristoph Hellwig1-47/+14
commit 44fff0fa08ec5a6d9d5fb05443a36d854d0ece4d upstream. Add a helper dealing with handling the syncing of a buffered write fallback for direct I/O. Link: https://lkml.kernel.org/r/20230601145904.1385409-10-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Hannes Reinecke <hare@suse.de> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [backing_dev_info still being used here. do small changes to the patch to keep the out label. Which means replacing all returns to goto out.] Signed-off-by: Mahmoud Adam <mngyadam@amazon.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
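The helper centralizes the "direct I/O fell back to buffered" epilogue: report the buffered error only if the direct part made no progress, otherwise write back and drop the buffered range so O_DIRECT semantics are preserved. A condensed sketch, not the verbatim helper (error handling is simplified):

    ssize_t direct_write_fallback(struct kiocb *iocb, struct iov_iter *iter,
                    ssize_t direct_written, ssize_t buffered_written)
    {
            struct address_space *mapping = iocb->ki_filp->f_mapping;
            loff_t pos = iocb->ki_pos - buffered_written;
            loff_t end = iocb->ki_pos - 1;
            int err;

            if (unlikely(buffered_written < 0))
                    return direct_written ? direct_written : buffered_written;

            /*
             * Write back and drop the pages touched by the buffered fallback
             * so the file does not stay in a mixed cached/uncached state.
             */
            err = filemap_write_and_wait_range(mapping, pos, end);
            if (err)
                    return direct_written ?: err;
            invalidate_mapping_pages(mapping, pos >> PAGE_SHIFT, end >> PAGE_SHIFT);

            return direct_written + buffered_written;
    }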
2025-12-07filemap: update ki_pos in generic_perform_writeChristoph Hellwig1-4/+4
commit 182c25e9c157f37bd0ab5a82fe2417e2223df459 upstream. All callers of generic_perform_write need to updated ki_pos, move it into common code. Link: https://lkml.kernel.org/r/20230601145904.1385409-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Theodore Ts'o <tytso@mit.edu> Acked-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mahmoud Adam <mngyadam@amazon.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
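With this, callers no longer adjust the file position themselves; the tail of generic_perform_write() becomes, roughly:

    if (!written)
            return status;          /* nothing copied: report the error */
    iocb->ki_pos += written;        /* advance the position in one place */
    return written;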
2025-12-07filemap: add a kiocb_invalidate_post_direct_write helperChristoph Hellwig1-17/+20
commit c402a9a9430b670926decbb284b756ee6f47c1ec upstream. Add a helper to invalidate page cache after a dio write. Link: https://lkml.kernel.org/r/20230601145904.1385409-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mahmoud Adam <mngyadam@amazon.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
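A sketch of the helper (the exact signature and details are assumptions based on the description above):

    void kiocb_invalidate_post_direct_write(struct kiocb *iocb, loff_t count)
    {
            struct address_space *mapping = iocb->ki_filp->f_mapping;

            /* Drop any page cache the dio write made stale; warn if we cannot. */
            if (mapping->nrpages &&
                invalidate_inode_pages2_range(mapping,
                            iocb->ki_pos >> PAGE_SHIFT,
                            (iocb->ki_pos + count - 1) >> PAGE_SHIFT))
                    dio_warn_stale_pagecache(iocb->ki_filp);
    }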
2025-12-07filemap: add a kiocb_invalidate_pages helperChristoph Hellwig1-20/+28
commit e003f74afbd2feadbb9ffbf9135e2d2fb5d320a5 upstream. Factor out a helper that calls filemap_write_and_wait_range and invalidate_inode_pages2_range for the range covered by a write kiocb or returns -EAGAIN if the kiocb is marked as nowait and there would be pages to write or invalidate. Link: https://lkml.kernel.org/r/20230601145904.1385409-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Chao Yu <chao@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Mahmoud Adam <mngyadam@amazon.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
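A condensed sketch of the factored-out helper as described above (signature and details are assumptions):

    int kiocb_invalidate_pages(struct kiocb *iocb, size_t count)
    {
            struct address_space *mapping = iocb->ki_filp->f_mapping;
            loff_t pos = iocb->ki_pos;
            loff_t end = pos + count - 1;
            int ret;

            if (iocb->ki_flags & IOCB_NOWAIT) {
                    /* Writing back or invalidating pages could block. */
                    if (filemap_range_has_page(mapping, pos, end))
                            return -EAGAIN;
            } else {
                    ret = filemap_write_and_wait_range(mapping, pos, end);
                    if (ret)
                            return ret;
            }

            /* Drop cached pages so subsequent reads go to disk. */
            return invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
                                                 end >> PAGE_SHIFT);
    }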
2025-10-19mm/hugetlb: early exit from hugetlb_pages_alloc_boot() when max_huge_pages=0Li RongQing1-0/+3
commit b322e88b3d553e85b4e15779491c70022783faa4 upstream. Optimize hugetlb_pages_alloc_boot() to return immediately when max_huge_pages is 0, avoiding unnecessary CPU cycles and the below log message when hugepages aren't configured in the kernel command line. [ 3.702280] HugeTLB: allocation took 0ms with hugepage_allocation_threads=32 Link: https://lkml.kernel.org/r/20250814102333.4428-1-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Tested-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
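The early exit is as simple as it sounds; a sketch:

    static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
    {
            /*
             * Nothing was requested on the command line: skip the allocation
             * machinery and the "allocation took ...ms" report entirely.
             */
            if (!h->max_huge_pages)
                    return 0;

            /* ... existing multi-threaded boot-time allocation ... */
    }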
2025-10-19mm/page_alloc: only set ALLOC_HIGHATOMIC for __GFP_HIGH allocationsThadeu Lima de Souza Cascardo1-1/+1
commit 6a204d4b14c99232e05d35305c27ebce1c009840 upstream. Commit 524c48072e56 ("mm/page_alloc: rename ALLOC_HIGH to ALLOC_MIN_RESERVE") is the start of a series that explains how __GFP_HIGH, which implies ALLOC_MIN_RESERVE, is going to be used instead of __GFP_ATOMIC for high atomic reserves. Commit eb2e2b425c69 ("mm/page_alloc: explicitly record high-order atomic allocations in alloc_flags") introduced ALLOC_HIGHATOMIC for such allocations of order higher than 0. It still used __GFP_ATOMIC, though. Then, commit 1ebbb21811b7 ("mm/page_alloc: explicitly define how __GFP_HIGH non-blocking allocations accesses reserves") just turned that check for !__GFP_DIRECT_RECLAIM, ignoring that high atomic reserves were expected to test for __GFP_HIGH. This leads to high atomic reserves being added for high-order GFP_NOWAIT allocations and others that clear __GFP_DIRECT_RECLAIM, which is unexpected. Later, those reserves lead to 0-order allocations going to the slow path and starting reclaim. From /proc/pagetypeinfo, without the patch: Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA32, type HighAtomic 1 8 10 9 7 3 0 0 0 0 0 Node 0, zone Normal, type HighAtomic 64 20 12 5 0 0 0 0 0 0 0 With the patch: Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA32, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Link: https://lkml.kernel.org/r/20250814172245.1259625-1-cascardo@igalia.com Fixes: 1ebbb21811b7 ("mm/page_alloc: explicitly define how __GFP_HIGH non-blocking allocations accesses reserves") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Tested-by: Helen Koike <koike@igalia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: NeilBrown <neilb@suse.de> Cc: Thierry Reding <thierry.reding@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
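The one-line change gates the reserve on the flag that was always meant to control it, roughly (surrounding context in gfp_to_alloc_flags() simplified):

    /* before: any non-blocking high-order allocation grew the reserve */
    if (order > 0)
            alloc_flags |= ALLOC_HIGHATOMIC;

    /* after: only __GFP_HIGH callers are allowed to grow the reserve */
    if (order > 0 && (gfp_mask & __GFP_HIGH))
            alloc_flags |= ALLOC_HIGHATOMIC;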
2025-10-15mm: hugetlb: avoid soft lockup when mprotect to large memory areaYang Shi1-0/+2
commit f52ce0ea90c83a28904c7cc203a70e6434adfecb upstream. When calling mprotect() to a large hugetlb memory area in our customer's workload (~300GB hugetlb memory), soft lockup was observed: watchdog: BUG: soft lockup - CPU#98 stuck for 23s! [t2_new_sysv:126916] CPU: 98 PID: 126916 Comm: t2_new_sysv Kdump: loaded Not tainted 6.17-rc7 Hardware name: GIGACOMPUTING R2A3-T40-AAV1/Jefferson CIO, BIOS 5.4.4.1 07/15/2025 pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : mte_clear_page_tags+0x14/0x24 lr : mte_sync_tags+0x1c0/0x240 sp : ffff80003150bb80 x29: ffff80003150bb80 x28: ffff00739e9705a8 x27: 0000ffd2d6a00000 x26: 0000ff8e4bc00000 x25: 00e80046cde00f45 x24: 0000000000022458 x23: 0000000000000000 x22: 0000000000000004 x21: 000000011b380000 x20: ffff000000000000 x19: 000000011b379f40 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc875e0aa5e2c x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 x5 : fffffc01ce7a5c00 x4 : 00000000046cde00 x3 : fffffc0000000000 x2 : 0000000000000004 x1 : 0000000000000040 x0 : ffff0046cde7c000 Call trace:   mte_clear_page_tags+0x14/0x24   set_huge_pte_at+0x25c/0x280   hugetlb_change_protection+0x220/0x430   change_protection+0x5c/0x8c   mprotect_fixup+0x10c/0x294   do_mprotect_pkey.constprop.0+0x2e0/0x3d4   __arm64_sys_mprotect+0x24/0x44   invoke_syscall+0x50/0x160   el0_svc_common+0x48/0x144   do_el0_svc+0x30/0xe0   el0_svc+0x30/0xf0   el0t_64_sync_handler+0xc4/0x148   el0t_64_sync+0x1a4/0x1a8 Soft lockup is not triggered with THP or base page because there is cond_resched() called for each PMD size. Although the soft lockup was triggered by MTE, it should be not MTE specific. The other processing which takes long time in the loop may trigger soft lockup too. So add cond_resched() for hugetlb to avoid soft lockup. Link: https://lkml.kernel.org/r/20250929202402.1663290-1-yang@os.amperecomputing.com Fixes: 8f860591ffb2 ("[PATCH] Enable mprotect on huge pages") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Tested-by: Carl Worth <carl@os.amperecomputing.com> Reviewed-by: Christoph Lameter (Ampere) <cl@gentwo.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
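The fix mirrors what the base-page and THP paths already do once per PMD: reschedule on each iteration of the loop, roughly:

    /* inside hugetlb_change_protection()'s per-huge-page loop */
    for (; address < end; address += psize) {       /* psize: huge page step */
            /* ... change protection for one huge page ... */

            /*
             * Tag clearing (MTE) or other per-page work can be slow and the
             * range may span hundreds of GB; give the scheduler a chance on
             * every iteration.
             */
            cond_resched();
    }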
2025-10-02minmax: make generic MIN() and MAX() macros available everywhereLinus Torvalds1-1/+0
[ Upstream commit 1a251f52cfdc417c84411a056bc142cbd77baef4 ] This just standardizes the use of MIN() and MAX() macros, with the very traditional semantics. The goal is to use these for C constant expressions and for top-level / static initializers, and so be able to simplify the min()/max() macros. These macro names were used by various kernel code - they are very traditional, after all - and all such users have been fixed up, with a few different approaches: - trivial duplicated macro definitions have been removed Note that 'trivial' here means that it's obviously kernel code that already included all the major kernel headers, and thus gets the new generic MIN/MAX macros automatically. - non-trivial duplicated macro definitions are guarded with #ifndef This is the "yes, they define their own versions, but no, the include situation is not entirely obvious, and maybe they don't get the generic version automatically" case. - strange use case #1 A couple of drivers decided that the way they want to describe their versioning is with #define MAJ 1 #define MIN 2 #define DRV_VERSION __stringify(MAJ) "." __stringify(MIN) which adds zero value and I just did my Alexander the Great impersonation, and rewrote that pointless Gordian knot as #define DRV_VERSION "1.2" instead. - strange use case #2 A couple of drivers thought that it's a good idea to have a random 'MIN' or 'MAX' define for a value or index into a table, rather than the traditional macro that takes arguments. These values were re-written as C enum's instead. The new function-line macros only expand when followed by an open parenthesis, and thus don't clash with enum use. Happily, there weren't really all that many of these cases, and a lot of users already had the pattern of using '#ifndef' guarding (or in one case just using '#undef MIN') before defining their own private version that does the same thing. I left such cases alone. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Eliav Farber <farbere@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
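For mm/ this is just the removal of one duplicated local definition. Semantically, the now-global macros keep the traditional argument-repeating behavior, conceptually equivalent to:

    /* conceptual expansion; the kernel derives these from its __cmp() helper,
     * and they are intended for constant expressions and static initializers */
    #define MIN(a, b)       ((a) < (b) ? (a) : (b))
    #define MAX(a, b)       ((a) > (b) ? (a) : (b))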
2025-10-02kmsan: fix out-of-bounds access to shadow memoryEric Biggers2-3/+23
[ Upstream commit 85e1ff61060a765d91ee62dc5606d4d547d9d105 ] Running sha224_kunit on a KMSAN-enabled kernel results in a crash in kmsan_internal_set_shadow_origin(): BUG: unable to handle page fault for address: ffffbc3840291000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0 Oops: 0000 [#1] SMP NOPTI CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G N 6.17.0-rc3 #10 PREEMPT(voluntary) Tainted: [N]=TEST Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014 RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100 [...] Call Trace: <TASK> __msan_memset+0xee/0x1a0 sha224_final+0x9e/0x350 test_hash_buffer_overruns+0x46f/0x5f0 ? kmsan_get_shadow_origin_ptr+0x46/0xa0 ? __pfx_test_hash_buffer_overruns+0x10/0x10 kunit_try_run_case+0x198/0xa00 This occurs when memset() is called on a buffer that is not 4-byte aligned and extends to the end of a guard page, i.e. the next page is unmapped. The bug is that the loop at the end of kmsan_internal_set_shadow_origin() accesses the wrong shadow memory bytes when the address is not 4-byte aligned. Since each 4 bytes are associated with an origin, it rounds the address and size so that it can access all the origins that contain the buffer. However, when it checks the corresponding shadow bytes for a particular origin, it incorrectly uses the original unrounded shadow address. This results in reads from shadow memory beyond the end of the buffer's shadow memory, which crashes when that memory is not mapped. To fix this, correctly align the shadow address before accessing the 4 shadow bytes corresponding to each origin. Link: https://lkml.kernel.org/r/20250911195858.394235-1-ebiggers@kernel.org Fixes: 2ef3cec44c60 ("kmsan: do not wipe out origin when doing partial unpoisoning") Signed-off-by: Eric Biggers <ebiggers@kernel.org> Tested-by: Alexander Potapenko <glider@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Adjust context in tests ] Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
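The core of the fix is only the alignment of the shadow pointer used in that final loop; conceptually (not the upstream diff):

    /*
     * Each u32 origin slot describes a 4-byte-aligned group of shadow bytes.
     * Round the shadow address down to that boundary before the per-slot
     * check, so the last iteration cannot read shadow bytes beyond the
     * (rounded) buffer, which may be unmapped.
     */
    shadow = (u32 *)ALIGN_DOWN((unsigned long)shadow, KMSAN_ORIGIN_SIZE);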