summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2025-05-29mm: vmalloc: only zero-init on vrealloc shrinkKees Cook1-5/+7
commit 70d1eb031a68cbde4eed8099674be21778441c94 upstream. The common case is to grow reallocations, and since init_on_alloc will have already zeroed the whole allocation, we only need to zero when shrinking the allocation. Link: https://lkml.kernel.org/r/20250515214217.619685-2-kees@kernel.org Fixes: a0309faf1cb0 ("mm: vmalloc: support more granular vrealloc() sizing") Signed-off-by: Kees Cook <kees@kernel.org> Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: "Erhard F." <erhard_f@mailbox.org> Cc: Shung-Hsi Yu <shung-hsi.yu@suse.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-29mm: vmalloc: actually use the in-place vrealloc regionKees Cook1-0/+1
commit f7a35a3c36d1e36059c5654737d9bee3454f01a3 upstream. Patch series "mm: vmalloc: Actually use the in-place vrealloc region". This fixes a performance regression[1] with vrealloc()[1]. The refactoring to not build a new vmalloc region only actually worked when shrinking. Actually return the resized area when it grows. Ugh. Link: https://lkml.kernel.org/r/20250515214217.619685-1-kees@kernel.org Fixes: a0309faf1cb0 ("mm: vmalloc: support more granular vrealloc() sizing") Signed-off-by: Kees Cook <kees@kernel.org> Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Closes: https://lore.kernel.org/all/20250515-bpf-verifier-slowdown-vwo2meju4cgp2su5ckj@6gi6ssxbnfqg [1] Tested-by: Eduard Zingerman <eddyz87@gmail.com> Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Tested-by: Shung-Hsi Yu <shung-hsi.yu@suse.com> Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Reviewed-by: Danilo Krummrich <dakr@kernel.org> Cc: "Erhard F." <erhard_f@mailbox.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-29mm/page_alloc.c: avoid infinite retries caused by cpuset raceTianyang Zhang1-0/+8
commit e05741fb10c38d70bbd7ec12b23c197b6355d519 upstream. __alloc_pages_slowpath has no change detection for ac->nodemask in the part of retry path, while cpuset can modify it in parallel. For some processes that set mempolicy as MPOL_BIND, this results ac->nodemask changes, and then the should_reclaim_retry will judge based on the latest nodemask and jump to retry, while the get_page_from_freelist only traverses the zonelist from ac->preferred_zoneref, which selected by a expired nodemask and may cause infinite retries in some cases cpu 64: __alloc_pages_slowpath { /* ..... */ retry: /* ac->nodemask = 0x1, ac->preferred->zone->nid = 1 */ if (alloc_flags & ALLOC_KSWAPD) wake_all_kswapds(order, gfp_mask, ac); /* cpu 1: cpuset_write_resmask update_nodemask update_nodemasks_hier update_tasks_nodemask mpol_rebind_task mpol_rebind_policy mpol_rebind_nodemask // mempolicy->nodes has been modified, // which ac->nodemask point to */ /* ac->nodemask = 0x3, ac->preferred->zone->nid = 1 */ if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags, did_some_progress > 0, &no_progress_loops)) goto retry; } Simultaneously starting multiple cpuset01 from LTP can quickly reproduce this issue on a multi node server when the maximum memory pressure is reached and the swap is enabled Link: https://lkml.kernel.org/r/20250416082405.20988-1-zhangtianyang@loongson.cn Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice") Signed-off-by: Tianyang Zhang <zhangtianyang@loongson.cn> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-29mm/hugetlb: fix kernel NULL pointer dereference when replacing free hugetlb ↵Ge Yang1-0/+8
folios commit 113ed54ad276c352ee5ce109bdcf0df118a43bda upstream. A kernel crash was observed when replacing free hugetlb folios: BUG: kernel NULL pointer dereference, address: 0000000000000028 PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 28 UID: 0 PID: 29639 Comm: test_cma.sh Tainted 6.15.0-rc6-zp #41 PREEMPT(voluntary) RIP: 0010:alloc_and_dissolve_hugetlb_folio+0x1d/0x1f0 RSP: 0018:ffffc9000b30fa90 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000342cca RCX: ffffea0043000000 RDX: ffffc9000b30fb08 RSI: ffffea0043000000 RDI: 0000000000000000 RBP: ffffc9000b30fb20 R08: 0000000000001000 R09: 0000000000000000 R10: ffff88886f92eb00 R11: 0000000000000000 R12: ffffea0043000000 R13: 0000000000000000 R14: 00000000010c0200 R15: 0000000000000004 FS: 00007fcda5f14740(0000) GS:ffff8888ec1d8000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000028 CR3: 0000000391402000 CR4: 0000000000350ef0 Call Trace: <TASK> replace_free_hugepage_folios+0xb6/0x100 alloc_contig_range_noprof+0x18a/0x590 ? srso_return_thunk+0x5/0x5f ? down_read+0x12/0xa0 ? srso_return_thunk+0x5/0x5f cma_range_alloc.constprop.0+0x131/0x290 __cma_alloc+0xcf/0x2c0 cma_alloc_write+0x43/0xb0 simple_attr_write_xsigned.constprop.0.isra.0+0xb2/0x110 debugfs_attr_write+0x46/0x70 full_proxy_write+0x62/0xa0 vfs_write+0xf8/0x420 ? srso_return_thunk+0x5/0x5f ? filp_flush+0x86/0xa0 ? srso_return_thunk+0x5/0x5f ? filp_close+0x1f/0x30 ? srso_return_thunk+0x5/0x5f ? do_dup2+0xaf/0x160 ? srso_return_thunk+0x5/0x5f ksys_write+0x65/0xe0 do_syscall_64+0x64/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e There is a potential race between __update_and_free_hugetlb_folio() and replace_free_hugepage_folios(): CPU1 CPU2 __update_and_free_hugetlb_folio replace_free_hugepage_folios folio_test_hugetlb(folio) -- It's still hugetlb folio. __folio_clear_hugetlb(folio) hugetlb_free_folio(folio) h = folio_hstate(folio) -- Here, h is NULL pointer When the above race condition occurs, folio_hstate(folio) returns NULL, and subsequent access to this NULL pointer will cause the system to crash. To resolve this issue, execute folio_hstate(folio) under the protection of the hugetlb_lock lock, ensuring that folio_hstate(folio) does not return NULL. Link: https://lkml.kernel.org/r/1747884137-26685-1-git-send-email-yangge1116@126.com Fixes: 04f13d241b8b ("mm: replace free hugepage folios after migration") Signed-off-by: Ge Yang <yangge1116@126.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-29memcg: always call cond_resched() after fn()Breno Leitao1-4/+2
commit 06717a7b6c86514dbd6ab322e8083ffaa4db5712 upstream. I am seeing soft lockup on certain machine types when a cgroup OOMs. This is happening because killing the process in certain machine might be very slow, which causes the soft lockup and RCU stalls. This happens usually when the cgroup has MANY processes and memory.oom.group is set. Example I am seeing in real production: [462012.244552] Memory cgroup out of memory: Killed process 3370438 (crosvm) .... .... [462037.318059] Memory cgroup out of memory: Killed process 4171372 (adb) .... [462037.348314] watchdog: BUG: soft lockup - CPU#64 stuck for 26s! [stat_manager-ag:1618982] .... Quick look at why this is so slow, it seems to be related to serial flush for certain machine types. For all the crashes I saw, the target CPU was at console_flush_all(). In the case above, there are thousands of processes in the cgroup, and it is soft locking up before it reaches the 1024 limit in the code (which would call the cond_resched()). So, cond_resched() in 1024 blocks is not sufficient. Remove the counter-based conditional rescheduling logic and call cond_resched() unconditionally after each task iteration, after fn() is called. This avoids the lockup independently of how slow fn() is. Link: https://lkml.kernel.org/r/20250523-memcg_fix-v1-1-ad3eafb60477@debian.org Fixes: ade81479c7dd ("memcg: fix soft lockup in the OOM process") Signed-off-by: Breno Leitao <leitao@debian.org> Suggested-by: Rik van Riel <riel@surriel.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Michael van der Westhuizen <rmikey@meta.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Chen Ridong <chenridong@huawei.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-29kasan: avoid sleepable page allocation from atomic contextAlexander Gordeev1-14/+78
commit b6ea95a34cbd014ab6ade4248107b86b0aaf2d6c upstream. apply_to_pte_range() enters the lazy MMU mode and then invokes kasan_populate_vmalloc_pte() callback on each page table walk iteration. However, the callback can go into sleep when trying to allocate a single page, e.g. if an architecutre disables preemption on lazy MMU mode enter. On s390 if make arch_enter_lazy_mmu_mode() -> preempt_enable() and arch_leave_lazy_mmu_mode() -> preempt_disable(), such crash occurs: [ 0.663336] BUG: sleeping function called from invalid context at ./include/linux/sched/mm.h:321 [ 0.663348] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 2, name: kthreadd [ 0.663358] preempt_count: 1, expected: 0 [ 0.663366] RCU nest depth: 0, expected: 0 [ 0.663375] no locks held by kthreadd/2. [ 0.663383] Preemption disabled at: [ 0.663386] [<0002f3284cbb4eda>] apply_to_pte_range+0xfa/0x4a0 [ 0.663405] CPU: 0 UID: 0 PID: 2 Comm: kthreadd Not tainted 6.15.0-rc5-gcc-kasan-00043-gd76bb1ebb558-dirty #162 PREEMPT [ 0.663408] Hardware name: IBM 3931 A01 701 (KVM/Linux) [ 0.663409] Call Trace: [ 0.663410] [<0002f3284c385f58>] dump_stack_lvl+0xe8/0x140 [ 0.663413] [<0002f3284c507b9e>] __might_resched+0x66e/0x700 [ 0.663415] [<0002f3284cc4f6c0>] __alloc_frozen_pages_noprof+0x370/0x4b0 [ 0.663419] [<0002f3284ccc73c0>] alloc_pages_mpol+0x1a0/0x4a0 [ 0.663421] [<0002f3284ccc8518>] alloc_frozen_pages_noprof+0x88/0xc0 [ 0.663424] [<0002f3284ccc8572>] alloc_pages_noprof+0x22/0x120 [ 0.663427] [<0002f3284cc341ac>] get_free_pages_noprof+0x2c/0xc0 [ 0.663429] [<0002f3284cceba70>] kasan_populate_vmalloc_pte+0x50/0x120 [ 0.663433] [<0002f3284cbb4ef8>] apply_to_pte_range+0x118/0x4a0 [ 0.663435] [<0002f3284cbc7c14>] apply_to_pmd_range+0x194/0x3e0 [ 0.663437] [<0002f3284cbc99be>] __apply_to_page_range+0x2fe/0x7a0 [ 0.663440] [<0002f3284cbc9e88>] apply_to_page_range+0x28/0x40 [ 0.663442] [<0002f3284ccebf12>] kasan_populate_vmalloc+0x82/0xa0 [ 0.663445] [<0002f3284cc1578c>] alloc_vmap_area+0x34c/0xc10 [ 0.663448] [<0002f3284cc1c2a6>] __get_vm_area_node+0x186/0x2a0 [ 0.663451] [<0002f3284cc1e696>] __vmalloc_node_range_noprof+0x116/0x310 [ 0.663454] [<0002f3284cc1d950>] __vmalloc_node_noprof+0xd0/0x110 [ 0.663457] [<0002f3284c454b88>] alloc_thread_stack_node+0xf8/0x330 [ 0.663460] [<0002f3284c458d56>] dup_task_struct+0x66/0x4d0 [ 0.663463] [<0002f3284c45be90>] copy_process+0x280/0x4b90 [ 0.663465] [<0002f3284c460940>] kernel_clone+0xd0/0x4b0 [ 0.663467] [<0002f3284c46115e>] kernel_thread+0xbe/0xe0 [ 0.663469] [<0002f3284c4e440e>] kthreadd+0x50e/0x7f0 [ 0.663472] [<0002f3284c38c04a>] __ret_from_fork+0x8a/0xf0 [ 0.663475] [<0002f3284ed57ff2>] ret_from_fork+0xa/0x38 Instead of allocating single pages per-PTE, bulk-allocate the shadow memory prior to applying kasan_populate_vmalloc_pte() callback on a page range. Link: https://lkml.kernel.org/r/c61d3560297c93ed044f0b1af085610353a06a58.1747316918.git.agordeev@linux.ibm.com Fixes: 3c5c3cfb9ef4 ("kasan: support backing vmalloc space with real shadow memory") Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Suggested-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Daniel Axtens <dja@axtens.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-22mm/page_alloc: fix race condition in unaccepted memory handlingKirill A. Shutemov1-23/+0
commit fefc075182275057ce607effaa3daa9e6e3bdc73 upstream. The page allocator tracks the number of zones that have unaccepted memory using static_branch_enc/dec() and uses that static branch in hot paths to determine if it needs to deal with unaccepted memory. Borislav and Thomas pointed out that the tracking is racy: operations on static_branch are not serialized against adding/removing unaccepted pages to/from the zone. Sanity checks inside static_branch machinery detects it: WARNING: CPU: 0 PID: 10 at kernel/jump_label.c:276 __static_key_slow_dec_cpuslocked+0x8e/0xa0 The comment around the WARN() explains the problem: /* * Warn about the '-1' case though; since that means a * decrement is concurrent with a first (0->1) increment. IOW * people are trying to disable something that wasn't yet fully * enabled. This suggests an ordering problem on the user side. */ The effect of this static_branch optimization is only visible on microbenchmark. Instead of adding more complexity around it, remove it altogether. Link: https://lkml.kernel.org/r/20250506133207.1009676-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory") Link: https://lore.kernel.org/all/20250506092445.GBaBnVXXyvnazly6iF@fat_crate.local Reported-by: Borislav Petkov <bp@alien8.de> Tested-by: Borislav Petkov (AMD) <bp@alien8.de> Reported-by: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-22mm: userfaultfd: correct dirty flags set for both present and swap pteBarry Song1-2/+10
commit 75cb1cca2c880179a11c7dd9380b6f14e41a06a4 upstream. As David pointed out, what truly matters for mremap and userfaultfd move operations is the soft dirty bit. The current comment and implementation—which always sets the dirty bit for present PTEs and fails to set the soft dirty bit for swap PTEs—are incorrect. This could break features like Checkpoint-Restore in Userspace (CRIU). This patch updates the behavior to correctly set the soft dirty bit for both present and swap PTEs in accordance with mremap. Link: https://lkml.kernel.org/r/20250508220912.7275-1-21cnbao@gmail.com Fixes: adef440691ba ("userfaultfd: UFFDIO_MOVE uABI") Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reported-by: David Hildenbrand <david@redhat.com> Closes: https://lore.kernel.org/linux-mm/02f14ee1-923f-47e3-a994-4950afb9afcc@redhat.com/ Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-22mm: hugetlb: fix incorrect fallback for subpoolWupeng Ma1-6/+22
commit a833a693a490ecff8ba377654c6d4d333718b6b1 upstream. During our testing with hugetlb subpool enabled, we observe that hstate->resv_huge_pages may underflow into negative values. Root cause analysis reveals a race condition in subpool reservation fallback handling as follow: hugetlb_reserve_pages() /* Attempt subpool reservation */ gbl_reserve = hugepage_subpool_get_pages(spool, chg); /* Global reservation may fail after subpool allocation */ if (hugetlb_acct_memory(h, gbl_reserve) < 0) goto out_put_pages; out_put_pages: /* This incorrectly restores reservation to subpool */ hugepage_subpool_put_pages(spool, chg); When hugetlb_acct_memory() fails after subpool allocation, the current implementation over-commits subpool reservations by returning the full 'chg' value instead of the actual allocated 'gbl_reserve' amount. This discrepancy propagates to global reservations during subsequent releases, eventually causing resv_huge_pages underflow. This problem can be trigger easily with the following steps: 1. reverse hugepage for hugeltb allocation 2. mount hugetlbfs with min_size to enable hugetlb subpool 3. alloc hugepages with two task(make sure the second will fail due to insufficient amount of hugepages) 4. with for a few seconds and repeat step 3 which will make hstate->resv_huge_pages to go below zero. To fix this problem, return corrent amount of pages to subpool during the fallback after hugepage_subpool_get_pages is called. Link: https://lkml.kernel.org/r/20250410062633.3102457-1-mawupeng1@huawei.com Fixes: 1c5ecae3a93f ("hugetlbfs: add minimum size accounting to subpools") Signed-off-by: Wupeng Ma <mawupeng1@huawei.com> Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Ma Wupeng <mawupeng1@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18mm: page_alloc: speed up fallbacks in rmqueue_bulk()Johannes Weiner1-34/+81
commit 90abee6d7895d5eef18c91d870d8168be4e76e9d upstream. The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") as the root cause of a 56.4% regression in vm-scalability::lru-file-mmap-read. Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion"), as the root cause for a regression in worst-case zone->lock+irqoff hold times. Both of these patches modify the page allocator's fallback path to be less greedy in an effort to stave off fragmentation. The flip side of this is that fallbacks are also less productive each time around, which means the fallback search can run much more frequently. Carlos' traces point to rmqueue_bulk() specifically, which tries to refill the percpu cache by allocating a large batch of pages in a loop. It highlights how once the native freelists are exhausted, the fallback code first scans orders top-down for whole blocks to claim, then falls back to a bottom-up search for the smallest buddy to steal. For the next batch page, it goes through the same thing again. This can be made more efficient. Since rmqueue_bulk() holds the zone->lock over the entire batch, the freelists are not subject to outside changes; when the search for a block to claim has already failed, there is no point in trying again for the next page. Modify __rmqueue() to remember the last successful fallback mode, and restart directly from there on the next rmqueue_bulk() iteration. Oliver confirms that this improves beyond the regression that the test robot reported against c2f6ea38fc1b: commit: f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap") c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy") acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") 2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- your patch f3b92176f4f7100f c2f6ea38fc1b640aa7a2e155cc1 acc4d5ff0b61eb1715c498b6536 2c847f27c37da65a93d23c237c5 ---------------- --------------------------- --------------------------- --------------------------- %stddev %change %stddev %change %stddev %change %stddev \ | \ | \ | \ 25525364 ± 3% -56.4% 11135467 -57.8% 10779336 +31.6% 33581409 vm-scalability.throughput Carlos confirms that worst-case times are almost fully recovered compared to before the earlier culprit patch: 2dd482ba627d (before freelist hygiene): 1ms c0cd6f557b90 (after freelist hygiene): 90ms next-20250319 (steal smallest buddy): 280ms this patch : 8ms [jackmanb@google.com: comment updates] Link: https://lkml.kernel.org/r/D92AC0P9594X.3BML64MUKTF8Z@google.com [hannes@cmpxchg.org: reset rmqueue_mode in rmqueue_buddy() error loop, per Yunsheng Lin] Link: https://lkml.kernel.org/r/20250409140023.GA2313@cmpxchg.org Link: https://lkml.kernel.org/r/20250407180154.63348-1-hannes@cmpxchg.org Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Brendan Jackman <jackmanb@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Carlos Song <carlos.song@nxp.com> Tested-by: Carlos Song <carlos.song@nxp.com> Tested-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com Reviewed-by: Brendan Jackman <jackmanb@google.com> Tested-by: Shivank Garg <shivankg@amd.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [6.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18mm: page_alloc: don't steal single pages from biggest buddyJohannes Weiner1-46/+34
commit c2f6ea38fc1b640aa7a2e155cc1c0410ff91afa2 upstream. The fallback code searches for the biggest buddy first in an attempt to steal the whole block and encourage type grouping down the line. The approach used to be this: - Non-movable requests will split the largest buddy and steal the remainder. This splits up contiguity, but it allows subsequent requests of this type to fall back into adjacent space. - Movable requests go and look for the smallest buddy instead. The thinking is that movable requests can be compacted, so grouping is less important than retaining contiguity. c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") enforces freelist type hygiene, which restricts stealing to either claiming the whole block or just taking the requested chunk; no additional pages or buddy remainders can be stolen any more. The patch mishandled when to switch to finding the smallest buddy in that new reality. As a result, it may steal the exact request size, but from the biggest buddy. This causes fracturing for no good reason. Fix this by committing to the new behavior: either steal the whole block, or fall back to the smallest buddy. Remove single-page stealing from steal_suitable_fallback(). Rename it to try_to_steal_block() to make the intentions clear. If this fails, always fall back to the smallest buddy. The following is from 4 runs of mmtest's thpchallenge. "Pollute" is single page fallback, "steal" is conversion of a partially used block. The numbers for free block conversions (omitted) are comparable. vanilla patched @pollute[unmovable from reclaimable]: 27 106 @pollute[unmovable from movable]: 82 46 @pollute[reclaimable from unmovable]: 256 83 @pollute[reclaimable from movable]: 46 8 @pollute[movable from unmovable]: 4841 868 @pollute[movable from reclaimable]: 5278 12568 @steal[unmovable from reclaimable]: 11 12 @steal[unmovable from movable]: 113 49 @steal[reclaimable from unmovable]: 19 34 @steal[reclaimable from movable]: 47 21 @steal[movable from unmovable]: 250 183 @steal[movable from reclaimable]: 81 93 The allocator appears to do a better job at keeping stealing and polluting to the first fallback preference. As a result, the numbers for "from movable" - the least preferred fallback option, and most detrimental to compactability - are down across the board. Link: https://lkml.kernel.org/r/20250225001023.1494422-2-hannes@cmpxchg.org Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18memblock: Accept allocated memory before use in memblock_double_array()Tom Lendacky1-1/+8
commit da8bf5daa5e55a6af2b285ecda460d6454712ff4 upstream. When increasing the array size in memblock_double_array() and the slab is not yet available, a call to memblock_find_in_range() is used to reserve/allocate memory. However, the range returned may not have been accepted, which can result in a crash when booting an SNP guest: RIP: 0010:memcpy_orig+0x68/0x130 Code: ... RSP: 0000:ffffffff9cc03ce8 EFLAGS: 00010006 RAX: ff11001ff83e5000 RBX: 0000000000000000 RCX: fffffffffffff000 RDX: 0000000000000bc0 RSI: ffffffff9dba8860 RDI: ff11001ff83e5c00 RBP: 0000000000002000 R08: 0000000000000000 R09: 0000000000002000 R10: 000000207fffe000 R11: 0000040000000000 R12: ffffffff9d06ef78 R13: ff11001ff83e5000 R14: ffffffff9dba7c60 R15: 0000000000000c00 memblock_double_array+0xff/0x310 memblock_add_range+0x1fb/0x2f0 memblock_reserve+0x4f/0xa0 memblock_alloc_range_nid+0xac/0x130 memblock_alloc_internal+0x53/0xc0 memblock_alloc_try_nid+0x3d/0xa0 swiotlb_init_remap+0x149/0x2f0 mem_init+0xb/0xb0 mm_core_init+0x8f/0x350 start_kernel+0x17e/0x5d0 x86_64_start_reservations+0x14/0x30 x86_64_start_kernel+0x92/0xa0 secondary_startup_64_no_verify+0x194/0x19b Mitigate this by calling accept_memory() on the memory range returned before the slab is available. Prior to v6.12, the accept_memory() interface used a 'start' and 'end' parameter instead of 'start' and 'size', therefore the accept_memory() call must be adjusted to specify 'start + size' for 'end' when applying to kernels prior to v6.12. Cc: stable@vger.kernel.org # see patch description, needs adjustments for <= 6.11 Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/da1ac73bf4ded761e21b4e4bb5178382a580cd73.1746725050.git.thomas.lendacky@amd.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18mm/huge_memory: fix dereferencing invalid pmd migration entryGavin Guo1-3/+8
commit be6e843fc51a584672dfd9c4a6a24c8cb81d5fb7 upstream. When migrating a THP, concurrent access to the PMD migration entry during a deferred split scan can lead to an invalid address access, as illustrated below. To prevent this invalid access, it is necessary to check the PMD migration entry and return early. In this context, there is no need to use pmd_to_swp_entry and pfn_swap_entry_to_page to verify the equality of the target folio. Since the PMD migration entry is locked, it cannot be served as the target. Mailing list discussion and explanation from Hugh Dickins: "An anon_vma lookup points to a location which may contain the folio of interest, but might instead contain another folio: and weeding out those other folios is precisely what the "folio != pmd_folio((*pmd)" check (and the "risk of replacing the wrong folio" comment a few lines above it) is for." BUG: unable to handle page fault for address: ffffea60001db008 CPU: 0 UID: 0 PID: 2199114 Comm: tee Not tainted 6.14.0+ #4 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:split_huge_pmd_locked+0x3b5/0x2b60 Call Trace: <TASK> try_to_migrate_one+0x28c/0x3730 rmap_walk_anon+0x4f6/0x770 unmap_folio+0x196/0x1f0 split_huge_page_to_list_to_order+0x9f6/0x1560 deferred_split_scan+0xac5/0x12a0 shrinker_debugfs_scan_write+0x376/0x470 full_proxy_write+0x15c/0x220 vfs_write+0x2fc/0xcb0 ksys_write+0x146/0x250 do_syscall_64+0x6a/0x120 entry_SYSCALL_64_after_hwframe+0x76/0x7e The bug is found by syzkaller on an internal kernel, then confirmed on upstream. Link: https://lkml.kernel.org/r/20250421113536.3682201-1-gavinguo@igalia.com Link: https://lore.kernel.org/all/20250414072737.1698513-1-gavinguo@igalia.com/ Link: https://lore.kernel.org/all/20250418085802.2973519-1-gavinguo@igalia.com/ Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path") Signed-off-by: Gavin Guo <gavinguo@igalia.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Cc: Florent Revest <revest@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18mm: vmalloc: support more granular vrealloc() sizingKees Cook1-7/+24
commit a0309faf1cb0622cac7c820150b7abf2024acff5 upstream. Introduce struct vm_struct::requested_size so that the requested (re)allocation size is retained separately from the allocated area size. This means that KASAN will correctly poison the correct spans of requested bytes. This also means we can support growing the usable portion of an allocation that can already be supported by the existing area's existing allocation. Link: https://lkml.kernel.org/r/20250426001105.it.679-kees@kernel.org Fixes: 3ddc2fefe6f3 ("mm: vmalloc: implement vrealloc()") Signed-off-by: Kees Cook <kees@kernel.org> Reported-by: Erhard Furtner <erhard_f@mailbox.org> Closes: https://lore.kernel.org/all/20250408192503.6149a816@outsider.home/ Reviewed-by: Danilo Krummrich <dakr@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-18mm: fix folio_pte_batch() on XEN PVPetr Vaněk1-16/+11
commit 7b08b74f3d99f6b801250683c751d391128799ec upstream. On XEN PV, folio_pte_batch() can incorrectly batch beyond the end of a folio due to a corner case in pte_advance_pfn(). Specifically, when the PFN following the folio maps to an invalidated MFN, expected_pte = pte_advance_pfn(expected_pte, nr); produces a pte_none(). If the actual next PTE in memory is also pte_none(), the pte_same() succeeds, if (!pte_same(pte, expected_pte)) break; the loop is not broken, and batching continues into unrelated memory. For example, with a 4-page folio, the PTE layout might look like this: [ 53.465673] [ T2552] folio_pte_batch: printing PTE values at addr=0x7f1ac9dc5000 [ 53.465674] [ T2552] PTE[453] = 000000010085c125 [ 53.465679] [ T2552] PTE[454] = 000000010085d125 [ 53.465682] [ T2552] PTE[455] = 000000010085e125 [ 53.465684] [ T2552] PTE[456] = 000000010085f125 [ 53.465686] [ T2552] PTE[457] = 0000000000000000 <-- not present [ 53.465689] [ T2552] PTE[458] = 0000000101da7125 pte_advance_pfn(PTE[456]) returns a pte_none() due to invalid PFN->MFN mapping. The next actual PTE (PTE[457]) is also pte_none(), so the loop continues and includes PTE[457] in the batch, resulting in 5 batched entries for a 4-page folio. This triggers the following warning: [ 53.465751] [ T2552] page: refcount:85 mapcount:20 mapping:ffff88813ff4f6a8 index:0x110 pfn:0x10085c [ 53.465754] [ T2552] head: order:2 mapcount:80 entire_mapcount:0 nr_pages_mapped:4 pincount:0 [ 53.465756] [ T2552] memcg:ffff888003573000 [ 53.465758] [ T2552] aops:0xffffffff8226fd20 ino:82467c dentry name(?):"libc.so.6" [ 53.465761] [ T2552] flags: 0x2000000000416c(referenced|uptodate|lru|active|private|head|node=0|zone=2) [ 53.465764] [ T2552] raw: 002000000000416c ffffea0004021f08 ffffea0004021908 ffff88813ff4f6a8 [ 53.465767] [ T2552] raw: 0000000000000110 ffff888133d8bd40 0000005500000013 ffff888003573000 [ 53.465768] [ T2552] head: 002000000000416c ffffea0004021f08 ffffea0004021908 ffff88813ff4f6a8 [ 53.465770] [ T2552] head: 0000000000000110 ffff888133d8bd40 0000005500000013 ffff888003573000 [ 53.465772] [ T2552] head: 0020000000000202 ffffea0004021701 000000040000004f 00000000ffffffff [ 53.465774] [ T2552] head: 0000000300000003 8000000300000002 0000000000000013 0000000000000004 [ 53.465775] [ T2552] page dumped because: VM_WARN_ON_FOLIO((_Generic((page + nr_pages - 1), const struct page *: (const struct folio *)_compound_head(page + nr_pages - 1), struct page *: (struct folio *)_compound_head(page + nr_pages - 1))) != folio) Original code works as expected everywhere, except on XEN PV, where pte_advance_pfn() can yield a pte_none() after balloon inflation due to MFNs invalidation. In XEN, pte_advance_pfn() ends up calling __pte()->xen_make_pte()->pte_pfn_to_mfn(), which returns pte_none() when mfn == INVALID_P2M_ENTRY. The pte_pfn_to_mfn() documents that nastiness: If there's no mfn for the pfn, then just create an empty non-present pte. Unfortunately this loses information about the original pfn, so pte_mfn_to_pfn is asymmetric. While such hacks should certainly be removed, we can do better in folio_pte_batch() and simply check ahead of time how many PTEs we can possibly batch in our folio. This way, we can not only fix the issue but cleanup the code: removing the pte_pfn() check inside the loop body and avoiding end_ptr comparison + arithmetic. Link: https://lkml.kernel.org/r/20250502215019.822-2-arkamar@atlas.cz Fixes: f8d937761d65 ("mm/memory: optimize fork() with PTE-mapped THP") Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Petr Vaněk <arkamar@atlas.cz> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09mm, slab: clean up slab->obj_exts alwaysZhenhua Huang1-20/+7
commit be8250786ca94952a19ce87f98ad9906448bc9ef upstream. When memory allocation profiling is disabled at runtime or due to an error, shutdown_mem_profiling() is called: slab->obj_exts which previously allocated remains. It won't be cleared by unaccount_slab() because of mem_alloc_profiling_enabled() not true. It's incorrect, slab->obj_exts should always be cleaned up in unaccount_slab() to avoid following error: [...]BUG: Bad page state in process... .. [...]page dumped because: page still charged to cgroup [andriy.shevchenko@linux.intel.com: fold need_slab_obj_ext() into its only user] Fixes: 21c690a349ba ("mm: introduce slabobj_ext to support slab object extensions") Cc: stable@vger.kernel.org Signed-off-by: Zhenhua Huang <quic_zhenhuah@quicinc.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Harry Yoo <harry.yoo@oracle.com> Tested-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Suren Baghdasaryan <surenb@google.com> Link: https://patch.msgid.link/20250421075232.2165527-1-quic_zhenhuah@quicinc.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz> [surenb: fixed trivial merge conflict in alloc_tagging_slab_alloc_hook(), skipped inlining free_slab_obj_exts() as it's already inline in 6.14] Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09mm/memblock: repeat setting reserved region nid if array is doubledWei Yang1-0/+10
commit eac8ea8736ccc09513152d970eb2a42ed78e87e8 upstream. Commit 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()") introduce a way to set nid to all reserved region. But there is a corner case it will leave some region with invalid nid. When memblock_set_node() doubles the array of memblock.reserved, it may lead to a new reserved region before current position. The new region will be left with an invalid node id. Repeat the process when detecting it. Fixes: 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> CC: Mike Rapoport <rppt@kernel.org> CC: Yajun Deng <yajun.deng@linux.dev> CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250318071948.23854-3-richard.weiyang@gmail.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-09mm/memblock: pass size instead of end to memblock_set_node()Wei Yang1-1/+1
commit 06eaa824fd239edd1eab2754f29b2d03da313003 upstream. The second parameter of memblock_set_node() is size instead of end. Since it iterates from lower address to higher address, finally the node id is correct. But during the process, some of them are wrong. Pass size instead of end. Fixes: 61167ad5fecd ("mm: pass nid to reserve_bootmem_region()") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> CC: Mike Rapoport <rppt@kernel.org> CC: Yajun Deng <yajun.deng@linux.dev> CC: stable@vger.kernel.org Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20250318071948.23854-2-richard.weiyang@gmail.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-05-02mm/vmscan: don't try to reclaim hwpoison folioJinjiang Tu1-0/+7
[ Upstream commit 1b0449544c6482179ac84530b61fc192a6527bfd ] Syzkaller reports a bug as follows: Injecting memory failure for pfn 0x18b00e at process virtual address 0x20ffd000 Memory failure: 0x18b00e: dirty swapcache page still referenced by 2 users Memory failure: 0x18b00e: recovery action for dirty swapcache page: Failed page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x20ffd pfn:0x18b00e memcg:ffff0000dd6d9000 anon flags: 0x5ffffe00482011(locked|dirty|arch_1|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0xfffff) raw: 005ffffe00482011 dead000000000100 dead000000000122 ffff0000e232a7c9 raw: 0000000000020ffd 0000000000000000 00000002ffffffff ffff0000dd6d9000 page dumped because: VM_BUG_ON_FOLIO(!folio_test_uptodate(folio)) ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:184! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 0 PID: 60 Comm: kswapd0 Not tainted 6.6.0-gcb097e7de84e #3 Hardware name: linux,dummy-virt (DT) pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : add_to_swap+0xbc/0x158 lr : add_to_swap+0xbc/0x158 sp : ffff800087f37340 x29: ffff800087f37340 x28: fffffc00052c0380 x27: ffff800087f37780 x26: ffff800087f37490 x25: ffff800087f37c78 x24: ffff800087f377a0 x23: ffff800087f37c50 x22: 0000000000000000 x21: fffffc00052c03b4 x20: 0000000000000000 x19: fffffc00052c0380 x18: 0000000000000000 x17: 296f696c6f662865 x16: 7461646f7470755f x15: 747365745f6f696c x14: 6f6621284f494c4f x13: 0000000000000001 x12: ffff600036d8b97b x11: 1fffe00036d8b97a x10: ffff600036d8b97a x9 : dfff800000000000 x8 : 00009fffc9274686 x7 : ffff0001b6c5cbd3 x6 : 0000000000000001 x5 : ffff0000c25896c0 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 0000000000000000 x1 : ffff0000c25896c0 x0 : 0000000000000000 Call trace: add_to_swap+0xbc/0x158 shrink_folio_list+0x12ac/0x2648 shrink_inactive_list+0x318/0x948 shrink_lruvec+0x450/0x720 shrink_node_memcgs+0x280/0x4a8 shrink_node+0x128/0x978 balance_pgdat+0x4f0/0xb20 kswapd+0x228/0x438 kthread+0x214/0x230 ret_from_fork+0x10/0x20 I can reproduce this issue with the following steps: 1) When a dirty swapcache page is isolated by reclaim process and the page isn't locked, inject memory failure for the page. me_swapcache_dirty() clears uptodate flag and tries to delete from lru, but fails. Reclaim process will put the hwpoisoned page back to lru. 2) The process that maps the hwpoisoned page exits, the page is deleted the page will never be freed and will be in the lru forever. 3) If we trigger a reclaim again and tries to reclaim the page, add_to_swap() will trigger VM_BUG_ON_FOLIO due to the uptodate flag is cleared. To fix it, skip the hwpoisoned page in shrink_folio_list(). Besides, the hwpoison folio may not be unmapped by hwpoison_user_mappings() yet, unmap it in shrink_folio_list(), otherwise the folio will fail to be unmaped by hwpoison_user_mappings() since the folio isn't in lru list. Link: https://lkml.kernel.org/r/20250318083939.987651-3-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger,kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-25mm/vma: add give_up_on_oom option on modify/merge, use in uffd releaseLorenzo Stoakes3-7/+53
commit 41e6ddcaa0f18dda4c3fadf22533775a30d6f72f upstream. Currently, if a VMA merge fails due to an OOM condition arising on commit merge or a failure to duplicate anon_vma's, we report this so the caller can handle it. However there are cases where the caller is only ostensibly trying a merge, and doesn't mind if it fails due to this condition. Since we do not want to introduce an implicit assumption that we only actually modify VMAs after OOM conditions might arise, add a 'give up on oom' option and make an explicit contract that, should this flag be set, we absolutely will not modify any VMAs should OOM arise and just bail out. Since it'd be very unusual for a user to try to vma_modify() with this flag set but be specifying a range within a VMA which ends up being split (which can fail due to rlimit issues, not only OOM), we add a debug warning for this condition. The motivating reason for this is uffd release - syzkaller (and Pedro Falcato's VERY astute analysis) found a way in which an injected fault on allocation, triggering an OOM condition on commit merge, would result in uffd code becoming confused and treating an error value as if it were a VMA pointer. To avoid this, we make use of this new VMG flag to ensure that this never occurs, utilising the fact that, should we be clearing entire VMAs, we do not wish an OOM event to be reported to us. Many thanks to Pedro Falcato for his excellent analysis and Jann Horn for his insightful and intelligent analysis of the situation, both of whom were instrumental in this fix. Link: https://lkml.kernel.org/r/20250321100937.46634-1-lorenzo.stoakes@oracle.com Reported-by: syzbot+20ed41006cf9d842c2b5@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001e.GAE@google.com/ Fixes: 47b16d0462a4 ("mm: abort vma_modify() on merge out of memory failure") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Pedro Falcato <pfalcato@suse.de> Suggested-by: Jann Horn <jannh@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25slab: ensure slab->obj_exts is clear in a newly allocated slab pageSuren Baghdasaryan1-0/+10
commit d2f5819b6ed357c0c350c0616b6b9f38be59adf6 upstream. ktest recently reported crashes while running several buffered io tests with __alloc_tagging_slab_alloc_hook() at the top of the crash call stack. The signature indicates an invalid address dereference with low bits of slab->obj_exts being set. The bits were outside of the range used by page_memcg_data_flags and objext_flags and hence were not masked out by slab_obj_exts() when obtaining the pointer stored in slab->obj_exts. The typical crash log looks like this: 00510 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010 00510 Mem abort info: 00510 ESR = 0x0000000096000045 00510 EC = 0x25: DABT (current EL), IL = 32 bits 00510 SET = 0, FnV = 0 00510 EA = 0, S1PTW = 0 00510 FSC = 0x05: level 1 translation fault 00510 Data abort info: 00510 ISV = 0, ISS = 0x00000045, ISS2 = 0x00000000 00510 CM = 0, WnR = 1, TnD = 0, TagAccess = 0 00510 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 00510 user pgtable: 4k pages, 39-bit VAs, pgdp=0000000104175000 00510 [0000000000000010] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 00510 Internal error: Oops: 0000000096000045 [#1] SMP 00510 Modules linked in: 00510 CPU: 10 UID: 0 PID: 7692 Comm: cat Not tainted 6.15.0-rc1-ktest-g189e17946605 #19327 NONE 00510 Hardware name: linux,dummy-virt (DT) 00510 pstate: 20001005 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00510 pc : __alloc_tagging_slab_alloc_hook+0xe0/0x190 00510 lr : __kmalloc_noprof+0x150/0x310 00510 sp : ffffff80c87df6c0 00510 x29: ffffff80c87df6c0 x28: 000000000013d1ff x27: 000000000013d200 00510 x26: ffffff80c87df9e0 x25: 0000000000000000 x24: 0000000000000001 00510 x23: ffffffc08041953c x22: 000000000000004c x21: ffffff80c0002180 00510 x20: fffffffec3120840 x19: ffffff80c4821000 x18: 0000000000000000 00510 x17: fffffffec3d02f00 x16: fffffffec3d02e00 x15: fffffffec3d00700 00510 x14: fffffffec3d00600 x13: 0000000000000200 x12: 0000000000000006 00510 x11: ffffffc080bb86c0 x10: 0000000000000000 x9 : ffffffc080201e58 00510 x8 : ffffff80c4821060 x7 : 0000000000000000 x6 : 0000000055555556 00510 x5 : 0000000000000001 x4 : 0000000000000010 x3 : 0000000000000060 00510 x2 : 0000000000000000 x1 : ffffffc080f50cf8 x0 : ffffff80d801d000 00510 Call trace: 00510 __alloc_tagging_slab_alloc_hook+0xe0/0x190 (P) 00510 __kmalloc_noprof+0x150/0x310 00510 __bch2_folio_create+0x5c/0xf8 00510 bch2_folio_create+0x2c/0x40 00510 bch2_readahead+0xc0/0x460 00510 read_pages+0x7c/0x230 00510 page_cache_ra_order+0x244/0x3a8 00510 page_cache_async_ra+0x124/0x170 00510 filemap_readahead.isra.0+0x58/0xa0 00510 filemap_get_pages+0x454/0x7b0 00510 filemap_read+0xdc/0x418 00510 bch2_read_iter+0x100/0x1b0 00510 vfs_read+0x214/0x300 00510 ksys_read+0x6c/0x108 00510 __arm64_sys_read+0x20/0x30 00510 invoke_syscall.constprop.0+0x54/0xe8 00510 do_el0_svc+0x44/0xc8 00510 el0_svc+0x18/0x58 00510 el0t_64_sync_handler+0x104/0x130 00510 el0t_64_sync+0x154/0x158 00510 Code: d5384100 f9401c01 b9401aa3 b40002e1 (f8227881) 00510 ---[ end trace 0000000000000000 ]--- 00510 Kernel panic - not syncing: Oops: Fatal exception 00510 SMP: stopping secondary CPUs 00510 Kernel Offset: disabled 00510 CPU features: 0x0000,000000e0,00000410,8240500b 00510 Memory Limit: none Investigation indicates that these bits are already set when we allocate slab page and are not zeroed out after allocation. We are not yet sure why these crashes start happening only recently but regardless of the reason, not initializing a field that gets used later is wrong. Fix it by initializing slab->obj_exts during slab page allocation. Fixes: 21c690a349ba ("mm: introduce slabobj_ext to support slab object extensions") Reported-by: Kent Overstreet <kent.overstreet@linux.dev> Tested-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20250411155737.1360746-1-surenb@google.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25mm: fix apply_to_existing_page_range()Kirill A. Shutemov1-2/+2
commit a995199384347261bb3f21b2e171fa7f988bd2f8 upstream. In the case of apply_to_existing_page_range(), apply_to_pte_range() is reached with 'create' set to false. When !create, the loop over the PTE page table is broken. apply_to_pte_range() will only move to the next PTE entry if 'create' is true or if the current entry is not pte_none(). This means that the user of apply_to_existing_page_range() will not have 'fn' called for any entries after the first pte_none() in the PTE page table. Fix the loop logic in apply_to_pte_range(). There are no known runtime issues from this, but the fix is trivial enough for stable@ even without a known buggy user. Link: https://lkml.kernel.org/r/20250409094043.1629234-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Fixes: be1db4753ee6 ("mm/memory.c: add apply_to_existing_page_range() helper") Cc: Daniel Axtens <dja@axtens.net> Cc: David Hildenbrand <david@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25mm: fix filemap_get_folios_contig returning batches of identical foliosVishal Moola (Oracle)1-0/+1
commit 8ab1b16023961dc640023b10436d282f905835ad upstream. filemap_get_folios_contig() is supposed to return distinct folios found within [start, end]. Large folios in the Xarray become multi-index entries. xas_next() can iterate through the sub-indexes before finding a sibling entry and breaking out of the loop. This can result in a returned folio_batch containing an indeterminate number of duplicate folios, which forces the callers to skeptically handle the returned batch. This is inefficient and incurs a large maintenance overhead. We can fix this by calling xas_advance() after we have successfully adding a folio to the batch to ensure our Xarray is positioned such that it will correctly find the next folio - similar to filemap_get_read_batch(). Link: https://lkml.kernel.org/r/Z-8s1-kiIDkzgRbc@fedora Fixes: 35b471467f88 ("filemap: add filemap_get_folios_contig()") Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reported-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Closes: https://lkml.kernel.org/r/b714e4de-2583-4035-b829-72cfb5eb6fc6@gmx.com Tested-by: Qu Wenruo <quwenruo.btrfs@gmx.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25mm/gup: fix wrongly calculated returned value in fault_in_safe_writeable()Baoquan He1-2/+2
commit 8c03ebd7cdc06bd0d2fecb4d1a609ef1dbb7d0aa upstream. Not like fault_in_readable() or fault_in_writeable(), in fault_in_safe_writeable() local variable 'start' is increased page by page to loop till the whole address range is handled. However, it mistakenly calculates the size of the handled range with 'uaddr - start'. Fix it here. Andreas said: : In gfs2, fault_in_iov_iter_writeable() is used in : gfs2_file_direct_read() and gfs2_file_read_iter(), so this potentially : affects buffered as well as direct reads. This bug could cause those : gfs2 functions to spin in a loop. Link: https://lkml.kernel.org/r/20250410035717.473207-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20250410035717.473207-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Fixes: fe673d3f5bf1 ("mm: gup: make fault_in_safe_writeable() use fixup_user_fault()") Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Yanjun.Zhu <yanjun.zhu@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-25mm/compaction: fix bug in hugetlb handling pathwayVishal Moola (Oracle)1-3/+3
commit a84edd52f0a0fa193f0f685769939cf84510755b upstream. The compaction code doesn't take references on pages until we're certain we should attempt to handle it. In the hugetlb case, isolate_or_dissolve_huge_page() may return -EBUSY without taking a reference to the folio associated with our pfn. If our folio's refcount drops to 0, compound_nr() becomes unpredictable, making low_pfn and nr_scanned unreliable. The user-visible effect is minimal - this should rarely happen (if ever). Fix this by storing the folio statistics earlier on the stack (just like the THP and Buddy cases). Also revert commit 66fe1cf7f581 ("mm: compaction: use helper compound_nr in isolate_migratepages_block") to make backporting easier. Link: https://lkml.kernel.org/r/20250401021025.637333-1-vishal.moola@gmail.com Fixes: 369fa227c219 ("mm: make alloc_contig_range handle free hugetlb pages") Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20mm/hwpoison: introduce folio_contain_hwpoisoned_page() helperJinjiang Tu2-4/+2
commit 5f5ee52d4f58605330b09851273d6e56aaadd29e upstream. Patch series "mm/vmscan: don't try to reclaim hwpoison folio". Fix a bug during memory reclaim if folio is hwpoisoned. This patch (of 2): Introduce helper folio_contain_hwpoisoned_page() to check if the entire folio is hwpoisoned or it contains hwpoisoned pages. Link: https://lkml.kernel.org/r/20250318083939.987651-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20250318083939.987651-2-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20mm/hugetlb: move hugetlb_sysctl_init() to the __init sectionMarc Herbert1-1/+1
commit 1ca77ff1837249701053a7fcbdedabc41f4ae67c upstream. hugetlb_sysctl_init() is only invoked once by an __init function and is merely a wrapper around another __init function so there is not reason to keep it. Fixes the following warning when toning down some GCC inline options: WARNING: modpost: vmlinux: section mismatch in reference: hugetlb_sysctl_init+0x1b (section: .text) -> __register_sysctl_init (section: .init.text) Link: https://lkml.kernel.org/r/20250319060041.2737320-1-marc.herbert@linux.intel.com Signed-off-by: Marc Herbert <Marc.Herbert@linux.intel.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20mm/hwpoison: do not send SIGBUS to processes with recovered clean pagesShuai Xue1-3/+8
commit aaf99ac2ceb7c974f758a635723eeaf48596388e upstream. When an uncorrected memory error is consumed there is a race between the CMCI from the memory controller reporting an uncorrected error with a UCNA signature, and the core reporting and SRAR signature machine check when the data is about to be consumed. - Background: why *UN*corrected errors tied to *C*MCI in Intel platform [1] Prior to Icelake memory controllers reported patrol scrub events that detected a previously unseen uncorrected error in memory by signaling a broadcast machine check with an SRAO (Software Recoverable Action Optional) signature in the machine check bank. This was overkill because it's not an urgent problem that no core is on the verge of consuming that bad data. It's also found that multi SRAO UCE may cause nested MCE interrupts and finally become an IERR. Hence, Intel downgrades the machine check bank signature of patrol scrub from SRAO to UCNA (Uncorrected, No Action required), and signal changed to #CMCI. Just to add to the confusion, Linux does take an action (in uc_decode_notifier()) to try to offline the page despite the UC*NA* signature name. - Background: why #CMCI and #MCE race when poison is consuming in Intel platform [1] Having decided that CMCI/UCNA is the best action for patrol scrub errors, the memory controller uses it for reads too. But the memory controller is executing asynchronously from the core, and can't tell the difference between a "real" read and a speculative read. So it will do CMCI/UCNA if an error is found in any read. Thus: 1) Core is clever and thinks address A is needed soon, issues a speculative read. 2) Core finds it is going to use address A soon after sending the read request 3) The CMCI from the memory controller is in a race with MCE from the core that will soon try to retire the load from address A. Quite often (because speculation has got better) the CMCI from the memory controller is delivered before the core is committed to the instruction reading address A, so the interrupt is taken, and Linux offlines the page (marking it as poison). - Why user process is killed for instr case Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") tries to fix noise message "Memory error not recovered" and skips duplicate SIGBUSs due to the race. But it also introduced a bug that kill_accessing_process() return -EHWPOISON for instr case, as result, kill_me_maybe() send a SIGBUS to user process. If the CMCI wins that race, the page is marked poisoned when uc_decode_notifier() calls memory_failure(). For dirty pages, memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag, converting the PTE to a hwpoison entry. As a result, kill_accessing_process(): - call walk_page_range() and return 1 regardless of whether try_to_unmap() succeeds or fails, - call kill_proc() to make sure a SIGBUS is sent - return -EHWPOISON to indicate that SIGBUS is already sent to the process and kill_me_maybe() doesn't have to send it again. However, for clean pages, the TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted to a hwpoison entry. Conversely, for clean pages where PTE entries are not marked as hwpoison, kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a SIGBUS. Console log looks like this: Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered Memory failure: 0x827ca68: already hardware poisoned mce: Memory error not recovered To fix it, return 0 for "corrupted page was clean", preventing an unnecessary SIGBUS to user process. [1] https://lore.kernel.org/lkml/20250217063335.22257-1-xueshuai@linux.alibaba.com/T/#mba94f1305b3009dd340ce4114d3221fe810d1871 Link: https://lkml.kernel.org/r/20250312112852.82415-3-xueshuai@linux.alibaba.com Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"") Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> Tested-by: Tony Luck <tony.luck@intel.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Borislav Betkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ruidong Tian <tianruidong@linux.alibaba.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20mm: add missing release barrier on PGDAT_RECLAIM_LOCKED unlockMathieu Desnoyers1-1/+1
commit c0ebbb3841e07c4493e6fe351698806b09a87a37 upstream. The PGDAT_RECLAIM_LOCKED bit is used to provide mutual exclusion of node reclaim for struct pglist_data using a single bit. It is "locked" with a test_and_set_bit (similarly to a try lock) which provides full ordering with respect to loads and stores done within __node_reclaim(). It is "unlocked" with clear_bit(), which does not provide any ordering with respect to loads and stores done before clearing the bit. The lack of clear_bit() memory ordering with respect to stores within __node_reclaim() can cause a subsequent CPU to fail to observe stores from a prior node reclaim. This is not an issue in practice on TSO (e.g. x86), but it is an issue on weakly-ordered architectures (e.g. arm64). Fix this by using clear_bit_unlock rather than clear_bit to clear PGDAT_RECLAIM_LOCKED with a release memory ordering semantic. This provides stronger memory ordering (release rather than relaxed). Link: https://lkml.kernel.org/r/20250312141014.129725-1-mathieu.desnoyers@efficios.com Fixes: d773ed6b856a ("mm: test and set zone reclaim lock before starting reclaim") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Jade Alglave <j.alglave@ucl.ac.uk> Cc: Luc Maranget <luc.maranget@inria.fr> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20mm/mremap: correctly handle partial mremap() of VMA starting at 0Lorenzo Stoakes1-5/+5
commit 937582ee8e8d227c30ec147629a0179131feaa80 upstream. Patch series "refactor mremap and fix bug", v3. The existing mremap() logic has grown organically over a very long period of time, resulting in code that is in many parts, very difficult to follow and full of subtleties and sources of confusion. In addition, it is difficult to thread state through the operation correctly, as function arguments have expanded, some parameters are expected to be temporarily altered during the operation, others are intended to remain static and some can be overridden. This series completely refactors the mremap implementation, sensibly separating functions, adding comments to explain the more subtle aspects of the implementation and making use of small structs to thread state through everything. The reason for doing so is to lay the groundwork for planned future changes to the mremap logic, changes which require the ability to easily pass around state. Additionally, it would be unhelpful to add yet more logic to code that is already difficult to follow without first refactoring it like this. The first patch in this series additionally fixes a bug when a VMA with start address zero is partially remapped. Tested on real hardware under heavy workload and all self tests are passing. This patch (of 3): Consider the case of a partial mremap() (that results in a VMA split) of an accountable VMA (i.e. which has the VM_ACCOUNT flag set) whose start address is zero, with the MREMAP_MAYMOVE flag specified and a scenario where a move does in fact occur: addr end | | v v |-------------| | vma | |-------------| 0 This move is affected by unmapping the range [addr, end). In order to prevent an incorrect decrement of accounted memory which has already been determined, the mremap() code in move_vma() clears VM_ACCOUNT from the VMA prior to doing so, before reestablishing it in each of the VMAs post-split: addr end | | v v |---| |---| | A | | B | |---| |---| Commit 6b73cff239e5 ("mm: change munmap splitting order and move_vma()") changed this logic such as to determine whether there is a need to do so by establishing account_start and account_end and, in the instance where such an operation is required, assigning them to vma->vm_start and vma->vm_end. Later the code checks if the operation is required for 'A' referenced above thusly: if (account_start) { ... } However, if the VMA described above has vma->vm_start == 0, which is now assigned to account_start, this branch will not be executed. As a result, the VMA 'A' above will remain stripped of its VM_ACCOUNT flag, incorrectly. The fix is to simply convert these variables to booleans and set them as required. Link: https://lkml.kernel.org/r/cover.1741639347.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/dc55cb6db25d97c3d9e460de4986a323fa959676.1741639347.git.lorenzo.stoakes@oracle.com Fixes: 6b73cff239e5 ("mm: change munmap splitting order and move_vma()") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20mm: make page_mapped_in_vma() hugetlb walk awareJane Chu1-4/+9
commit 442b1eca223b4860cc85ef970ae602d125aec5a4 upstream. When a process consumes a UE in a page, the memory failure handler attempts to collect information for a potential SIGBUS. If the page is an anonymous page, page_mapped_in_vma(page, vma) is invoked in order to 1. retrieve the vaddr from the process' address space, 2. verify that the vaddr is indeed mapped to the poisoned page, where 'page' is the precise small page with UE. It's been observed that when injecting poison to a non-head subpage of an anonymous hugetlb page, no SIGBUS shows up, while injecting to the head page produces a SIGBUS. The cause is that, though hugetlb_walk() returns a valid pmd entry (on x86), but check_pte() detects mismatch between the head page per the pmd and the input subpage. Thus the vaddr is considered not mapped to the subpage and the process is not collected for SIGBUS purpose. This is the calling stack: collect_procs_anon page_mapped_in_vma page_vma_mapped_walk hugetlb_walk huge_pte_lock check_pte check_pte() header says that it "check if [pvmw->pfn, @pvmw->pfn + @pvmw->nr_pages) is mapped at the @pvmw->pte" but practically works only if pvmw->pfn is the head page pfn at pvmw->pte. Hindsight acknowledging that some pvmw->pte could point to a hugepage of some sort such that it makes sense to make check_pte() work for hugepage. Link: https://lkml.kernel.org/r/20250224211445.2663312-1-jane.chu@oracle.com Signed-off-by: Jane Chu <jane.chu@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: linmiaohe <linmiaohe@huawei.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20mm/rmap: reject hugetlb folios in folio_make_device_exclusive()David Hildenbrand1-1/+1
commit bc3fe6805cf09a25a086573a17d40e525208c5d8 upstream. Even though FOLL_SPLIT_PMD on hugetlb now always fails with -EOPNOTSUPP, let's add a safety net in case FOLL_SPLIT_PMD usage would ever be reworked. In particular, before commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code"), GUP(FOLL_SPLIT_PMD) would just have returned a page. In particular, hugetlb folios that are not PMD-sized would never have been prone to FOLL_SPLIT_PMD. hugetlb folios can be anonymous, and page_make_device_exclusive_one() is not really prepared for handling them at all. So let's spell that out. Link: https://lkml.kernel.org/r/20250210193801.781278-3-david@redhat.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Barry Song <v-songbaohua@oppo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20mm/damon: avoid applying DAMOS action to same entity multiple timesSeongJae Park2-12/+28
commit 94ba17adaba0f651fdcf745c8891a88e2e028cfa upstream. 'paddr' DAMON operations set can apply a DAMOS scheme's action to a large folio multiple times in single DAMOS-regions-walk if the folio is laid on multiple DAMON regions. Add a field for DAMOS scheme object that can be used by the underlying ops to know what was the last entity that the scheme's action has applied. The core layer unsets the field when each DAMOS-regions-walk is done for the given scheme. And update 'paddr' ops to use the infrastructure to avoid the problem. Link: https://lkml.kernel.org/r/20250207212033.45269-3-sj@kernel.org Fixes: 57223ac29584 ("mm/damon/paddr: support the pageout scheme") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: Usama Arif <usamaarif642@gmail.com> Closes: https://lore.kernel.org/20250203225604.44742-3-usamaarif642@gmail.com Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-20mm/damon/ops: have damon_get_folio return folio even for tail pagesUsama Arif2-7/+19
commit 3a06696305e757f652dd0dcf4dfa2272eda39434 upstream. Patch series "mm/damon/paddr: fix large folios access and schemes handling". DAMON operations set for physical address space, namely 'paddr', treats tail pages as unaccessed always. It can also apply DAMOS action to a large folio multiple times within single DAMOS' regions walking. As a result, the monitoring output has poor quality and DAMOS works in unexpected ways when large folios are being used. Fix those. The patches were parts of Usama's hugepage_size DAMOS filter patch series[1]. The first fix has collected from there with a slight commit message change for the subject prefix. The second fix is re-written by SJ and posted as an RFC before this series. The second one also got a slight commit message change for the subject prefix. [1] https://lore.kernel.org/20250203225604.44742-1-usamaarif642@gmail.com [2] https://lore.kernel.org/20250206231103.38298-1-sj@kernel.org This patch (of 2): This effectively adds support for large folios in damon for paddr, as damon_pa_mkold/young won't get a null folio from this function and won't ignore it, hence access will be checked and reported. This also means that larger folios will be considered for different DAMOS actions like pageout, prioritization and migration. As these DAMOS actions will consider larger folios, iterate through the region at folio_size and not PAGE_SIZE intervals. This should not have an affect on vaddr, as damon_young_pmd_entry considers pmd entries. Link: https://lkml.kernel.org/r/20250207212033.45269-1-sj@kernel.org Link: https://lkml.kernel.org/r/20250207212033.45269-2-sj@kernel.org Fixes: a28397beb55b ("mm/damon: implement primitives for physical address space monitoring") Signed-off-by: Usama Arif <usamaarif642@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10mm: zswap: fix crypto_free_acomp() deadlock in zswap_cpu_comp_dead()Yosry Ahmed1-8/+22
commit c11bcbc0a517acf69282c8225059b2a8ac5fe628 upstream. Currently, zswap_cpu_comp_dead() calls crypto_free_acomp() while holding the per-CPU acomp_ctx mutex. crypto_free_acomp() then holds scomp_lock (through crypto_exit_scomp_ops_async()). On the other hand, crypto_alloc_acomp_node() holds the scomp_lock (through crypto_scomp_init_tfm()), and then allocates memory. If the allocation results in reclaim, we may attempt to hold the per-CPU acomp_ctx mutex. The above dependencies can cause an ABBA deadlock. For example in the following scenario: (1) Task A running on CPU #1: crypto_alloc_acomp_node() Holds scomp_lock Enters reclaim Reads per_cpu_ptr(pool->acomp_ctx, 1) (2) Task A is descheduled (3) CPU #1 goes offline zswap_cpu_comp_dead(CPU #1) Holds per_cpu_ptr(pool->acomp_ctx, 1)) Calls crypto_free_acomp() Waits for scomp_lock (4) Task A running on CPU #2: Waits for per_cpu_ptr(pool->acomp_ctx, 1) // Read on CPU #1 DEADLOCK Since there is no requirement to call crypto_free_acomp() with the per-CPU acomp_ctx mutex held in zswap_cpu_comp_dead(), move it after the mutex is unlocked. Also move the acomp_request_free() and kfree() calls for consistency and to avoid any potential sublte locking dependencies in the future. With this, only setting acomp_ctx fields to NULL occurs with the mutex held. This is similar to how zswap_cpu_comp_prepare() only initializes acomp_ctx fields with the mutex held, after performing all allocations before holding the mutex. Opportunistically, move the NULL check on acomp_ctx so that it takes place before the mutex dereference. Link: https://lkml.kernel.org/r/20250226185625.2672936-1-yosry.ahmed@linux.dev Fixes: 12dcb0ef5406 ("mm: zswap: properly synchronize freeing resources during CPU hotunplug") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev> Reported-by: syzbot+1a517ccfcbc6a7ab0f82@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67bcea51.050a0220.bbfd1.0096.GAE@google.com/ Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Tested-by: Nhat Pham <nphamcs@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Chris Murphy <lists@colorremedies.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10mm/gup: reject FOLL_SPLIT_PMD with hugetlb VMAsDavid Hildenbrand1-0/+3
commit 8977752c8056a6a094a279004a49722da15bace3 upstream. Patch series "mm: fixes for device-exclusive entries (hmm)", v2. Discussing the PageTail() call in make_device_exclusive_range() with Willy, I recently discovered [1] that device-exclusive handling does not properly work with THP, making the hmm-tests selftests fail if THPs are enabled on the system. Looking into more details, I found that hugetlb is not properly fenced, and I realized that something that was bugging me for longer -- how device-exclusive entries interact with mapcounts -- completely breaks migration/swapout/split/hwpoison handling of these folios while they have device-exclusive PTEs. The program below can be used to allocate 1 GiB worth of pages and making them device-exclusive on a kernel with CONFIG_TEST_HMM. Once they are device-exclusive, these folios cannot get swapped out (proc$pid/smaps_rollup will always indicate 1 GiB RSS no matter how much one forces memory reclaim), and when having a memory block onlined to ZONE_MOVABLE, trying to offline it will loop forever and complain about failed migration of a page that should be movable. # echo offline > /sys/devices/system/memory/memory136/state # echo online_movable > /sys/devices/system/memory/memory136/state # ./hmm-swap & ... wait until everything is device-exclusive # echo offline > /sys/devices/system/memory/memory136/state [ 285.193431][T14882] page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x7f20671f7 pfn:0x442b6a [ 285.196618][T14882] memcg:ffff888179298000 [ 285.198085][T14882] anon flags: 0x5fff0000002091c(referenced|uptodate| dirty|active|owner_2|swapbacked|node=1|zone=3|lastcpupid=0x7ff) [ 285.201734][T14882] raw: ... [ 285.204464][T14882] raw: ... [ 285.207196][T14882] page dumped because: migration failure [ 285.209072][T14882] page_owner tracks the page as allocated [ 285.210915][T14882] page last allocated via order 0, migratetype Movable, gfp_mask 0x140dca(GFP_HIGHUSER_MOVABLE|__GFP_COMP|__GFP_ZERO), id 14926, tgid 14926 (hmm-swap), ts 254506295376, free_ts 227402023774 [ 285.216765][T14882] post_alloc_hook+0x197/0x1b0 [ 285.218874][T14882] get_page_from_freelist+0x76e/0x3280 [ 285.220864][T14882] __alloc_frozen_pages_noprof+0x38e/0x2740 [ 285.223302][T14882] alloc_pages_mpol+0x1fc/0x540 [ 285.225130][T14882] folio_alloc_mpol_noprof+0x36/0x340 [ 285.227222][T14882] vma_alloc_folio_noprof+0xee/0x1a0 [ 285.229074][T14882] __handle_mm_fault+0x2b38/0x56a0 [ 285.230822][T14882] handle_mm_fault+0x368/0x9f0 ... This series fixes all issues I found so far. There is no easy way to fix without a bigger rework/cleanup. I have a bunch of cleanups on top (some previous sent, some the result of the discussion in v1) that I will send out separately once this landed and I get to it. I wish we could just use some special present PROT_NONE PTEs instead of these (non-present, non-none) fake-swap entries; but that just results in the same problem we keep having (lack of spare PTE bits), and staring at other similar fake-swap entries, that ship has sailed. With this series, make_device_exclusive() doesn't actually belong into mm/rmap.c anymore, but I'll leave moving that for another day. I only tested this series with the hmm-tests selftests due to lack of HW, so I'd appreciate some testing, especially if the interaction between two GPUs wanting a device-exclusive entry works as expected. <program> #include <stdio.h> #include <fcntl.h> #include <stdint.h> #include <unistd.h> #include <stdlib.h> #include <string.h> #include <sys/mman.h> #include <sys/ioctl.h> #include <linux/types.h> #include <linux/ioctl.h> #define HMM_DMIRROR_EXCLUSIVE _IOWR('H', 0x05, struct hmm_dmirror_cmd) struct hmm_dmirror_cmd { __u64 addr; __u64 ptr; __u64 npages; __u64 cpages; __u64 faults; }; const size_t size = 1 * 1024 * 1024 * 1024ul; const size_t chunk_size = 2 * 1024 * 1024ul; int main(void) { struct hmm_dmirror_cmd cmd; size_t cur_size; int fd, ret; char *addr, *mirror; fd = open("/dev/hmm_dmirror1", O_RDWR, 0); if (fd < 0) { perror("open failed\n"); exit(1); } addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (addr == MAP_FAILED) { perror("mmap failed\n"); exit(1); } madvise(addr, size, MADV_NOHUGEPAGE); memset(addr, 1, size); mirror = malloc(chunk_size); for (cur_size = 0; cur_size < size; cur_size += chunk_size) { cmd.addr = (uintptr_t)addr + cur_size; cmd.ptr = (uintptr_t)mirror; cmd.npages = chunk_size / getpagesize(); ret = ioctl(fd, HMM_DMIRROR_EXCLUSIVE, &cmd); if (ret) { perror("ioctl failed\n"); exit(1); } } pause(); return 0; } </program> [1] https://lkml.kernel.org/r/25e02685-4f1d-47fa-be5b-01ff85bb0ce2@redhat.com This patch (of 17): We only have two FOLL_SPLIT_PMD users. While uprobe refuses hugetlb early, make_device_exclusive_range() can end up getting called on hugetlb VMAs. Right now, this means that with a PMD-sized hugetlb page, we can end up calling split_huge_pmd(), because pmd_trans_huge() also succeeds with hugetlb PMDs. For example, using a modified hmm-test selftest one can trigger: [ 207.017134][T14945] ------------[ cut here ]------------ [ 207.018614][T14945] kernel BUG at mm/page_table_check.c:87! [ 207.019716][T14945] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI [ 207.021072][T14945] CPU: 3 UID: 0 PID: ... [ 207.023036][T14945] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 [ 207.024834][T14945] RIP: 0010:page_table_check_clear.part.0+0x488/0x510 [ 207.026128][T14945] Code: ... [ 207.029965][T14945] RSP: 0018:ffffc9000cb8f348 EFLAGS: 00010293 [ 207.031139][T14945] RAX: 0000000000000000 RBX: 00000000ffffffff RCX: ffffffff8249a0cd [ 207.032649][T14945] RDX: ffff88811e883c80 RSI: ffffffff8249a357 RDI: ffff88811e883c80 [ 207.034183][T14945] RBP: ffff888105c0a050 R08: 0000000000000005 R09: 0000000000000000 [ 207.035688][T14945] R10: 00000000ffffffff R11: 0000000000000003 R12: 0000000000000001 [ 207.037203][T14945] R13: 0000000000000200 R14: 0000000000000001 R15: dffffc0000000000 [ 207.038711][T14945] FS: 00007f2783275740(0000) GS:ffff8881f4980000(0000) knlGS:0000000000000000 [ 207.040407][T14945] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 207.041660][T14945] CR2: 00007f2782c00000 CR3: 0000000132356000 CR4: 0000000000750ef0 [ 207.043196][T14945] PKRU: 55555554 [ 207.043880][T14945] Call Trace: [ 207.044506][T14945] <TASK> [ 207.045086][T14945] ? __die+0x51/0x92 [ 207.045864][T14945] ? die+0x29/0x50 [ 207.046596][T14945] ? do_trap+0x250/0x320 [ 207.047430][T14945] ? do_error_trap+0xe7/0x220 [ 207.048346][T14945] ? page_table_check_clear.part.0+0x488/0x510 [ 207.049535][T14945] ? handle_invalid_op+0x34/0x40 [ 207.050494][T14945] ? page_table_check_clear.part.0+0x488/0x510 [ 207.051681][T14945] ? exc_invalid_op+0x2e/0x50 [ 207.052589][T14945] ? asm_exc_invalid_op+0x1a/0x20 [ 207.053596][T14945] ? page_table_check_clear.part.0+0x1fd/0x510 [ 207.054790][T14945] ? page_table_check_clear.part.0+0x487/0x510 [ 207.055993][T14945] ? page_table_check_clear.part.0+0x488/0x510 [ 207.057195][T14945] ? page_table_check_clear.part.0+0x487/0x510 [ 207.058384][T14945] __page_table_check_pmd_clear+0x34b/0x5a0 [ 207.059524][T14945] ? __pfx___page_table_check_pmd_clear+0x10/0x10 [ 207.060775][T14945] ? __pfx___mutex_unlock_slowpath+0x10/0x10 [ 207.061940][T14945] ? __pfx___lock_acquire+0x10/0x10 [ 207.062967][T14945] pmdp_huge_clear_flush+0x279/0x360 [ 207.064024][T14945] split_huge_pmd_locked+0x82b/0x3750 ... Before commit 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code"), we would have ignored the flag; instead, let's simply refuse the combination completely in check_vma_flags(): the caller is likely not prepared to handle any hugetlb folios. We'll teach make_device_exclusive_range() separately to ignore any hugetlb folios as a future-proof safety net. Link: https://lkml.kernel.org/r/20250210193801.781278-1-david@redhat.com Link: https://lkml.kernel.org/r/20250210193801.781278-2-david@redhat.com Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Alex Shi <alexs@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Karol Herbst <kherbst@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Lyude <lyude@redhat.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yanteng Si <si.yanteng@linux.dev> Cc: Simona Vetter <simona.vetter@ffwll.ch> Cc: Barry Song <v-songbaohua@oppo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-04-10writeback: fix calculations in trace_balance_dirty_pages() for cgwbTang Yizhou1-1/+1
[ Upstream commit 6cc4c3aa714bc58ec5d20f3054ca5f23534984d1 ] In the commit dcc25ae76eb7 ("writeback: move global_dirty_limit into wb_domain") of the cgroup writeback backpressure propagation patchset, Tejun made some adaptations to trace_balance_dirty_pages() for cgroup writeback. However, this adaptation was incomplete and Tejun missed further adaptation in the subsequent patches. In the cgroup writeback scenario, if sdtc in balance_dirty_pages() is assigned to mdtc, then upon entering trace_balance_dirty_pages(), __entry->limit should be assigned based on the dirty_limit of the corresponding memcg's wb_domain, rather than global_wb_domain. To address this issue and simplify the implementation, introduce a 'limit' field in struct dirty_throttle_control to store the hard_limit value computed in wb_position_ratio() by calling hard_dirty_limit(). This field will then be used in trace_balance_dirty_pages() to assign the value to __entry->limit. Link: https://lkml.kernel.org/r/20250304110318.159567-4-yizhou.tang@shopee.com Fixes: dcc25ae76eb7 ("writeback: move global_dirty_limit into wb_domain") Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10writeback: let trace_balance_dirty_pages() take struct dtc as parameterTang Yizhou1-33/+2
[ Upstream commit f1ab2831e2a4312046bca79256b2efc41d373eaf ] Patch series "Fix calculations in trace_balance_dirty_pages() for cgwb", v2. In my experiment, I found that the output of trace_balance_dirty_pages() in the cgroup writeback scenario was strange because trace_balance_dirty_pages() always uses global_wb_domain.dirty_limit for related calculations instead of the dirty_limit of the corresponding memcg's wb_domain. The basic idea of the fix is to store the hard dirty limit value computed in wb_position_ratio() into struct dirty_throttle_control and use it for calculations in trace_balance_dirty_pages(). This patch (of 3): Currently, trace_balance_dirty_pages() already has 12 parameters. In the patch #3, I initially attempted to introduce an additional parameter. However, in include/linux/trace_events.h, bpf_trace_run12() only supports up to 12 parameters and bpf_trace_run13() does not exist. To reduce the number of parameters in trace_balance_dirty_pages(), we can make it accept a pointer to struct dirty_throttle_control as a parameter. To achieve this, we need to move the definition of struct dirty_throttle_control from mm/page-writeback.c to include/linux/writeback.h. Link: https://lkml.kernel.org/r/20250304110318.159567-1-yizhou.tang@shopee.com Link: https://lkml.kernel.org/r/20250304110318.159567-2-yizhou.tang@shopee.com Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Tang Yizhou <yizhou.tang@shopee.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 6cc4c3aa714b ("writeback: fix calculations in trace_balance_dirty_pages() for cgwb") Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()David Hildenbrand1-7/+4
[ Upstream commit dc84bc2aba85a1508f04a936f9f9a15f64ebfb31 ] If track_pfn_copy() fails, we already added the dst VMA to the maple tree. As fork() fails, we'll cleanup the maple tree, and stumble over the dst VMA for which we neither performed any reservation nor copied any page tables. Consequently untrack_pfn() will see VM_PAT and try obtaining the PAT information from the page table -- which fails because the page table was not copied. The easiest fix would be to simply clear the VM_PAT flag of the dst VMA if track_pfn_copy() fails. However, the whole thing is about "simply" clearing the VM_PAT flag is shaky as well: if we passed track_pfn_copy() and performed a reservation, but copying the page tables fails, we'll simply clear the VM_PAT flag, not properly undoing the reservation ... which is also wrong. So let's fix it properly: set the VM_PAT flag only if the reservation succeeded (leaving it clear initially), and undo the reservation if anything goes wrong while copying the page tables: clearing the VM_PAT flag after undoing the reservation. Note that any copied page table entries will get zapped when the VMA will get removed later, after copy_page_range() succeeded; as VM_PAT is not set then, we won't try cleaning VM_PAT up once more and untrack_pfn() will be happy. Note that leaving these page tables in place without a reservation is not a problem, as we are aborting fork(); this process will never run. A reproducer can trigger this usually at the first try: https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/reproducers/pat_fork.c WARNING: CPU: 26 PID: 11650 at arch/x86/mm/pat/memtype.c:983 get_pat_info+0xf6/0x110 Modules linked in: ... CPU: 26 UID: 0 PID: 11650 Comm: repro3 Not tainted 6.12.0-rc5+ #92 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 RIP: 0010:get_pat_info+0xf6/0x110 ... Call Trace: <TASK> ... untrack_pfn+0x52/0x110 unmap_single_vma+0xa6/0xe0 unmap_vmas+0x105/0x1f0 exit_mmap+0xf6/0x460 __mmput+0x4b/0x120 copy_process+0x1bf6/0x2aa0 kernel_clone+0xab/0x440 __do_sys_clone+0x66/0x90 do_syscall_64+0x95/0x180 Likely this case was missed in: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed") ... and instead of undoing the reservation we simply cleared the VM_PAT flag. Keep the documentation of these functions in include/linux/pgtable.h, one place is more than sufficient -- we should clean that up for the other functions like track_pfn_remap/untrack_pfn separately. Fixes: d155df53f310 ("x86/mm/pat: clear VM_PAT if copy_p4d_range failed") Fixes: 2ab640379a0a ("x86: PAT: hooks in generic vm code to help archs to track pfnmap regions - v3") Reported-by: xingwei lee <xrivendell7@gmail.com> Reported-by: yuxin wang <wang1315768607@163.com> Reported-by: Marius Fleischer <fleischermarius@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20250321112323.153741-1-david@redhat.com Closes: https://lore.kernel.org/lkml/CABOYnLx_dnqzpCW99G81DmOr+2UzdmZMk=T3uxwNxwz+R1RAwg@mail.gmail.com/ Closes: https://lore.kernel.org/lkml/CAJg=8jwijTP5fre8woS4JVJQ8iUA6v+iNcsOgtj9Zfpc3obDOQ@mail.gmail.com/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-04-10lockdep/mm: Fix might_fault() lockdep check of current->mm->mmap_lockPeter Zijlstra1-2/+0
[ Upstream commit a1b65f3f7c6f7f0a08a7dba8be458c6415236487 ] Turns out that this commit, about 10 years ago: 9ec23531fd48 ("sched/preempt, mm/fault: Trigger might_sleep() in might_fault() with disabled pagefaults") ... accidentally (and unnessecarily) put the lockdep part of __might_fault() under CONFIG_DEBUG_ATOMIC_SLEEP=y. This is potentially notable because large distributions such as Ubuntu are running with !CONFIG_DEBUG_ATOMIC_SLEEP. Restore the debug check. [ mingo: Update changelog. ] Fixes: 9ec23531fd48 ("sched/preempt, mm/fault: Trigger might_sleep() in might_fault() with disabled pagefaults") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/r/20241104135517.536628371@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-18Merge tag 'mm-hotfixes-stable-2025-03-17-20-09' of ↵Linus Torvalds10-30/+77
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "15 hotfixes. 7 are cc:stable and the remainder address post-6.13 issues or aren't considered necessary for -stable kernels. 13 are for MM and the other two are for squashfs and procfs. All are singletons. Please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2025-03-17-20-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/page_alloc: fix memory accept before watermarks gets initialized mm: decline to manipulate the refcount on a slab page memcg: drain obj stock on cpu hotplug teardown mm/huge_memory: drop beyond-EOF folios with the right number of refs selftests/mm: run_vmtests.sh: fix half_ufd_size_MB calculation mm: fix error handling in __filemap_get_folio() with FGP_NOWAIT mm: memcontrol: fix swap counter leak from offline cgroup mm/vma: do not register private-anon mappings with khugepaged during mmap squashfs: fix invalid pointer dereference in squashfs_cache_delete mm/migrate: fix shmem xarray update during migration mm/hugetlb: fix surplus pages in dissolve_free_huge_page() mm/damon/core: initialize damos->walk_completed in damon_new_scheme() mm/damon: respect core layer filters' allowance decision on ops layer filemap: move prefaulting out of hot write path proc: fix UAF in proc_get_inode()
2025-03-17mm/page_alloc: fix memory accept before watermarks gets initializedKirill A. Shutemov1-2/+12
Watermarks are initialized during the postcore initcall. Until then, all watermarks are set to zero. This causes cond_accept_memory() to incorrectly skip memory acceptance because a watermark of 0 is always met. This can lead to a premature OOM on boot. To ensure progress, accept one MAX_ORDER page if the watermark is zero. Link: https://lkml.kernel.org/r/20250310082855.2587122-1-kirill.shutemov@linux.intel.com Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Reported-by: Farrah Chen <farrah.chen@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com> Cc: Ashish Kalra <ashish.kalra@amd.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Mike Rapoport (IBM)" <rppt@kernel.org> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17memcg: drain obj stock on cpu hotplug teardownShakeel Butt1-0/+9
Currently on cpu hotplug teardown, only memcg stock is drained but we need to drain the obj stock as well otherwise we will miss the stats accumulated on the target cpu as well as the nr_bytes cached. The stats include MEMCG_KMEM, NR_SLAB_RECLAIMABLE_B & NR_SLAB_UNRECLAIMABLE_B. In addition we are leaking reference to struct obj_cgroup object. Link: https://lkml.kernel.org/r/20250310230934.2913113-1-shakeel.butt@linux.dev Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/huge_memory: drop beyond-EOF folios with the right number of refsZi Yan1-1/+1
When an after-split folio is large and needs to be dropped due to EOF, folio_put_refs(folio, folio_nr_pages(folio)) should be used to drop all page cache refs. Otherwise, the folio will not be freed, causing memory leak. This leak would happen on a filesystem with blocksize > page_size and a truncate is performed, where the blocksize makes folios split to >0 order ones, causing truncated folios not being freed. Link: https://lkml.kernel.org/r/20250310155727.472846-1-ziy@nvidia.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Hugh Dickins <hughd@google.com> Closes: https://lore.kernel.org/all/fcbadb7f-dd3e-21df-f9a7-2853b53183c4@google.com/ Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Yu Zhao <yuzhao@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: fix error handling in __filemap_get_folio() with FGP_NOWAITRaphael S. Carvalho1-1/+12
original report: https://lore.kernel.org/all/CAKhLTr1UL3ePTpYjXOx2AJfNk8Ku2EdcEfu+CH1sf3Asr=B-Dw@mail.gmail.com/T/ When doing buffered writes with FGP_NOWAIT, under memory pressure, the system returned ENOMEM despite there being plenty of available memory, to be reclaimed from page cache. The user space used io_uring interface, which in turn submits I/O with FGP_NOWAIT (the fast path). retsnoop pointed to iomap_get_folio: 00:34:16.180612 -> 00:34:16.180651 TID/PID 253786/253721 (reactor-1/combined_tests): entry_SYSCALL_64_after_hwframe+0x76 do_syscall_64+0x82 __do_sys_io_uring_enter+0x265 io_submit_sqes+0x209 io_issue_sqe+0x5b io_write+0xdd xfs_file_buffered_write+0x84 iomap_file_buffered_write+0x1a6 32us [-ENOMEM] iomap_write_begin+0x408 iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x… pos=0 len=4096 foliop=0xffffb32c296b7b80 ! 4us [-ENOMEM] iomap_get_folio iter=&{.inode=0xffff8c67aa031138,.len=4096,.flags=33,.iomap={.addr=0xffffffffffffffff,.length=4096,.type=1,.flags=3,.bdev=0x… pos=0 len=4096 This is likely a regression caused by 66dabbb65d67 ("mm: return an ERR_PTR from __filemap_get_folio"), which moved error handling from io_map_get_folio() to __filemap_get_folio(), but broke FGP_NOWAIT handling, so ENOMEM is being escaped to user space. Had it correctly returned -EAGAIN with NOWAIT, either io_uring or user space itself would be able to retry the request. It's not enough to patch io_uring since the iomap interface is the one responsible for it, and pwritev2(RWF_NOWAIT) and AIO interfaces must return the proper error too. The patch was tested with scylladb test suite (its original reproducer), and the tests all pass now when memory is pressured. Link: https://lkml.kernel.org/r/20250224143700.23035-1-raphaelsc@scylladb.com Fixes: 66dabbb65d67 ("mm: return an ERR_PTR from __filemap_get_folio") Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm: memcontrol: fix swap counter leak from offline cgroupMuchun Song2-5/+6
Commit 6769183166b3 removed the parameter of id from swap_cgroup_record() and get the memcg id from mem_cgroup_id(folio_memcg(folio)). However, the caller of it may update a different memcg's counter instead of folio_memcg(folio). E.g. in the caller of mem_cgroup_swapout(), @swap_memcg could be different with @memcg and update the counter of @swap_memcg, but swap_cgroup_record() records the wrong memcg's ID. When it is uncharged from __mem_cgroup_uncharge_swap(), the swap counter will leak since the wrong recorded ID. Fix it by bringing the parameter of id back. Link: https://lkml.kernel.org/r/20250306023133.44838-1-songmuchun@bytedance.com Fixes: 6769183166b3 ("mm/swap_cgroup: decouple swap cgroup recording and clearing") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Kairui Song <kasong@tencent.com> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/vma: do not register private-anon mappings with khugepaged during mmapDev Jain1-1/+2
We already are registering private-anon VMAs with khugepaged during fault time, in do_huge_pmd_anonymous_page(). Commit "register suitable readonly file vmas for khugepaged" moved the khugepaged registration logic from shmem_mmap to the generic mmap path. The userspace-visible effect should be this: khugepaged will unnecessarily scan mm's which haven't yet faulted in. Note that it won't actually collapse because all PTEs are none. Now that I think about it, the mm is going to have a file VMA anyways during fork+exec, so the mm already gets registered during mmap due to the non-anon case (I *think*), so at least one of either the mmap registration or fault-time registration is redundant. Make this logic specific for non-anon mappings. Link: https://lkml.kernel.org/r/20250306063037.16299-1-dev.jain@arm.com Fixes: 613bec092fe7 ("mm: mmap: register suitable readonly file vmas for khugepaged") Signed-off-by: Dev Jain <dev.jain@arm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/migrate: fix shmem xarray update during migrationZi Yan1-6/+4
A shmem folio can be either in page cache or in swap cache, but not at the same time. Namely, once it is in swap cache, folio->mapping should be NULL, and the folio is no longer in a shmem mapping. In __folio_migrate_mapping(), to determine the number of xarray entries to update, folio_test_swapbacked() is used, but that conflates shmem in page cache case and shmem in swap cache case. It leads to xarray multi-index entry corruption, since it turns a sibling entry to a normal entry during xas_store() (see [1] for a userspace reproduction). Fix it by only using folio_test_swapcache() to determine whether xarray is storing swap cache entries or not to choose the right number of xarray entries to update. [1] https://lore.kernel.org/linux-mm/Z8idPCkaJW1IChjT@casper.infradead.org/ Note: In __split_huge_page(), folio_test_anon() && folio_test_swapcache() is used to get swap_cache address space, but that ignores the shmem folio in swap cache case. It could lead to NULL pointer dereferencing when a in-swap-cache shmem folio is split at __xa_store(), since !folio_test_anon() is true and folio->mapping is NULL. But fortunately, its caller split_huge_page_to_list_to_order() bails out early with EBUSY when folio->mapping is NULL. So no need to take care of it here. Link: https://lkml.kernel.org/r/20250305200403.2822855-1-ziy@nvidia.com Fixes: fc346d0a70a1 ("mm: migrate high-order folios in swap cache correctly") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: Liu Shixin <liushixin2@huawei.com> Closes: https://lore.kernel.org/all/28546fb4-5210-bf75-16d6-43e1f8646080@huawei.com/ Suggested-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Charan Teja Kalla <quic_charante@quicinc.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/hugetlb: fix surplus pages in dissolve_free_huge_page()Jinjiang Tu1-2/+6
In dissolve_free_huge_page(), free huge pages are dissolved without adjusting surplus count. However, free huge pages may be accounted as surplus pages, and will lead to wrong surplus count. I reproduce this issue on qemu. The steps are: 1) Node1 is memory-less at first. Hot-add memory to node1 by executing the two commands in qemu monitor: object_add memory-backend-ram,id=mem1,size=1G device_add pc-dimm,id=dimm1,memdev=mem1,node=1 2) online one memory block of Node1 with: echo online_movable > /sys/devices/system/node/node1/memoryX/state 3) create 64 huge pages for node1 4) run a program to reserve (don't consume) all the huge pages 5) echo 0 > nr_huge_pages for node1. After this step, free huge pages in Node1 are surplus. 6) create 80 huge pages for node0 7) offline memory of node1, The memory range to offline contains the free surplus huge pages created in step3) ~ step5) echo offline > /sys/devices/system/node/node1/memoryX/state 8) kill the program in step 4) The result: Node0 Node1 total 80 0 free 80 0 surplus 0 61 To fix it, adjust surplus when destroying huge pages if the node has surplus pages in dissolve_free_hugetlb_folio(). The result with this patch: Node0 Node1 total 80 0 free 80 0 surplus 0 0 Link: https://lkml.kernel.org/r/20250304132106.2872754-1-tujinjiang@huawei.com Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage") Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Jinjiang Tu <tujinjiang@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nanyong Sun <sunnanyong@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-03-17mm/damon/core: initialize damos->walk_completed in damon_new_scheme()SeongJae Park1-0/+1
The function for allocating and initialize a 'struct damos' object, damon_new_scheme(), is not initializing damos->walk_completed field. Only damos_walk_complete() is setting the field. Hence the field will be eventually set and used correctly from second damos_walk() call for the scheme. But the first damos_walk() could mistakenly not walk on the regions. Actually, a common usage of DAMOS for taking an access pattern snapshot is installing a monitoring-purpose DAMOS scheme, doing damos_walk() to retrieve the snapshot, and then removing the scheme. DAMON user-space tool (damo) also gets runtime snapshot in the way. Hence the problem can continuously happen in such use cases. Initialize it properly in the allocation function. Link: https://lkml.kernel.org/r/20250228174450.41472-1-sj@kernel.org Fixes: bf0eaba0ff9c ("mm/damon/core: implement damos_walk()") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>