kernel/linux.git/mm/memory.c, branch v5.15.209

mm: fix apply_to_existing_page_range()

2025-05-02T05:44:24+00:00

commit a995199384347261bb3f21b2e171fa7f988bd2f8 upstream. In the case of apply_to_existing_page_range(), apply_to_pte_range() is reached with 'create' set to false. When !create, the loop over the PTE page table is broken. apply_to_pte_range() will only move to the next PTE entry if 'create' is true or if the current entry is not pte_none(). This means that the user of apply_to_existing_page_range() will not have 'fn' called for any entries after the first pte_none() in the PTE page table. Fix the loop logic in apply_to_pte_range(). There are no known runtime issues from this, but the fix is trivial enough for stable@ even without a known buggy user. Link: https://lkml.kernel.org/r/20250409094043.1629234-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov Fixes: be1db4753ee6 ("mm/memory.c: add apply_to_existing_page_range() helper") Cc: Daniel Axtens Cc: David Hildenbrand Cc: Vlastimil Babka Cc: Signed-off-by: Andrew Morton Signed-off-by: Kirill A. Shutemov Signed-off-by: Greg Kroah-Hartman

mm: don't skip arch_sync_kernel_mappings() in error paths

2025-03-13T11:51:03+00:00

commit 3685024edd270f7c791f993157d65d3c928f3d6e upstream. Fix callers that previously skipped calling arch_sync_kernel_mappings() if an error occurred during a pgtable update. The call is still required to sync any pgtable updates that may have occurred prior to hitting the error condition. These are theoretical bugs discovered during code review. Link: https://lkml.kernel.org/r/20250226121610.2401743-1-ryan.roberts@arm.com Fixes: 2ba3e6947aed ("mm/vmalloc: track which page-table levels were modified") Fixes: 0c95cba49255 ("mm: apply_to_pte_range warn and fail if a large pte is encountered") Signed-off-by: Ryan Roberts Reviewed-by: Anshuman Khandual Reviewed-by: Catalin Marinas Cc: Christop Hellwig Cc: "Uladzislau Rezki (Sony)" Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

mm/memory: add non-anonymous page check in the copy_present_page()

2024-11-17T14:06:26+00:00

The vma->anon_vma of the child process may be NULL because the entire vma does not contain anonymous pages. In this case, a BUG will occur when the copy_present_page() passes a copy of a non-anonymous page of that vma to the page_add_new_anon_rmap() to set up new anonymous rmap. ------------[ cut here ]------------ kernel BUG at mm/rmap.c:1052! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: CPU: 4 PID: 4652 Comm: test Not tainted 5.15.75 #1 Hardware name: linux,dummy-virt (DT) pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : __page_set_anon_rmap+0xc0/0xe8 lr : __page_set_anon_rmap+0xc0/0xe8 sp : ffff80000e773860 x29: ffff80000e773860 x28: fffffc13cf006ec0 x27: ffff04f3ccd68000 x26: ffff04f3c5c33248 x25: 0000000010100073 x24: ffff04f3c53c0a80 x23: 0000000020000000 x22: 0000000000000001 x21: 0000000020000000 x20: fffffc13cf006ec0 x19: 0000000000000000 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : ffffdddc5581377c x8 : 0000000000000000 x7 : 0000000000000011 x6 : ffff2717a8433000 x5 : ffff80000e773810 x4 : ffffdddc55400000 x3 : 0000000000000000 x2 : ffffdddc56b20000 x1 : ffff04f3c9a48040 x0 : 0000000000000000 Call trace: __page_set_anon_rmap+0xc0/0xe8 page_add_new_anon_rmap+0x13c/0x200 copy_pte_range+0x6b8/0x1018 copy_page_range+0x3a8/0x5e0 dup_mmap+0x3a0/0x6e8 dup_mm+0x78/0x140 copy_process+0x1528/0x1b08 kernel_clone+0xac/0x610 __do_sys_clone+0x78/0xb0 __arm64_sys_clone+0x30/0x40 invoke_syscall+0x68/0x170 el0_svc_common.constprop.0+0x80/0x250 do_el0_svc+0x48/0xb8 el0_svc+0x48/0x1a8 el0t_64_sync_handler+0xb0/0xb8 el0t_64_sync+0x1a0/0x1a4 Code: 97f899f4 f9400273 17ffffeb 97f899f1 (d4210000) ---[ end trace dc65e5edd0f362fa ]--- Kernel panic - not syncing: Oops - BUG: Fatal exception SMP: stopping secondary CPUs Kernel Offset: 0x5ddc4d400000 from 0xffff800008000000 PHYS_OFFSET: 0xfffffb0c80000000 CPU features: 0x44000cf1,00000806 Memory Limit: none ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]--- This problem has been fixed by the commit ("mm/rmap: split page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()"), but still exists in the linux-5.15.y branch. This patch is not applicable to this version because of the large version differences. Therefore, fix it by adding non-anonymous page check in the copy_present_page(). Cc: stable@vger.kernel.org Fixes: 70e806e4e645 ("mm: Do early cow for pinned pages during fork() for ptes") Signed-off-by: Yuanzheng Song Signed-off-by: Vlastimil Babka Reviewed-by: David Hildenbrand Signed-off-by: Greg Kroah-Hartman

mm: avoid leaving partial pfn mappings around in error case

2024-10-17T13:10:34+00:00

commit 79a61cc3fc0466ad2b7b89618a6157785f0293b3 upstream. As Jann points out, PFN mappings are special, because unlike normal memory mappings, there is no lifetime information associated with the mapping - it is just a raw mapping of PFNs with no reference counting of a 'struct page'. That's all very much intentional, but it does mean that it's easy to mess up the cleanup in case of errors. Yes, a failed mmap() will always eventually clean up any partial mappings, but without any explicit lifetime in the page table mapping itself, it's very easy to do the error handling in the wrong order. In particular, it's easy to mistakenly free the physical backing store before the page tables are actually cleaned up and (temporarily) have stale dangling PTE entries. To make this situation less error-prone, just make sure that any partial pfn mapping is torn down early, before any other error handling. Reported-and-tested-by: Jann Horn Cc: Andrew Morton Cc: Jason Gunthorpe Cc: Simona Vetter Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

mm/numa: no task_numa_fault() call if PTE is changed

2024-09-04T11:23:37+00:00

commit 40b760cfd44566bca791c80e0720d70d75382b84 upstream. When handling a numa page fault, task_numa_fault() should be called by a process that restores the page table of the faulted folio to avoid duplicated stats counting. Commit b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault") restructured do_numa_page() and did not avoid task_numa_fault() call in the second page table check after a numa migration failure. Fix it by making all !pte_same() return immediately. This issue can cause task_numa_fault() being called more than necessary and lead to unexpected numa balancing results (It is hard to tell whether the issue will cause positive or negative performance impact due to duplicated numa fault counting). Link: https://lkml.kernel.org/r/20240809145906.1513458-2-ziy@nvidia.com Fixes: b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault") Signed-off-by: Zi Yan Reported-by: "Huang, Ying" Closes: https://lore.kernel.org/linux-mm/87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.intel.com/ Acked-by: David Hildenbrand Cc: Baolin Wang Cc: Kefeng Wang Cc: Mel Gorman Cc: Yang Shi Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

x86/mm/pat: fix VM_PAT handling in COW mappings

2024-04-13T11:01:47+00:00

commit 04c35ab3bdae7fefbd7c7a7355f29fa03a035221 upstream. PAT handling won't do the right thing in COW mappings: the first PTE (or, in fact, all PTEs) can be replaced during write faults to point at anon folios. Reliably recovering the correct PFN and cachemode using follow_phys() from PTEs will not work in COW mappings. Using follow_phys(), we might just get the address+protection of the anon folio (which is very wrong), or fail on swap/nonswap entries, failing follow_phys() and triggering a WARN_ON_ONCE() in untrack_pfn() and track_pfn_copy(), not properly calling free_pfn_range(). In free_pfn_range(), we either wouldn't call memtype_free() or would call it with the wrong range, possibly leaking memory. To fix that, let's update follow_phys() to refuse returning anon folios, and fallback to using the stored PFN inside vma->vm_pgoff for COW mappings if we run into that. We will now properly handle untrack_pfn() with COW mappings, where we don't need the cachemode. We'll have to fail fork()->track_pfn_copy() if the first page was replaced by an anon folio, though: we'd have to store the cachemode in the VMA to make this work, likely growing the VMA size. For now, lets keep it simple and let track_pfn_copy() just fail in that case: it would have failed in the past with swap/nonswap entries already, and it would have done the wrong thing with anon folios. Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn(): <--- C reproducer ---> #include #include #include #include int main(void) { struct io_uring_params p = {}; int ring_fd; size_t size; char *map; ring_fd = io_uring_setup(1, &p); if (ring_fd < 0) { perror("io_uring_setup"); return 1; } size = p.sq_off.array + p.sq_entries * sizeof(unsigned); /* Map the submission queue ring MAP_PRIVATE */ map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, ring_fd, IORING_OFF_SQ_RING); if (map == MAP_FAILED) { perror("mmap"); return 1; } /* We have at least one page. Let's COW it. */ *map = 0; pause(); return 0; } <--- C reproducer ---> On a system with 16 GiB RAM and swap configured: # ./iouring & # memhog 16G # killall iouring [ 301.552930] ------------[ cut here ]------------ [ 301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100 [ 301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g [ 301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1 [ 301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4 [ 301.559569] RIP: 0010:untrack_pfn+0xf4/0x100 [ 301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000 [ 301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282 [ 301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047 [ 301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200 [ 301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000 [ 301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000 [ 301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000 [ 301.564186] FS: 0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000 [ 301.564773] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0 [ 301.565725] PKRU: 55555554 [ 301.565944] Call Trace: [ 301.566148] [ 301.566325] ? untrack_pfn+0xf4/0x100 [ 301.566618] ? __warn+0x81/0x130 [ 301.566876] ? untrack_pfn+0xf4/0x100 [ 301.567163] ? report_bug+0x171/0x1a0 [ 301.567466] ? handle_bug+0x3c/0x80 [ 301.567743] ? exc_invalid_op+0x17/0x70 [ 301.568038] ? asm_exc_invalid_op+0x1a/0x20 [ 301.568363] ? untrack_pfn+0xf4/0x100 [ 301.568660] ? untrack_pfn+0x65/0x100 [ 301.568947] unmap_single_vma+0xa6/0xe0 [ 301.569247] unmap_vmas+0xb5/0x190 [ 301.569532] exit_mmap+0xec/0x340 [ 301.569801] __mmput+0x3e/0x130 [ 301.570051] do_exit+0x305/0xaf0 ... Link: https://lkml.kernel.org/r/20240403212131.929421-3-david@redhat.com Signed-off-by: David Hildenbrand Reported-by: Wupeng Ma Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com Fixes: b1a86e15dc03 ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines") Fixes: 5899329b1910 ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3") Acked-by: Ingo Molnar Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Signed-off-by: Andrew Morton Signed-off-by: David Hildenbrand Signed-off-by: Greg Kroah-Hartman

mm: fix unmap_mapping_range high bits shift bug

2024-01-15T17:51:23+00:00

commit 9eab0421fa94a3dde0d1f7e36ab3294fc306c99d upstream. The bug happens when highest bit of holebegin is 1, suppose holebegin is 0x8000000111111000, after shift, hba would be 0xfff8000000111111, then vma_interval_tree_foreach would look it up fail or leads to the wrong result. error call seq e.g.: - mmap(..., offset=0x8000000111111000) |- syscall(mmap, ... unsigned long, off): |- ksys_mmap_pgoff( ... , off >> PAGE_SHIFT); here pgoff is correctly shifted to 0x8000000111111, but pass 0x8000000111111000 as holebegin to unmap would then cause terrible result, as shown below: - unmap_mapping_range(..., loff_t const holebegin) |- pgoff_t hba = holebegin >> PAGE_SHIFT; /* hba = 0xfff8000000111111 unexpectedly */ The issue happens in Heterogeneous computing, where the device(e.g. gpu) and host share the same virtual address space. A simple workflow pattern which hit the issue is: /* host */ 1. userspace first mmap a file backed VA range with specified offset. e.g. (offset=0x800..., mmap return: va_a) 2. write some data to the corresponding sys page e.g. (va_a = 0xAABB) /* device */ 3. gpu workload touches VA, triggers gpu fault and notify the host. /* host */ 4. reviced gpu fault notification, then it will: 4.1 unmap host pages and also takes care of cpu tlb (use unmap_mapping_range with offset=0x800...) 4.2 migrate sys page to device 4.3 setup device page table and resolve device fault. /* device */ 5. gpu workload continued, it accessed va_a and got 0xAABB. 6. gpu workload continued, it wrote 0xBBCC to va_a. /* host */ 7. userspace access va_a, as expected, it will: 7.1 trigger cpu vm fault. 7.2 driver handling fault to migrate gpu local page to host. 8. userspace then could correctly get 0xBBCC from va_a 9. done But in step 4.1, if we hit the bug this patch mentioned, then userspace would never trigger cpu fault, and still get the old value: 0xAABB. Making holebegin unsigned first fixes the bug. Link: https://lkml.kernel.org/r/20231220052839.26970-1-jiajun.xie.sh@gmail.com Signed-off-by: Jiajun Xie Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

mm, hwpoison: when copy-on-write hits poison, take page offline

2023-07-05T17:25:04+00:00

From: Tony Luck commit d302c2398ba269e788a4f37ae57c07a7fcabaa42 upstream. Cannot call memory_failure() directly from the fault handler because mmap_lock (and others) are held. It is important, but not urgent, to mark the source page as h/w poisoned and unmap it from other tasks. Use memory_failure_queue() to request a call to memory_failure() for the page with the error. Also provide a stub version for CONFIG_MEMORY_FAILURE=n Cc: Link: https://lkml.kernel.org/r/20221021200120.175753-3-tony.luck@intel.com Signed-off-by: Tony Luck Reviewed-by: Miaohe Lin Cc: Christophe Leroy Cc: Dan Williams Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Naoya Horiguchi Cc: Nicholas Piggin Cc: Shuai Xue Signed-off-by: Andrew Morton [ Due to missing commits e591ef7d96d6e ("mm,hwpoison,hugetlb,memory_hotplug: hotremove memory section with hwpoisoned hugepage") 5033091de814a ("mm/hwpoison: introduce per-memory_block hwpoison counter") The impact of e591ef7d96d6e is its introduction of an additional flag in __get_huge_page_for_hwpoison() that serves as an indication a hwpoisoned hugetlb page should have its migratable bit cleared. The impact of 5033091de814a is contexual. Resolve by ignoring both missing commits. - jane] Signed-off-by: Jane Chu Signed-off-by: Greg Kroah-Hartman

mm, hwpoison: try to recover from copy-on write faults

2023-07-05T17:25:04+00:00

commit a873dfe1032a132bf89f9e19a6ac44f5a0b78754 upstream. Patch series "Copy-on-write poison recovery", v3. Part 1 deals with the process that triggered the copy on write fault with a store to a shared read-only page. That process is send a SIGBUS with the usual machine check decoration to specify the virtual address of the lost page, together with the scope. Part 2 sets up to asynchronously take the page with the uncorrected error offline to prevent additional machine check faults. H/t to Miaohe Lin and Shuai Xue for pointing me to the existing function to queue a call to memory_failure(). On x86 there is some duplicate reporting (because the error is also signalled by the memory controller as well as by the core that triggered the machine check). Console logs look like this: This patch (of 2): If the kernel is copying a page as the result of a copy-on-write fault and runs into an uncorrectable error, Linux will crash because it does not have recovery code for this case where poison is consumed by the kernel. It is easy to set up a test case. Just inject an error into a private page, fork(2), and have the child process write to the page. I wrapped that neatly into a test at: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git just enable ACPI error injection and run: # ./einj_mem-uc -f copy-on-write Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel() on architectures where that is available (currently x86 and powerpc). When an error is detected during the page copy, return VM_FAULT_HWPOISON to caller of wp_page_copy(). This propagates up the call stack. Both x86 and powerpc have code in their fault handler to deal with this code by sending a SIGBUS to the application. Note that this patch avoids a system crash and signals the process that triggered the copy-on-write action. It does not take any action for the memory error that is still in the shared page. To handle that a call to memory_failure() is needed. But this cannot be done from wp_page_copy() because it holds mmap_lock(). Perhaps the architecture fault handlers can deal with this loose end in a subsequent patch? On Intel/x86 this loose end will often be handled automatically because the memory controller provides an additional notification of the h/w poison in memory, the handler for this will call memory_failure(). This isn't a 100% solution. If there are multiple errors, not all may be logged in this way. Cc: [tony.luck@intel.com: add call to kmsan_unpoison_memory(), per Miaohe Lin] Link: https://lkml.kernel.org/r/20221031201029.102123-2-tony.luck@intel.com Link: https://lkml.kernel.org/r/20221021200120.175753-1-tony.luck@intel.com Link: https://lkml.kernel.org/r/20221021200120.175753-2-tony.luck@intel.com Signed-off-by: Tony Luck Reviewed-by: Dan Williams Reviewed-by: Naoya Horiguchi Reviewed-by: Miaohe Lin Reviewed-by: Alexander Potapenko Tested-by: Shuai Xue Cc: Christophe Leroy Cc: Matthew Wilcox (Oracle) Cc: Michael Ellerman Cc: Nicholas Piggin Signed-off-by: Andrew Morton [ Due to missing commits c89357e27f20d ("mm: support GUP-triggered unsharing of anonymous pages") 662ce1dc9caf4 ("delayacct: track delays from write-protect copy") b073d7f8aee4e ("mm: kmsan: maintain KMSAN metadata for page operations") The impact of c89357e27f20d is a name change from cow_user_page() to __wp_page_copy_user(). The impact of 662ce1dc9caf4 is the introduction of a new feature of tracking write-protect copy in delayacct. The impact of b073d7f8aee4e is an introduction of KASAN feature. None of these commits establishes meaningful dependency, hence resolve by ignoring them. - jane] Signed-off-by: Jane Chu Signed-off-by: Greg Kroah-Hartman

mm: take a page reference when removing device exclusive entries

2023-04-13T14:48:26+00:00

commit 7c7b962938ddda6a9cd095de557ee5250706ea88 upstream. Device exclusive page table entries are used to prevent CPU access to a page whilst it is being accessed from a device. Typically this is used to implement atomic operations when the underlying bus does not support atomic access. When a CPU thread encounters a device exclusive entry it locks the page and restores the original entry after calling mmu notifiers to signal drivers that exclusive access is no longer available. The device exclusive entry holds a reference to the page making it safe to access the struct page whilst the entry is present. However the fault handling code does not hold the PTL when taking the page lock. This means if there are multiple threads faulting concurrently on the device exclusive entry one will remove the entry whilst others will wait on the page lock without holding a reference. This can lead to threads locking or waiting on a folio with a zero refcount. Whilst mmap_lock prevents the pages getting freed via munmap() they may still be freed by a migration. This leads to warnings such as PAGE_FLAGS_CHECK_AT_FREE due to the page being locked when the refcount drops to zero. Fix this by trying to take a reference on the folio before locking it. The code already checks the PTE under the PTL and aborts if the entry is no longer there. It is also possible the folio has been unmapped, freed and re-allocated allowing a reference to be taken on an unrelated folio. This case is also detected by the PTE check and the folio is unlocked without further changes. Link: https://lkml.kernel.org/r/20230330012519.804116-1-apopple@nvidia.com Fixes: b756a3b5e7ea ("mm: device exclusive memory access") Signed-off-by: Alistair Popple Reviewed-by: Ralph Campbell Reviewed-by: John Hubbard Acked-by: David Hildenbrand Cc: Matthew Wilcox (Oracle) Cc: Christoph Hellwig Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman