summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2022-09-15Revert "mm: kmemleak: take a full lowmem check in kmemleak_*_phys()"Yee Lee1-4/+4
This reverts commit 23c2d497de21f25898fbea70aeb292ab8acc8c94. Commit 23c2d497de21 ("mm: kmemleak: take a full lowmem check in kmemleak_*_phys()") brought false leak alarms on some archs like arm64 that does not init pfn boundary in early booting. The final solution lands on linux-6.0: commit 0c24e061196c ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA"). Revert this commit before linux-6.0. The original issue of invalid PA can be mitigated by additional check in devicetree. The false alarm report is as following: Kmemleak output: (Qemu/arm64) unreferenced object 0xffff0000c0170a00 (size 128): comm "swapper/0", pid 1, jiffies 4294892404 (age 126.208s) hex dump (first 32 bytes): 62 61 73 65 00 00 00 00 00 00 00 00 00 00 00 00 base............ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<(____ptrval____)>] __kmalloc_track_caller+0x1b0/0x2e4 [<(____ptrval____)>] kstrdup_const+0x8c/0xc4 [<(____ptrval____)>] kvasprintf_const+0xbc/0xec [<(____ptrval____)>] kobject_set_name_vargs+0x58/0xe4 [<(____ptrval____)>] kobject_add+0x84/0x100 [<(____ptrval____)>] __of_attach_node_sysfs+0x78/0xec [<(____ptrval____)>] of_core_init+0x68/0x104 [<(____ptrval____)>] driver_init+0x28/0x48 [<(____ptrval____)>] do_basic_setup+0x14/0x28 [<(____ptrval____)>] kernel_init_freeable+0x110/0x178 [<(____ptrval____)>] kernel_init+0x20/0x1a0 [<(____ptrval____)>] ret_from_fork+0x10/0x20 This pacth is also applicable to linux-5.17.y/linux-5.18.y/linux-5.19.y Cc: <stable@vger.kernel.org> Signed-off-by: Yee Lee <yee.lee@mediatek.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-09-08mm: pagewalk: Fix race between unmap and page walkerSteven Price2-11/+14
[ Upstream commit 8782fb61cc848364e1e1599d76d3c9dd58a1cc06 ] The mmap lock protects the page walker from changes to the page tables during the walk. However a read lock is insufficient to protect those areas which don't have a VMA as munmap() detaches the VMAs before downgrading to a read lock and actually tearing down PTEs/page tables. For users of walk_page_range() the solution is to simply call pte_hole() immediately without checking the actual page tables when a VMA is not present. We now never call __walk_page_range() without a valid vma. For walk_page_range_novma() the locking requirements are tightened to require the mmap write lock to be taken, and then walking the pgd directly with 'no_vma' set. This in turn means that all page walkers either have a valid vma, or it's that special 'novma' case for page table debugging. As a result, all the odd '(!walk->vma && !walk->no_vma)' tests can be removed. Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Steven Price <steven.price@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-08mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding ↵Waiman Long1-16/+29
slab_mutex/cpu_hotplug_lock [ Upstream commit 0495e337b7039191dfce6e03f5f830454b1fae6b ] A circular locking problem is reported by lockdep due to the following circular locking dependency. +--> cpu_hotplug_lock --> slab_mutex --> kn->active --+ | | +-----------------------------------------------------+ The forward cpu_hotplug_lock ==> slab_mutex ==> kn->active dependency happens in kmem_cache_destroy(): cpus_read_lock(); mutex_lock(&slab_mutex); ==> sysfs_slab_unlink() ==> kobject_del() ==> kernfs_remove() ==> __kernfs_remove() ==> kernfs_drain(): rwsem_acquire(&kn->dep_map, ...); The backward kn->active ==> cpu_hotplug_lock dependency happens in kernfs_fop_write_iter(): kernfs_get_active(); ==> slab_attr_store() ==> cpu_partial_store() ==> flush_all(): cpus_read_lock() One way to break this circular locking chain is to avoid holding cpu_hotplug_lock and slab_mutex while deleting the kobject in sysfs_slab_unlink() which should be equivalent to doing a write_lock and write_unlock pair of the kn->active virtual lock. Since the kobject structures are not protected by slab_mutex or the cpu_hotplug_lock, we can certainly release those locks before doing the delete operation. Move sysfs_slab_unlink() and sysfs_slab_release() to the newly created kmem_cache_release() and call it outside the slab_mutex & cpu_hotplug_lock critical sections. There will be a slight delay in the deletion of sysfs files if kmem_cache_release() is called indirectly from a work function. Fixes: 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Rientjes <rientjes@google.com> Link: https://lore.kernel.org/all/YwOImVd+nRUsSAga@hyeyoo/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-09-05mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuseJann Horn1-13/+16
commit 2555283eb40df89945557273121e9393ef9b542b upstream. anon_vma->degree tracks the combined number of child anon_vmas and VMAs that use the anon_vma as their ->anon_vma. anon_vma_clone() then assumes that for any anon_vma attached to src->anon_vma_chain other than src->anon_vma, it is impossible for it to be a leaf node of the VMA tree, meaning that for such VMAs ->degree is elevated by 1 because of a child anon_vma, meaning that if ->degree equals 1 there are no VMAs that use the anon_vma as their ->anon_vma. This assumption is wrong because the ->degree optimization leads to leaf nodes being abandoned on anon_vma_clone() - an existing anon_vma is reused and no new parent-child relationship is created. So it is possible to reuse an anon_vma for one VMA while it is still tied to another VMA. This is an issue because is_mergeable_anon_vma() and its callers assume that if two VMAs have the same ->anon_vma, the list of anon_vmas attached to the VMAs is guaranteed to be the same. When this assumption is violated, vma_merge() can merge pages into a VMA that is not attached to the corresponding anon_vma, leading to dangling page->mapping pointers that will be dereferenced during rmap walks. Fix it by separately tracking the number of child anon_vmas and the number of VMAs using the anon_vma as their ->anon_vma. Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Cc: stable@kernel.org Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31mm/hugetlb: fix hugetlb not supporting softdirty trackingDavid Hildenbrand1-2/+6
commit f96f7a40874d7c746680c0b9f57cef2262ae551f upstream. Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2. I observed that hugetlb does not support/expect write-faults in shared mappings that would have to map the R/O-mapped page writable -- and I found two case where we could currently get such faults and would erroneously map an anon page into a shared mapping. Reproducers part of the patches. I propose to backport both fixes to stable trees. The first fix needs a small adjustment. This patch (of 2): Staring at hugetlb_wp(), one might wonder where all the logic for shared mappings is when stumbling over a write-protected page in a shared mapping. In fact, there is none, and so far we thought we could get away with that because e.g., mprotect() should always do the right thing and map all pages directly writable. Looks like we were wrong: -------------------------------------------------------------------------- #include <stdio.h> #include <stdlib.h> #include <string.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #include <sys/mman.h> #define HUGETLB_SIZE (2 * 1024 * 1024u) static void clear_softdirty(void) { int fd = open("/proc/self/clear_refs", O_WRONLY); const char *ctrl = "4"; int ret; if (fd < 0) { fprintf(stderr, "open(clear_refs) failed\n"); exit(1); } ret = write(fd, ctrl, strlen(ctrl)); if (ret != strlen(ctrl)) { fprintf(stderr, "write(clear_refs) failed\n"); exit(1); } close(fd); } int main(int argc, char **argv) { char *map; int fd; fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT); if (!fd) { fprintf(stderr, "open() failed\n"); return -errno; } if (ftruncate(fd, HUGETLB_SIZE)) { fprintf(stderr, "ftruncate() failed\n"); return -errno; } map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (map == MAP_FAILED) { fprintf(stderr, "mmap() failed\n"); return -errno; } *map = 0; if (mprotect(map, HUGETLB_SIZE, PROT_READ)) { fprintf(stderr, "mmprotect() failed\n"); return -errno; } clear_softdirty(); if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) { fprintf(stderr, "mmprotect() failed\n"); return -errno; } *map = 0; return 0; } -------------------------------------------------------------------------- Above test fails with SIGBUS when there is only a single free hugetlb page. # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # ./test Bus error (core dumped) And worse, with sufficient free hugetlb pages it will map an anonymous page into a shared mapping, for example, messing up accounting during unmap and breaking MAP_SHARED semantics: # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # ./test # cat /proc/meminfo | grep HugePages_ HugePages_Total: 2 HugePages_Free: 1 HugePages_Rsvd: 18446744073709551615 HugePages_Surp: 0 Reason in this particular case is that vma_wants_writenotify() will return "true", removing VM_SHARED in vma_set_page_prot() to map pages write-protected. Let's teach vma_wants_writenotify() that hugetlb does not support softdirty tracking. Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com Fixes: 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Feiner <pfeiner@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> [3.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31shmem: update folio if shmem_replace_page() updates the pageMatthew Wilcox (Oracle)1-0/+1
commit 9dfb3b8d655022760ca68af11821f1c63aa547c3 upstream. If we allocate a new page, we need to make sure that our folio matches that new page. If we do end up in this code path, we store the wrong page in the shmem inode's page cache, and I would rather imagine that data corruption ensues. This will be solved by changing shmem_replace_page() to shmem_replace_folio(), but this is the minimal fix. Link: https://lkml.kernel.org/r/20220730042518.1264767-1-willy@infradead.org Fixes: da08e9b79323 ("mm/shmem: convert shmem_swapin_page() to shmem_swapin_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31mm/mprotect: only reference swap pfn page if type matchPeter Xu1-1/+2
commit 3d2f78f08cd8388035ac375e731ec1ac1b79b09d upstream. Yu Zhao reported a bug after the commit "mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry" added a check in swp_offset_pfn() for swap type [1]: kernel BUG at include/linux/swapops.h:117! CPU: 46 PID: 5245 Comm: EventManager_De Tainted: G S O L 6.0.0-dbg-DEV #2 RIP: 0010:pfn_swap_entry_to_page+0x72/0xf0 Code: c6 48 8b 36 48 83 fe ff 74 53 48 01 d1 48 83 c1 08 48 8b 09 f6 c1 01 75 7b 66 90 48 89 c1 48 8b 09 f6 c1 01 74 74 5d c3 eb 9e <0f> 0b 48 ba ff ff ff ff 03 00 00 00 eb ae a9 ff 0f 00 00 75 13 48 RSP: 0018:ffffa59e73fabb80 EFLAGS: 00010282 RAX: 00000000ffffffe8 RBX: 0c00000000000000 RCX: ffffcd5440000000 RDX: 1ffffffffff7a80a RSI: 0000000000000000 RDI: 0c0000000000042b RBP: ffffa59e73fabb80 R08: ffff9965ca6e8bb8 R09: 0000000000000000 R10: ffffffffa5a2f62d R11: 0000030b372e9fff R12: ffff997b79db5738 R13: 000000000000042b R14: 0c0000000000042b R15: 1ffffffffff7a80a FS: 00007f549d1bb700(0000) GS:ffff99d3cf680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000440d035b3180 CR3: 0000002243176004 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> change_pte_range+0x36e/0x880 change_p4d_range+0x2e8/0x670 change_protection_range+0x14e/0x2c0 mprotect_fixup+0x1ee/0x330 do_mprotect_pkey+0x34c/0x440 __x64_sys_mprotect+0x1d/0x30 It triggers because pfn_swap_entry_to_page() could be called upon e.g. a genuine swap entry. Fix it by only calling it when it's a write migration entry where the page* is used. [1] https://lore.kernel.org/lkml/CAOUHufaVC2Za-p8m0aiHw6YkheDcrO-C3wRGixwDS32VTS+k1w@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220823221138.45602-1-peterx@redhat.com Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Yu Zhao <yuzhao@google.com> Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pteMiaohe Lin1-1/+1
commit ab74ef708dc51df7cf2b8a890b9c6990fac5c0c6 upstream. In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page cache are installed in the ptes. But hugepage_add_new_anon_rmap is called for them mistakenly because they're not vm_shared. This will corrupt the page->mapping used by page cache code. Link: https://lkml.kernel.org/r/20220712130542.18836-1-linmiaohe@huawei.com Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31bootmem: remove the vmemmap pages from kmemleak in put_page_bootmemLiu Shixin1-0/+2
commit dd0ff4d12dd284c334f7e9b07f8f335af856ac78 upstream. The vmemmap pages is marked by kmemleak when allocated from memblock. Remove it from kmemleak when freeing the page. Otherwise, when we reuse the page, kmemleak may report such an error and then stop working. kmemleak: Cannot insert 0xffff98fb6eab3d40 into the object search tree (overlaps existing) kmemleak: Kernel memory leak detector disabled kmemleak: Object 0xffff98fb6be00000 (size 335544320): kmemleak: comm "swapper", pid 0, jiffies 4294892296 kmemleak: min_count = 0 kmemleak: count = 0 kmemleak: flags = 0x1 kmemleak: checksum = 0 kmemleak: backtrace: Link: https://lkml.kernel.org/r/20220819094005.2928241-1-liushixin2@huawei.com Fixes: f41f2ed43ca5 (mm: hugetlb: free the vmemmap pages associated with each HugeTLB page) Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31mm/damon/dbgfs: avoid duplicate context directory creationBadari Pulavarty1-0/+3
commit d26f60703606ab425eee9882b32a1781a8bed74d upstream. When user tries to create a DAMON context via the DAMON debugfs interface with a name of an already existing context, the context directory creation fails but a new context is created and added in the internal data structure, due to absence of the directory creation success check. As a result, memory could leak and DAMON cannot be turned on. An example test case is as below: # cd /sys/kernel/debug/damon/ # echo "off" > monitor_on # echo paddr > target_ids # echo "abc" > mk_context # echo "abc" > mk_context # echo $$ > abc/target_ids # echo "on" > monitor_on <<< fails Return value of 'debugfs_create_dir()' is expected to be ignored in general, but this is an exceptional case as DAMON feature is depending on the debugfs functionality and it has the potential duplicate name issue. This commit therefore fixes the issue by checking the directory creation failure and immediately return the error in the case. Link: https://lkml.kernel.org/r/20220821180853.2400-1-sj@kernel.org Fixes: 75c1c2b53c78 ("mm/damon/dbgfs: support multiple contexts") Signed-off-by: Badari Pulavarty <badari.pulavarty@intel.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [ 5.15.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31writeback: avoid use-after-free after removing deviceKhazhismel Kumykov2-6/+10
commit f87904c075515f3e1d8f4a7115869d3b914674fd upstream. When a disk is removed, bdi_unregister gets called to stop further writeback and wait for associated delayed work to complete. However, wb_inode_writeback_end() may schedule bandwidth estimation dwork after this has completed, which can result in the timer attempting to access the just freed bdi_writeback. Fix this by checking if the bdi_writeback is alive, similar to when scheduling writeback work. Since this requires wb->work_lock, and wb_inode_writeback_end() may get called from interrupt, switch wb->work_lock to an irqsafe lock. Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com Fixes: 45a2966fd641 ("writeback: fix bandwidth estimate for spiky workload") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31mm/hugetlb: support write-faults in shared mappingsDavid Hildenbrand1-7/+19
commit 1d8d14641fd94a01b20a4abbf2749fd8eddcf57b upstream. If we ever get a write-fault on a write-protected page in a shared mapping, we'd be in trouble (again). Instead, we can simply map the page writable. And in fact, there is even a way right now to trigger that code via uffd-wp ever since we stared to support it for shmem in 5.19: -------------------------------------------------------------------------- #include <stdio.h> #include <stdlib.h> #include <string.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #include <sys/mman.h> #include <sys/syscall.h> #include <sys/ioctl.h> #include <linux/userfaultfd.h> #define HUGETLB_SIZE (2 * 1024 * 1024u) static char *map; int uffd; static int temp_setup_uffd(void) { struct uffdio_api uffdio_api; struct uffdio_register uffdio_register; struct uffdio_writeprotect uffd_writeprotect; struct uffdio_range uffd_range; uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY); if (uffd < 0) { fprintf(stderr, "syscall() failed: %d\n", errno); return -errno; } uffdio_api.api = UFFD_API; uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) { fprintf(stderr, "UFFDIO_API failed: %d\n", errno); return -errno; } if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) { fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n"); return -ENOSYS; } /* Register UFFD-WP */ uffdio_register.range.start = (unsigned long) map; uffdio_register.range.len = HUGETLB_SIZE; uffdio_register.mode = UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) { fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno); return -errno; } /* Writeprotect a single page. */ uffd_writeprotect.range.start = (unsigned long) map; uffd_writeprotect.range.len = HUGETLB_SIZE; uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP; if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) { fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno); return -errno; } /* Unregister UFFD-WP without prior writeunprotection. */ uffd_range.start = (unsigned long) map; uffd_range.len = HUGETLB_SIZE; if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) { fprintf(stderr, "UFFDIO_UNREGISTER failed: %d\n", errno); return -errno; } return 0; } int main(int argc, char **argv) { int fd; fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT); if (!fd) { fprintf(stderr, "open() failed\n"); return -errno; } if (ftruncate(fd, HUGETLB_SIZE)) { fprintf(stderr, "ftruncate() failed\n"); return -errno; } map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (map == MAP_FAILED) { fprintf(stderr, "mmap() failed\n"); return -errno; } *map = 0; if (temp_setup_uffd()) return 1; *map = 0; return 0; } -------------------------------------------------------------------------- Above test fails with SIGBUS when there is only a single free hugetlb page. # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # ./test Bus error (core dumped) And worse, with sufficient free hugetlb pages it will map an anonymous page into a shared mapping, for example, messing up accounting during unmap and breaking MAP_SHARED semantics: # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # ./test # cat /proc/meminfo | grep HugePages_ HugePages_Total: 2 HugePages_Free: 1 HugePages_Rsvd: 18446744073709551615 HugePages_Surp: 0 Reason is that uffd-wp doesn't clear the uffd-wp PTE bit when unregistering and consequently keeps the PTE writeprotected. Reason for this is to avoid the additional overhead when unregistering. Note that this is the case also for !hugetlb and that we will end up with writable PTEs that still have the uffd-wp PTE bit set once we return from hugetlb_wp(). I'm not touching the uffd-wp PTE bit for now, because it seems to be a generic thing -- wp_page_reuse() also doesn't clear it. VM_MAYSHARE handling in hugetlb_fault() for FAULT_FLAG_WRITE indicates that MAP_SHARED handling was at least envisioned, but could never have worked as expected. While at it, make sure that we never end up in hugetlb_wp() on write faults without VM_WRITE, because we don't support maybe_mkwrite() semantics as commonly used in the !hugetlb case -- for example, in wp_page_reuse(). Note that there is no need to do any kind of reservation in hugetlb_fault() in this case ... because we already have a hugetlb page mapped R/O that we will simply map writable and we are not dealing with COW/unsharing. Link: https://lkml.kernel.org/r/20220811103435.188481-3-david@redhat.com Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Peter Feiner <pfeiner@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> [5.19] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31mm/uffd: reset write protection when unregister with wp-modePeter Xu1-11/+18
commit f369b07c861435bd812a9d14493f71b34132ed6f upstream. The motivation of this patch comes from a recent report and patchfix from David Hildenbrand on hugetlb shared handling of wr-protected page [1]. With the reproducer provided in commit message of [1], one can leverage the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which can affect not only the attacker process, but also the whole system. The lazy-reset mechanism of uffd-wp was used to make unregister faster, meanwhile it has an assumption that any leftover pgtable entries should only affect the process on its own, so not only the user should be aware of anything it does, but also it should not affect outside of the process. But it seems that this is not true, and it can also be utilized to make some exploit easier. So far there's no clue showing that the lazy-reset is important to any userfaultfd users because normally the unregister will only happen once for a specific range of memory of the lifecycle of the process. Considering all above, what this patch proposes is to do explicit pte resets when unregister an uffd region with wr-protect mode enabled. It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false) right before ioctl(UFFDIO_UNREGISTER) for the user. So potentially it'll make the unregister slower. From that pov it's a very slight abi change, but hopefully nothing should break with this change either. Regarding to the change itself - core of uffd write [un]protect operation is moved into a separate function (uffd_wp_range()) and it is reused in the unregister code path. Note that the new function will not check for anything, e.g. ranges or memory types, because they should have been checked during the previous UFFDIO_REGISTER or it should have failed already. It also doesn't check mmap_changing because we're with mmap write lock held anyway. I added a Fixes upon introducing of uffd-wp shmem+hugetlbfs because that's the only issue reported so far and that's the commit David's reproducer will start working (v5.19+). But the whole idea actually applies to not only file memories but also anonymous. It's just that we don't need to fix anonymous prior to v5.19- because there's no known way to exploit. IOW, this patch can also fix the issue reported in [1] as the patch 2 does. [1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/ Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs") Signed-off-by: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COWDavid Hildenbrand2-43/+91
commit 5535be3099717646781ce1540cf725965d680e7b upstream. Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know that FOLL_FORCE can be possibly dangerous, especially if there are races that can be exploited by user space. Right now, it would be sufficient to have some code that sets a PTE of a R/O-mapped shared page dirty, in order for it to erroneously become writable by FOLL_FORCE. The implications of setting a write-protected PTE dirty might not be immediately obvious to everyone. And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map a shmem page R/O while marking the pte dirty. This can be used by unprivileged user space to modify tmpfs/shmem file content even if the user does not have write permissions to the file, and to bypass memfd write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590). To fix such security issues for good, the insight is that we really only need that fancy retry logic (FOLL_COW) for COW mappings that are not writable (!VM_WRITE). And in a COW mapping, we really only broke COW if we have an exclusive anonymous page mapped. If we have something else mapped, or the mapped anonymous page might be shared (!PageAnonExclusive), we have to trigger a write fault to break COW. If we don't find an exclusive anonymous page when we retry, we have to trigger COW breaking once again because something intervened. Let's move away from this mandatory-retry + dirty handling and rely on our PageAnonExclusive() flag for making a similar decision, to use the same COW logic as in other kernel parts here as well. In case we stumble over a PTE in a COW mapping that does not map an exclusive anonymous page, COW was not properly broken and we have to trigger a fake write-fault to break COW. Just like we do in can_change_pte_writable() added via commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection") and commit 76aefad628aa ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take care of softdirty and uffd-wp manually. For example, a write() via /proc/self/mem to a uffd-wp-protected range has to fail instead of silently granting write access and bypassing the userspace fault handler. Note that FOLL_FORCE is not only used for debug access, but also triggered by applications without debug intentions, for example, when pinning pages via RDMA. This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR. Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So let's just get rid of it. Thanks to Nadav Amit for pointing out that the pte_dirty() check in FOLL_FORCE code is problematic and might be exploitable. Note 1: We don't check for the PTE being dirty because it doesn't matter for making a "was COWed" decision anymore, and whoever modifies the page has to set the page dirty either way. Note 2: Kernels before extended uffd-wp support and before PageAnonExclusive (< 5.19) can simply revert the problematic commit instead and be safe regarding UFFDIO_CONTINUE. A backport to v5.19 requires minor adjustments due to lack of vma_soft_dirty_enabled(). Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: <stable@vger.kernel.org> [5.16] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-21Revert "mm: kfence: apply kmemleak_ignore_phys on early allocated pool"Marco Elver1-9/+9
This reverts commit 07313a2b29ed1079eaa7722624544b97b3ead84b. Commit 0c24e061196c21d5 ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA") is not yet in 5.19 (but appears in 6.0). Without 0c24e061196c21d5, kmemleak still stores phys objects and non-phys objects in the same tree, and ignoring (instead of freeing) will cause insertions into the kmemleak object tree by the slab post-alloc hook to conflict with the pool object (see comment). Reports such as the following would appear on boot, and effectively disable kmemleak: | kmemleak: Cannot insert 0xffffff806e24f000 into the object search tree (overlaps existing) | CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.19.0-v8-0815+ #5 | Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT) | Call trace: | dump_backtrace.part.0+0x1dc/0x1ec | show_stack+0x24/0x80 | dump_stack_lvl+0x8c/0xb8 | dump_stack+0x1c/0x38 | create_object.isra.0+0x490/0x4b0 | kmemleak_alloc+0x3c/0x50 | kmem_cache_alloc+0x2f8/0x450 | __proc_create+0x18c/0x400 | proc_create_reg+0x54/0xd0 | proc_create_seq_private+0x94/0x120 | init_mm_internals+0x1d8/0x248 | kernel_init_freeable+0x188/0x388 | kernel_init+0x30/0x150 | ret_from_fork+0x10/0x20 | kmemleak: Kernel memory leak detector disabled | kmemleak: Object 0xffffff806e24d000 (size 2097152): | kmemleak: comm "swapper", pid 0, jiffies 4294892296 | kmemleak: min_count = -1 | kmemleak: count = 0 | kmemleak: flags = 0x5 | kmemleak: checksum = 0 | kmemleak: backtrace: | kmemleak_alloc_phys+0x94/0xb0 | memblock_alloc_range_nid+0x1c0/0x20c | memblock_alloc_internal+0x88/0x100 | memblock_alloc_try_nid+0x148/0x1ac | kfence_alloc_pool+0x44/0x6c | mm_init+0x28/0x98 | start_kernel+0x178/0x3e8 | __primary_switched+0xc4/0xcc Reported-by: Max Schulze <max.schulze@online.de> Signed-off-by: Marco Elver <elver@google.com> Link: https://lore.kernel.org/all/b33b33bc-2d06-1bcd-2df7-43678962b728@online.de/ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17hugetlb_cgroup: fix wrong hugetlb cgroup numa statMiaohe Lin1-0/+1
[ Upstream commit 2727cfe4072a35ce813e3708f74c135de7da8897 ] We forget to set cft->private for numa stat file. As a result, numa stat of hstates[0] is always showed for all hstates. Encode the hstates index into cft->private to fix this issue. Link: https://lkml.kernel.org/r/20220723073804.53035-1-linmiaohe@huawei.com Fixes: f47761999052 ("hugetlb: add hugetlb.*.numa_stat file") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17mm/damon/reclaim: fix potential memory leak in damon_reclaim_init()Jianglei Nie1-1/+3
[ Upstream commit 188043c7f4f2bd662f2a55957d684fffa543e600 ] damon_reclaim_init() allocates a memory chunk for ctx with damon_new_ctx(). When damon_select_ops() fails, ctx is not released, which will lead to a memory leak. We should release the ctx with damon_destroy_ctx() when damon_select_ops() fails to fix the memory leak. Link: https://lkml.kernel.org/r/20220714063746.2343549-1-niejianglei2021@163.com Fixes: 4d69c3457821 ("mm/damon/reclaim: use damon_select_ops() instead of damon_{v,p}a_set_operations()") Signed-off-by: Jianglei Nie <niejianglei2021@163.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17mm/mmap.c: fix missing call to vm_unacct_memory in mmap_regionMiaohe Lin1-1/+0
[ Upstream commit 7f82f922319ede486540e8746769865b9508d2c2 ] Since the beginning, charged is set to 0 to avoid calling vm_unacct_memory twice because vm_unacct_memory will be called by above unmap_region. But since commit 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces"), unmap_region doesn't call vm_unacct_memory anymore. So charged shouldn't be set to 0 now otherwise the calling to paired vm_unacct_memory will be missed and leads to imbalanced account. Link: https://lkml.kernel.org/r/20220618082027.43391-1-linmiaohe@huawei.com Fixes: 4f74d2c8e827 ("vm: remove 'nr_accounted' calculations from the unmap_vmas() interfaces") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17mm: percpu: use kmemleak_ignore_phys() instead of kmemleak_free()Patrick Wang1-3/+3
[ Upstream commit a317ebccaa3609917a2c021af870cf3fa607ab0c ] Kmemleak recently added a rbtree to store the objects allocted with physical address. Those objects can't be freed with kmemleak_free(). According to the comments, percpu allocations are tracked by kmemleak separately. Kmemleak_free() was used to avoid the unnecessary tracking. If kmemleak_free() fails, those objects would be scanned by kmemleak, which is unnecessary but shouldn't lead to other effects. Use kmemleak_ignore_phys() instead of kmemleak_free() for those objects. Link: https://lkml.kernel.org/r/20220705113158.127600-1-patrick.wang.shcn@gmail.com Fixes: 0c24e061196c ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA") Signed-off-by: Patrick Wang <patrick.wang.shcn@gmail.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17mm/mempolicy: fix get_nodes out of bound accessTianyu Li1-1/+1
[ Upstream commit 000eca5d044d1ee23b4ca311793cf3fc528da6c6 ] When user specified more nodes than supported, get_nodes will access nmask array out of bounds. Link: https://lkml.kernel.org/r/20220601093211.2970565-1-tianyu.li@arm.com Fixes: e130242dc351 ("mm: simplify compat numa syscalls") Signed-off-by: Tianyu Li <tianyu.li@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17kasan: fix zeroing vmalloc memory with HW_TAGSAndrey Konovalov2-14/+28
[ Upstream commit 6c2f761dad7851d8088b91063ccaea3c970efe78 ] HW_TAGS KASAN skips zeroing page_alloc allocations backing vmalloc mappings via __GFP_SKIP_ZERO. Instead, these pages are zeroed via kasan_unpoison_vmalloc() by passing the KASAN_VMALLOC_INIT flag. The problem is that __kasan_unpoison_vmalloc() does not zero pages when either kasan_vmalloc_enabled() or is_vmalloc_or_module_addr() fail. Thus: 1. Change __vmalloc_node_range() to only set KASAN_VMALLOC_INIT when __GFP_SKIP_ZERO is set. 2. Change __kasan_unpoison_vmalloc() to always zero pages when the KASAN_VMALLOC_INIT flag is set. 3. Add WARN_ON() asserts to check that KASAN_VMALLOC_INIT cannot be set in other early return paths of __kasan_unpoison_vmalloc(). Also clean up the comment in __kasan_unpoison_vmalloc. Link: https://lkml.kernel.org/r/4bc503537efdc539ffc3f461c1b70162eea31cf6.1654798516.git.andreyknvl@google.com Fixes: 23689e91fb22 ("kasan, vmalloc: add vmalloc tagging for HW_TAGS") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17mm: introduce clear_highpage_kasan_taggedAndrey Konovalov1-6/+2
commit d9da8f6cf55eeca642c021912af1890002464c64 upstream. Add a clear_highpage_kasan_tagged() helper that does clear_highpage() on a page potentially tagged by KASAN. This helper is used by the following patch. Link: https://lkml.kernel.org/r/4471979b46b2c487787ddcd08b9dc5fedd1b6ffd.1654798516.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-17mm/migration: fix potential pte_unmap on an not mapped pteMiaohe Lin2-6/+21
[ Upstream commit ad1ac596e8a8c4b06715dfbd89853eb73c9886b2 ] __migration_entry_wait and migration_entry_wait_on_locked assume pte is always mapped from caller. But this is not the case when it's called from migration_entry_wait_huge and follow_huge_pmd. Add a hugetlbfs variant that calls hugetlb_migration_entry_wait(ptep == NULL) to fix this issue. Link: https://lkml.kernel.org/r/20220530113016.16663-5-linmiaohe@huawei.com Fixes: 30dad30922cc ("mm: migration: add migrate_entry_wait_huge()") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Howells <dhowells@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: kernel test robot <lkp@intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17mm/migration: return errno when isolate_huge_page failedMiaohe Lin6-13/+13
[ Upstream commit 7ce82f4c3f3ead13a9d9498768e3b1a79975c4d8 ] We might fail to isolate huge page due to e.g. the page is under migration which cleared HPageMigratable. We should return errno in this case rather than always return 1 which could confuse the user, i.e. the caller might think all of the memory is migrated while the hugetlb page is left behind. We make the prototype of isolate_huge_page consistent with isolate_lru_page as suggested by Huang Ying and rename isolate_huge_page to isolate_hugetlb as suggested by Muchun to improve the readability. Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com Fixes: e8db67eb0ded ("mm: migrate: move_pages() supports thp migration") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Huang Ying <ying.huang@intel.com> Reported-by: kernel test robot <lkp@intel.com> (build error) Cc: Alistair Popple <apopple@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17mm/memremap: fix memunmap_pages() race with get_dev_pagemap()Miaohe Lin1-1/+1
[ Upstream commit 1e57ffb6e3fd9583268c6462c4e3853575b21701 ] Think about the below scene: CPU1 CPU2 memunmap_pages percpu_ref_exit __percpu_ref_exit free_percpu(percpu_count); /* percpu_count is freed here! */ get_dev_pagemap xa_load(&pgmap_array, PHYS_PFN(phys)) /* pgmap still in the pgmap_array */ percpu_ref_tryget_live(&pgmap->ref) if __ref_is_percpu /* __PERCPU_REF_ATOMIC_DEAD not set yet */ this_cpu_inc(*percpu_count) /* access freed percpu_count here! */ ref->percpu_count_ptr = __PERCPU_REF_ATOMIC_DEAD; /* too late... */ pageunmap_range To fix the issue, do percpu_ref_exit() after pgmap_array is emptied. So we won't do percpu_ref_tryget_live() against a being freed percpu_ref. Link: https://lkml.kernel.org/r/20220609121305.2508-1-linmiaohe@huawei.com Fixes: b7b3c01b1915 ("mm/memremap_pages: support multiple ranges per invocation") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-17mm: Account dirty folios properly during splitsMatthew Wilcox (Oracle)1-3/+8
[ Upstream commit fb5c2029f8221e904e604938171c4a8ef169aadb ] If the last folio in a file is split as a result of truncation, we simply clear the dirty bits for the pages we're discarding. That causes NR_FILE_DIRTY (among other counters) to be thrown off and eventually Linux will hang in balance_dirty_pages_ratelimited() Reported-by: Dave Chinner <dchinner@redhat.com> Tested-by: Dave Chinner <dchinner@redhat.com> Tested-by: Darrick J. Wong <djwong@kernel.org> Fixes: d68eccad3706 ("mm/filemap: Allow large folios to be added to the page cache") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-30Merge tag 'mm-hotfixes-stable-2022-07-29' of ↵Linus Torvalds2-15/+16
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Two hotfixes, both cc:stable" * tag 'mm-hotfixes-stable-2022-07-29' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/hmm: fault non-owner device private entries page_alloc: fix invalid watermark check on a negative value
2022-07-29mm/hmm: fault non-owner device private entriesRalph Campbell1-11/+8
If hmm_range_fault() is called with the HMM_PFN_REQ_FAULT flag and a device private PTE is found, the hmm_range::dev_private_owner page is used to determine if the device private page should not be faulted in. However, if the device private page is not owned by the caller, hmm_range_fault() returns an error instead of calling migrate_to_ram() to fault in the page. For example, if a page is migrated to GPU private memory and a RDMA fault capable NIC tries to read the migrated page, without this patch it will get an error. With this patch, the page will be migrated back to system memory and the NIC will be able to read the data. Link: https://lkml.kernel.org/r/20220727000837.4128709-2-rcampbell@nvidia.com Link: https://lkml.kernel.org/r/20220725183615.4118795-2-rcampbell@nvidia.com Fixes: 08ddddda667b ("mm/hmm: check the device private page owner in hmm_range_fault()") Signed-off-by: Ralph Campbell <rcampbell@nvidia.com> Reported-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Philip Yang <Philip.Yang@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-29page_alloc: fix invalid watermark check on a negative valueJaewon Kim1-4/+8
There was a report that a task is waiting at the throttle_direct_reclaim. The pgscan_direct_throttle in vmstat was increasing. This is a bug where zone_watermark_fast returns true even when the free is very low. The commit f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast") changed the watermark fast to consider highatomic reserve. But it did not handle a negative value case which can be happened when reserved_highatomic pageblock is bigger than the actual free. If watermark is considered as ok for the negative value, allocating contexts for order-0 will consume all free pages without direct reclaim, and finally free page may become depleted except highatomic free. Then allocating contexts may fall into throttle_direct_reclaim. This symptom may easily happen in a system where wmark min is low and other reclaimers like kswapd does not make free pages quickly. Handle the negative case by using MIN. Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com Fixes: f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast") Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Reported-by: GyeongHwan Hong <gh21.hong@samsung.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yong-Taek Lee <ytk.lee@samsung.com> Cc: <stable@vger.kerenl.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-27Merge tag 'mm-hotfixes-stable-2022-07-26' of ↵Linus Torvalds7-30/+57
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Thirteen hotfixes. Eight are cc:stable and the remainder are for post-5.18 issues or are too minor to warrant backporting" * tag 'mm-hotfixes-stable-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: update Gao Xiang's email addresses userfaultfd: provide properly masked address for huge-pages Revert "ocfs2: mount shared volume without ha stack" hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte fs: sendfile handles O_NONBLOCK of out_fd ntfs: fix use-after-free in ntfs_ucsncmp() secretmem: fix unhandled fault in truncate mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range() mm: fix missing wake-up event for FSDAX pages mm: fix page leak with multiple threads mapping the same page mailmap: update Seth Forshee's email address tmpfs: fix the issue that the mount and remount results are inconsistent. mm: kfence: apply kmemleak_ignore_phys on early allocated pool
2022-07-26mm: fix NULL pointer dereference in wp_page_reuse()Qi Zheng1-1/+1
The vmf->page can be NULL when the wp_page_reuse() is invoked by wp_pfn_shared(), it will cause the following panic: BUG: kernel NULL pointer dereference, address: 000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 18 PID: 923 Comm: Xorg Not tainted 5.19.0-rc8.bm.1-amd64 #263 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g14 RIP: 0010:_compound_head+0x0/0x40 [...] Call Trace: wp_page_reuse+0x1c/0xa0 do_wp_page+0x1a5/0x3f0 __handle_mm_fault+0x8cf/0xd20 handle_mm_fault+0xd5/0x2a0 do_user_addr_fault+0x1d0/0x680 exc_page_fault+0x78/0x170 asm_exc_page_fault+0x22/0x30 To fix it, this patch performs a NULL pointer check before dereferencing the vmf->page. Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-07-19hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pteMiaohe Lin1-0/+1
When alloc_huge_page fails, *pagep is set to NULL without put_page first. So the hugepage indicated by *pagep is leaked. Link: https://lkml.kernel.org/r/20220709092629.54291-1-linmiaohe@huawei.com Fixes: 8cc5fcbb5be8 ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-19secretmem: fix unhandled fault in truncateMike Rapoport1-7/+26
syzkaller reports the following issue: BUG: unable to handle page fault for address: ffff888021f7e005 PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060 Oops: 0002 [#1] PREEMPT SMP KASAN CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64 Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01 RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005 RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005 R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005 R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb FS: 00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> zero_user_segments include/linux/highmem.h:272 [inline] folio_zero_range include/linux/highmem.h:428 [inline] truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237 truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381 truncate_inode_pages mm/truncate.c:452 [inline] truncate_pagecache+0x63/0x90 mm/truncate.c:753 simple_setattr+0xed/0x110 fs/libfs.c:535 secretmem_setattr+0xae/0xf0 mm/secretmem.c:170 notify_change+0xb8c/0x12b0 fs/attr.c:424 do_truncate+0x13c/0x200 fs/open.c:65 do_sys_ftruncate+0x536/0x730 fs/open.c:193 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x7fb29d900899 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899 RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003 RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000 </TASK> Modules linked in: CR2: ffff888021f7e005 ---[ end trace 0000000000000000 ]--- Eric Biggers suggested that this happens when secretmem_setattr()->simple_setattr() races with secretmem_fault() so that a page that is faulted in by secretmem_fault() (and thus removed from the direct map) is zeroed by inode truncation right afterwards. Use mapping->invalidate_lock to make secretmem_fault() and secretmem_setattr() mutually exclusive. [rppt@linux.ibm.com: v3] Link: https://lkml.kernel.org/r/20220714091337.412297-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20220707165650.248088-1-rppt@kernel.org Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Suggested-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-19mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()Naoya Horiguchi1-2/+7
Originally copy_hugetlb_page_range() handles migration entries and hwpoisoned entries in similar manner. But recently the related code path has more code for migration entries, and when is_writable_migration_entry() was converted to !is_readable_migration_entry(), hwpoison entries on source processes got to be unexpectedly updated (which is legitimate for migration entries, but not for hwpoison entries). This results in unexpected serious issues like kernel panic when forking processes with hwpoison entries in pmd. Separate the if branch into one for hwpoison entries and one for migration entries. Link: https://lkml.kernel.org/r/20220704013312.2415700-3-naoya.horiguchi@linux.dev Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> [5.18] Cc: David Hildenbrand <david@redhat.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-19mm: fix missing wake-up event for FSDAX pagesMuchun Song2-5/+7
FSDAX page refcounts are 1-based, rather than 0-based: if refcount is 1, then the page is freed. The FSDAX pages can be pinned through GUP, then they will be unpinned via unpin_user_page() using a folio variant to put the page, however, folio variants did not consider this special case, the result will be to miss a wakeup event (like the user of __fuse_dax_break_layouts()). This results in a task being permanently stuck in TASK_INTERRUPTIBLE state. Since FSDAX pages are only possibly obtained by GUP users, so fix GUP instead of folio_put() to lower overhead. Link: https://lkml.kernel.org/r/20220705123532.283-1-songmuchun@bytedance.com Fixes: d8ddc099c6b3 ("mm/gup: Add gup_put_folio()") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-19mm: fix page leak with multiple threads mapping the same pageJosef Bacik1-2/+5
We have an application with a lot of threads that use a shared mmap backed by tmpfs mounted with -o huge=within_size. This application started leaking loads of huge pages when we upgraded to a recent kernel. Using the page ref tracepoints and a BPF program written by Tejun Heo we were able to determine that these pages would have multiple refcounts from the page fault path, but when it came to unmap time we wouldn't drop the number of refs we had added from the faults. I wrote a reproducer that mmap'ed a file backed by tmpfs with -o huge=always, and then spawned 20 threads all looping faulting random offsets in this map, while using madvise(MADV_DONTNEED) randomly for huge page aligned ranges. This very quickly reproduced the problem. The problem here is that we check for the case that we have multiple threads faulting in a range that was previously unmapped. One thread maps the PMD, the other thread loses the race and then returns 0. However at this point we already have the page, and we are no longer putting this page into the processes address space, and so we leak the page. We actually did the correct thing prior to f9ce0be71d1f, however it looks like Kirill copied what we do in the anonymous page case. In the anonymous page case we don't yet have a page, so we don't have to drop a reference on anything. Previously we did the correct thing for file based faults by returning VM_FAULT_NOPAGE so we correctly drop the reference on the page we faulted in. Fix this by returning VM_FAULT_NOPAGE in the pmd_devmap_trans_unstable() case, this makes us drop the ref on the page properly, and now my reproducer no longer leaks the huge pages. [josef@toxicpanda.com: v2] Link: https://lkml.kernel.org/r/e90c8f0dbae836632b669c2afc434006a00d4a67.1657721478.git.josef@toxicpanda.com Link: https://lkml.kernel.org/r/2b798acfd95c9ab9395fe85e8d5a835e2e10a920.1657051137.git.josef@toxicpanda.com Fixes: f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-19tmpfs: fix the issue that the mount and remount results are inconsistent.ZhaoLong Wang1-5/+2
An undefined-behavior issue has not been completely fixed since commit d14f5efadd84 ("tmpfs: fix undefined-behaviour in shmem_reconfigure()"). In the commit, check in the shmem_reconfigure() is added in remount process to avoid the Ubsan problem. However, the check is not added to the mount process. It causes inconsistent results between mount and remount. The operations to reproduce the problem in user mode as follows: If nr_blocks is set to 0x8000000000000000, the mounting is successful. # mount tmpfs /dev/shm/ -t tmpfs -o nr_blocks=0x8000000000000000 However, when -o remount is used, the mount fails because of the check in the shmem_reconfigure() # mount tmpfs /dev/shm/ -t tmpfs -o remount,nr_blocks=0x8000000000000000 mount: /dev/shm: mount point not mounted or bad option. Therefore, add checks in the shmem_parse_one() function and remove the check in shmem_reconfigure() to avoid this problem. Link: https://lkml.kernel.org/r/20220629124324.1640807-1-wangzhaolong1@huawei.com Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com> Cc: Luo Meng <luomeng12@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Zhihao Cheng <chengzhihao1@huawei.com> Cc: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-19mm: kfence: apply kmemleak_ignore_phys on early allocated poolYee Lee1-9/+9
This patch solves two issues. (1) The pool allocated by memblock needs to unregister from kmemleak scanning. Apply kmemleak_ignore_phys to replace the original kmemleak_free as its address now is stored in the phys tree. (2) The pool late allocated by page-alloc doesn't need to unregister. Move out the freeing operation from its call path. Link: https://lkml.kernel.org/r/20220628113714.7792-2-yee.lee@mediatek.com Fixes: 0c24e061196c21d5 ("mm: kmemleak: add rbtree and store physical address for objects allocated with PA") Signed-off-by: Yee Lee <yee.lee@mediatek.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Suggested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-04mm: split huge PUD on wp_huge_pud fallbackGowans, James1-13/+14
Currently the implementation will split the PUD when a fallback is taken inside the create_huge_pud function. This isn't where it should be done: the splitting should be done in wp_huge_pud, just like it's done for PMDs. Reason being that if a callback is taken during create, there is no PUD yet so nothing to split, whereas if a fallback is taken when encountering a write protection fault there is something to split. It looks like this was the original intention with the commit where the splitting was introduced, but somehow it got moved to the wrong place between v1 and v2 of the patch series. Rebase mistake perhaps. Link: https://lkml.kernel.org/r/6f48d622eb8bce1ae5dd75327b0b73894a2ec407.camel@amazon.com Fixes: 327e9fd48972 ("mm: Split huge pages on write-notify or COW") Signed-off-by: James Gowans <jgowans@amazon.com> Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-04mm/rmap: fix dereferencing invalid subpage pointer in try_to_migrate_one()David Hildenbrand1-10/+17
The subpage we calculate is an invalid pointer for device private pages, because device private pages are mapped via non-present device private entries, not ordinary present PTEs. Let's just not compute broken pointers and fixup later. Move the proper assignment of the correct subpage to the beginning of the function and assert that we really only have a single page in our folio. This currently results in a BUG when tying to compute anon_exclusive, because: [ 528.727237] BUG: unable to handle page fault for address: ffffea1fffffffc0 [ 528.739585] #PF: supervisor read access in kernel mode [ 528.745324] #PF: error_code(0x0000) - not-present page [ 528.751062] PGD 44eaf2067 P4D 44eaf2067 PUD 0 [ 528.756026] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 528.760890] CPU: 120 PID: 18275 Comm: hmm-tests Not tainted 5.19.0-rc3-kfd-alex #257 [ 528.769542] Hardware name: AMD Corporation BardPeak/BardPeak, BIOS RTY1002BDS 09/17/2021 [ 528.778579] RIP: 0010:try_to_migrate_one+0x21a/0x1000 [ 528.784225] Code: f6 48 89 c8 48 2b 05 45 d1 6a 01 48 c1 f8 06 48 29 c3 48 8b 45 a8 48 c1 e3 06 48 01 cb f6 41 18 01 48 89 85 50 ff ff ff 74 0b <4c> 8b 33 49 c1 ee 11 41 83 e6 01 48 8b bd 48 ff ff ff e8 3f 99 02 [ 528.805194] RSP: 0000:ffffc90003cdfaa0 EFLAGS: 00010202 [ 528.811027] RAX: 00007ffff7ff4000 RBX: ffffea1fffffffc0 RCX: ffffeaffffffffc0 [ 528.818995] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90003cdfaf8 [ 528.826962] RBP: ffffc90003cdfb70 R08: 0000000000000000 R09: 0000000000000000 [ 528.834930] R10: ffffc90003cdf910 R11: 0000000000000002 R12: ffff888194450540 [ 528.842899] R13: ffff888160d057c0 R14: 0000000000000000 R15: 03ffffffffffffff [ 528.850865] FS: 00007ffff7fdb740(0000) GS:ffff8883b0600000(0000) knlGS:0000000000000000 [ 528.859891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 528.866308] CR2: ffffea1fffffffc0 CR3: 00000001562b4003 CR4: 0000000000770ee0 [ 528.874275] PKRU: 55555554 [ 528.877286] Call Trace: [ 528.880016] <TASK> [ 528.882356] ? lock_is_held_type+0xdf/0x130 [ 528.887033] rmap_walk_anon+0x167/0x410 [ 528.891316] try_to_migrate+0x90/0xd0 [ 528.895405] ? try_to_unmap_one+0xe10/0xe10 [ 528.900074] ? anon_vma_ctor+0x50/0x50 [ 528.904260] ? put_anon_vma+0x10/0x10 [ 528.908347] ? invalid_mkclean_vma+0x20/0x20 [ 528.913114] migrate_vma_setup+0x5f4/0x750 [ 528.917691] dmirror_devmem_fault+0x8c/0x250 [test_hmm] [ 528.923532] do_swap_page+0xac0/0xe50 [ 528.927623] ? __lock_acquire+0x4b2/0x1ac0 [ 528.932199] __handle_mm_fault+0x949/0x1440 [ 528.936876] handle_mm_fault+0x13f/0x3e0 [ 528.941256] do_user_addr_fault+0x215/0x740 [ 528.945928] exc_page_fault+0x75/0x280 [ 528.950115] asm_exc_page_fault+0x27/0x30 [ 528.954593] RIP: 0033:0x40366b ... Link: https://lkml.kernel.org/r/20220623205332.319257-1-david@redhat.com Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Tested-by: Alistair Popple <apopple@nvidia.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@lst.de> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-04mm: sparsemem: fix missing higher order allocation splittingMuchun Song1-0/+8
Higher order allocations for vmemmap pages from buddy allocator must be able to be treated as indepdenent small pages as they can be freed individually by the caller. There is no problem for higher order vmemmap pages allocated at boot time since each individual small page will be initialized at boot time. However, it will be an issue for memory hotplug case since those higher order vmemmap pages are allocated from buddy allocator without initializing each individual small page's refcount. The system will panic in put_page_testzero() when CONFIG_DEBUG_VM is enabled if the vmemmap page is freed. Link: https://lkml.kernel.org/r/20220620023019.94257-1-songmuchun@bytedance.com Fixes: d8d55f5616cf ("mm: sparsemem: use page table lock to protect kernel pmd operations") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-04mm/damon: use set_huge_pte_at() to make huge pte oldBaolin Wang1-2/+1
The huge_ptep_set_access_flags() can not make the huge pte old according to the discussion [1], that means we will always mornitor the young state of the hugetlb though we stopped accessing the hugetlb, as a result DAMON will get inaccurate accessing statistics. So changing to use set_huge_pte_at() to make the huge pte old to fix this issue. [1] https://lore.kernel.org/all/Yqy97gXI4Nqb7dYo@arm.com/ Link: https://lkml.kernel.org/r/1655692482-28797-1-git-send-email-baolin.wang@linux.alibaba.com Fixes: 49f4203aae06 ("mm/damon: add access checking for hugetlb pages") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-07-04mm: userfaultfd: fix UFFDIO_CONTINUE on fallocated shmem pagesAxel Rasmussen1-1/+4
When fallocate() is used on a shmem file, the pages we allocate can end up with !PageUptodate. Since UFFDIO_CONTINUE tries to find the existing page the user wants to map with SGP_READ, we would fail to find such a page, since shmem_getpage_gfp returns with a "NULL" pagep for SGP_READ if it discovers !PageUptodate. As a result, UFFDIO_CONTINUE returns -EFAULT, as it would do if the page wasn't found in the page cache at all. This isn't the intended behavior. UFFDIO_CONTINUE is just trying to find if a page exists, and doesn't care whether it still needs to be cleared or not. So, instead of SGP_READ, pass in SGP_NOALLOC. This is the same, except for one critical difference: in the !PageUptodate case, SGP_NOALLOC will clear the page and then return it. With this change, UFFDIO_CONTINUE works properly (succeeds) on a shmem file which has been fallocated, but otherwise not modified. Link: https://lkml.kernel.org/r/20220610173812.1768919-1-axelrasmussen@google.com Fixes: 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem") Signed-off-by: Axel Rasmussen <axelrasmussen@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-06-27Merge tag 'mm-hotfixes-stable-2022-06-26' of ↵Linus Torvalds8-6/+31
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "Minor things, mainly - mailmap updates, MAINTAINERS updates, etc. Fixes for this merge window: - fix for a damon boot hang, from SeongJae - fix for a kfence warning splat, from Jason Donenfeld - fix for zero-pfn pinning, from Alex Williamson - fix for fallocate hole punch clearing, from Mike Kravetz Fixes for previous releases: - fix for a performance regression, from Marcelo - fix for a hwpoisining BUG from zhenwei pi" * tag 'mm-hotfixes-stable-2022-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mailmap: add entry for Christian Marangi mm/memory-failure: disable unpoison once hw error happens hugetlbfs: zero partial pages during fallocate hole punch mm: memcontrol: reference to tools/cgroup/memcg_slabinfo.py mm: re-allow pinning of zero pfns mm/kfence: select random number before taking raw lock MAINTAINERS: add maillist information for LoongArch MAINTAINERS: update MM tree references MAINTAINERS: update Abel Vesa's email MAINTAINERS: add MEMORY HOT(UN)PLUG section and add David as reviewer MAINTAINERS: add Miaohe Lin as a memory-failure reviewer mailmap: add alias for jarkko@profian.com mm/damon/reclaim: schedule 'damon_reclaim_timer' only after 'system_wq' is initialized kthread: make it clear that kthread_create_on_node() might be terminated by any fatal signal mm: lru_cache_disable: use synchronize_rcu_expedited mm/page_isolation.c: fix one kernel-doc comment
2022-06-23filemap: Fix serialization adding transparent huge pages to page cacheAlistair Popple1-0/+2
Commit 793917d997df ("mm/readahead: Add large folio readahead") introduced support for using large folios for filebacked pages if the filesystem supports it. page_cache_ra_order() was introduced to allocate and add these large folios to the page cache. However adding pages to the page cache should be serialized against truncation and hole punching by taking invalidate_lock. Not doing so can lead to data races resulting in stale data getting added to the page cache and marked up-to-date. See commit 730633f0b7f9 ("mm: Protect operations adding pages to page cache with invalidate_lock") for more details. This issue was found by inspection but a testcase revealed it was possible to observe in practice on XFS. Fix this by taking invalidate_lock in page_cache_ra_order(), to mirror what is done for the non-thp case in page_cache_ra_unbounded(). Signed-off-by: Alistair Popple <apopple@nvidia.com> Fixes: 793917d997df ("mm/readahead: Add large folio readahead") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-23mm: Clear page->private when splitting or migrating a pageMatthew Wilcox (Oracle)2-0/+2
In our efforts to remove uses of PG_private, we have found folios with the private flag clear and folio->private not-NULL. That is the root cause behind 642d51fb0775 ("ceph: check folio PG_private bit instead of folio->private"). It can also affect a few other filesystems that haven't yet reported a problem. compaction_alloc() can return a page with uninitialised page->private, and rather than checking all the callers of migrate_pages(), just zero page->private after calling get_new_page(). Similarly, the tail pages from split_huge_page() may also have an uninitialised page->private. Reported-by: Xiubo Li <xiubli@redhat.com> Tested-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-20filemap: Handle sibling entries in filemap_get_read_batch()Matthew Wilcox (Oracle)1-0/+2
If a read races with an invalidation followed by another read, it is possible for a folio to be replaced with a higher-order folio. If that happens, we'll see a sibling entry for the new folio in the next iteration of the loop. This manifests as a NULL pointer dereference while holding the RCU read lock. Handle this by simply returning. The next call will find the new folio and handle it correctly. The other ways of handling this rare race are more complex and it's just not worth it. Reported-by: Dave Chinner <david@fromorbit.com> Reported-by: Brian Foster <bfoster@redhat.com> Debugged-by: Brian Foster <bfoster@redhat.com> Tested-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Fixes: cbd59c48ae2b ("mm/filemap: use head pages in generic_file_buffered_read") Cc: stable@vger.kernel.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-20filemap: Correct the conditions for marking a folio as accessedMatthew Wilcox (Oracle)1-3/+10
We had an off-by-one error which meant that we never marked the first page in a read as accessed. This was visible as a slowdown when re-reading a file as pages were being evicted from cache too soon. In reviewing this code, we noticed a second bug where a multi-page folio would be marked as accessed multiple times when doing reads that were less than the size of the folio. Abstract the comparison of whether two file positions are in the same folio into a new function, fixing both of these bugs. Reported-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2022-06-20Merge tag 'slab-for-5.19-fixup' of ↵Linus Torvalds1-7/+36
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fixes from Vlastimil Babka: - A slub fix for PREEMPT_RT locking semantics from Sebastian. - A slub fix for state corruption due to a possible race scenario from Jann. * tag 'slab-for-5.19-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slub: add missing TID updates on slab deactivation mm/slub: Move the stackdepot related allocation out of IRQ-off section.
2022-06-17Merge tag 'fs_for_v5.19-rc3' of ↵Linus Torvalds1-9/+2
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull writeback and ext2 fixes from Jan Kara: "A fix for writeback bug which prevented machines with kdevtmpfs from booting and also one small ext2 bugfix in IO error handling" * tag 'fs_for_v5.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: init: Initialize noop_backing_dev_info early ext2: fix fs corruption when trying to remove a non-empty directory with IO error