summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2022-09-12mm/memory-failure: fix detection of memory_failure() handlersDan Williams1-1/+1
Some pagemap types, like MEMORY_DEVICE_GENERIC (device-dax) do not even have pagemap ops which results in crash signatures like this: BUG: kernel NULL pointer dereference, address: 0000000000000010 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 8000000205073067 P4D 8000000205073067 PUD 2062b3067 PMD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 22 PID: 4535 Comm: device-dax Tainted: G OE N 6.0.0-rc2+ #59 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:memory_failure+0x667/0xba0 [..] Call Trace: <TASK> ? _printk+0x58/0x73 do_madvise.part.0.cold+0xaf/0xc5 Check for ops before checking if the ops have a memory_failure() handler. Link: https://lkml.kernel.org/r/166153428781.2758201.1990616683438224741.stgit@dwillia2-xfh.jf.intel.com Fixes: 33a8f7f2b3a3 ("pagemap,pmem: introduce ->memory_failure()") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Shiyang Ruan <ruansy.fnst@fujitsu.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Jane Chu <jane.chu@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-12mm/page_alloc: fix race condition between build_all_zonelists and page ↵Mel Gorman1-10/+43
allocation Patrick Daly reported the following problem; NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation [0] - ZONE_MOVABLE [1] - ZONE_NORMAL [2] - NULL For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the offset of ZONE_NORMAL in ac->preferred_zoneref. If a concurrent memory_offline operation removes the last page from ZONE_MOVABLE, build_all_zonelists() & build_zonerefs_node() will update node_zonelists as shown below. Only populated zones are added. NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation [0] - ZONE_NORMAL [1] - NULL [2] - NULL The race is simple -- page allocation could be in progress when a memory hot-remove operation triggers a zonelist rebuild that removes zones. The allocation request will still have a valid ac->preferred_zoneref that is now pointing to NULL and triggers an OOM kill. This problem probably always existed but may be slightly easier to trigger due to 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") which distinguishes between zones that are completely unpopulated versus zones that have valid pages not managed by the buddy allocator (e.g. reserved, memblock, ballooning etc). Memory hotplug had multiple stages with timing considerations around managed/present page updates, the zonelist rebuild and the zone span updates. As David Hildenbrand puts it memory offlining adjusts managed+present pages of the zone essentially in one go. If after the adjustments, the zone is no longer populated (present==0), we rebuild the zone lists. Once that's done, we try shrinking the zone (start+spanned pages) -- which results in zone_start_pfn == 0 if there are no more pages. That happens *after* rebuilding the zonelists via remove_pfn_range_from_zone(). The only requirement to fix the race is that a page allocation request identifies when a zonelist rebuild has happened since the allocation request started and no page has yet been allocated. Use a seqlock_t to track zonelist updates with a lockless read-side of the zonelist and protecting the rebuild and update of the counter with a spinlock. [akpm@linux-foundation.org: make zonelist_update_seq static] Link: https://lkml.kernel.org/r/20220824110900.vh674ltxmzb3proq@techsingularity.net Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Patrick Daly <quic_pdaly@quicinc.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: <stable@vger.kernel.org> [4.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-09mm/slub: fix to return errno if kmalloc() failsChao Yu1-1/+4
In create_unique_id(), kmalloc(, GFP_KERNEL) can fail due to out-of-memory, if it fails, return errno correctly rather than triggering panic via BUG_ON(); kernel BUG at mm/slub.c:5893! Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Call trace: sysfs_slab_add+0x258/0x260 mm/slub.c:5973 __kmem_cache_create+0x60/0x118 mm/slub.c:4899 create_cache mm/slab_common.c:229 [inline] kmem_cache_create_usercopy+0x19c/0x31c mm/slab_common.c:335 kmem_cache_create+0x1c/0x28 mm/slab_common.c:390 f2fs_kmem_cache_create fs/f2fs/f2fs.h:2766 [inline] f2fs_init_xattr_caches+0x78/0xb4 fs/f2fs/xattr.c:808 f2fs_fill_super+0x1050/0x1e0c fs/f2fs/super.c:4149 mount_bdev+0x1b8/0x210 fs/super.c:1400 f2fs_mount+0x44/0x58 fs/f2fs/super.c:4512 legacy_get_tree+0x30/0x74 fs/fs_context.c:610 vfs_get_tree+0x40/0x140 fs/super.c:1530 do_new_mount+0x1dc/0x4e4 fs/namespace.c:3040 path_mount+0x358/0x914 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __arm64_sys_mount+0x2f8/0x408 fs/namespace.c:3568 Cc: <stable@kernel.org> Fixes: 81819f0fc8285 ("SLUB core") Reported-by: syzbot+81684812ea68216e08c5@syzkaller.appspotmail.com Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Chao Yu <chao.yu@oppo.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-07freezer,sched: Rewrite core freezer logicPeter Zijlstra1-2/+2
Rewrite the core freezer to behave better wrt thawing and be simpler in general. By replacing PF_FROZEN with TASK_FROZEN, a special block state, it is ensured frozen tasks stay frozen until thawed and don't randomly wake up early, as is currently possible. As such, it does away with PF_FROZEN and PF_FREEZER_SKIP, freeing up two PF_flags (yay!). Specifically; the current scheme works a little like: freezer_do_not_count(); schedule(); freezer_count(); And either the task is blocked, or it lands in try_to_freezer() through freezer_count(). Now, when it is blocked, the freezer considers it frozen and continues. However, on thawing, once pm_freezing is cleared, freezer_count() stops working, and any random/spurious wakeup will let a task run before its time. That is, thawing tries to thaw things in explicit order; kernel threads and workqueues before doing bringing SMP back before userspace etc.. However due to the above mentioned races it is entirely possible for userspace tasks to thaw (by accident) before SMP is back. This can be a fatal problem in asymmetric ISA architectures (eg ARMv9) where the userspace task requires a special CPU to run. As said; replace this with a special task state TASK_FROZEN and add the following state transitions: TASK_FREEZABLE -> TASK_FROZEN __TASK_STOPPED -> TASK_FROZEN __TASK_TRACED -> TASK_FROZEN The new TASK_FREEZABLE can be set on any state part of TASK_NORMAL (IOW. TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE) -- any such state is already required to deal with spurious wakeups and the freezer causes one such when thawing the task (since the original state is lost). The special __TASK_{STOPPED,TRACED} states *can* be restored since their canonical state is in ->jobctl. With this, frozen tasks need an explicit TASK_FROZEN wakeup and are free of undue (early / spurious) wakeups. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lore.kernel.org/r/20220822114649.055452969@infradead.org
2022-09-03mm: pagewalk: Fix race between unmap and page walkerSteven Price2-11/+14
The mmap lock protects the page walker from changes to the page tables during the walk. However a read lock is insufficient to protect those areas which don't have a VMA as munmap() detaches the VMAs before downgrading to a read lock and actually tearing down PTEs/page tables. For users of walk_page_range() the solution is to simply call pte_hole() immediately without checking the actual page tables when a VMA is not present. We now never call __walk_page_range() without a valid vma. For walk_page_range_novma() the locking requirements are tightened to require the mmap write lock to be taken, and then walking the pgd directly with 'no_vma' set. This in turn means that all page walkers either have a valid vma, or it's that special 'novma' case for page table debugging. As a result, all the odd '(!walk->vma && !walk->no_vma)' tests can be removed. Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Steven Price <steven.price@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-09-01Merge tag 'slab-for-6.0-rc4' of ↵Linus Torvalds1-16/+29
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fix from Vlastimil Babka: - A fix from Waiman Long to avoid a theoretical deadlock reported by lockdep. * tag 'slab-for-6.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding slab_mutex/cpu_hotplug_lock
2022-09-01mm/slab_common: Deleting kobject in kmem_cache_destroy() without holding ↵Waiman Long1-16/+29
slab_mutex/cpu_hotplug_lock A circular locking problem is reported by lockdep due to the following circular locking dependency. +--> cpu_hotplug_lock --> slab_mutex --> kn->active --+ | | +-----------------------------------------------------+ The forward cpu_hotplug_lock ==> slab_mutex ==> kn->active dependency happens in kmem_cache_destroy(): cpus_read_lock(); mutex_lock(&slab_mutex); ==> sysfs_slab_unlink() ==> kobject_del() ==> kernfs_remove() ==> __kernfs_remove() ==> kernfs_drain(): rwsem_acquire(&kn->dep_map, ...); The backward kn->active ==> cpu_hotplug_lock dependency happens in kernfs_fop_write_iter(): kernfs_get_active(); ==> slab_attr_store() ==> cpu_partial_store() ==> flush_all(): cpus_read_lock() One way to break this circular locking chain is to avoid holding cpu_hotplug_lock and slab_mutex while deleting the kobject in sysfs_slab_unlink() which should be equivalent to doing a write_lock and write_unlock pair of the kn->active virtual lock. Since the kobject structures are not protected by slab_mutex or the cpu_hotplug_lock, we can certainly release those locks before doing the delete operation. Move sysfs_slab_unlink() and sysfs_slab_release() to the newly created kmem_cache_release() and call it outside the slab_mutex & cpu_hotplug_lock critical sections. There will be a slight delay in the deletion of sysfs files if kmem_cache_release() is called indirectly from a work function. Fixes: 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context") Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Rientjes <rientjes@google.com> Link: https://lore.kernel.org/all/YwOImVd+nRUsSAga@hyeyoo/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01mm/sl[au]b: check if large object is valid in __ksize()Hyeonggon Yoo1-1/+6
If address of large object is not beginning of folio or size of the folio is too small, it must be invalid. WARN() and return 0 in such cases. Cc: Marco Elver <elver@google.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01mm/slab_common: move declaration of __ksize() to mm/slab.hHyeonggon Yoo3-11/+3
__ksize() is only called by KASAN. Remove export symbol and move declaration to mm/slab.h as we don't want to grow its callers. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01mm/slab_common: drop kmem_alloc & avoid dereferencing fields when not usingHyeonggon Yoo4-29/+22
Drop kmem_alloc event class, and define kmalloc and kmem_cache_alloc using TRACE_EVENT() macro. And then this patch does: - Do not pass pointer to struct kmem_cache to trace_kmalloc. gfp flag is enough to know if it's accounted or not. - Avoid dereferencing s->object_size and s->size when not using kmem_cache_alloc event. - Avoid dereferencing s->name in when not using kmem_cache_free event. - Adjust s->size to SLOB_UNITS(s->size) * SLOB_UNIT in SLOB Cc: Vasily Averin <vasily.averin@linux.dev> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01mm/slab_common: unify NUMA and UMA version of tracepointsHyeonggon Yoo4-31/+25
Drop kmem_alloc event class, rename kmem_alloc_node to kmem_alloc, and remove _node postfix for NUMA version of tracepoints. This will break some tools that depend on {kmem_cache_alloc,kmalloc}_node, but at this point maintaining both kmem_alloc and kmem_alloc_node event classes does not makes sense at all. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01mm/sl[au]b: cleanup kmem_cache_alloc[_node]_trace()Hyeonggon Yoo3-62/+27
Despite its name, kmem_cache_alloc[_node]_trace() is hook for inlined kmalloc. So rename it to kmalloc[_node]_trace(). Move its implementation to slab_common.c by using __kmem_cache_alloc_node(), but keep CONFIG_TRACING=n varients to save a function call when CONFIG_TRACING=n. Use __assume_kmalloc_alignment for kmalloc[_node]_trace instead of __assume_slab_alignement. Generally kmalloc has larger alignment requirements. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01mm/sl[au]b: generalize kmalloc subsystemHyeonggon Yoo5-200/+107
Now everything in kmalloc subsystem can be generalized. Let's do it! Generalize __do_kmalloc_node(), __kmalloc_node_track_caller(), kfree(), __ksize(), __kmalloc(), __kmalloc_node() and move them to slab_common.c. In the meantime, rename kmalloc_large_node_notrace() to __kmalloc_large_node() and make it static as it's now only called in slab_common.c. [ feng.tang@intel.com: adjust kfence skip list to include __kmem_cache_free so that kfence kunit tests do not fail ] Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01mm/rmap: Fix anon_vma->degree ambiguity leading to double-reuseJann Horn1-13/+16
anon_vma->degree tracks the combined number of child anon_vmas and VMAs that use the anon_vma as their ->anon_vma. anon_vma_clone() then assumes that for any anon_vma attached to src->anon_vma_chain other than src->anon_vma, it is impossible for it to be a leaf node of the VMA tree, meaning that for such VMAs ->degree is elevated by 1 because of a child anon_vma, meaning that if ->degree equals 1 there are no VMAs that use the anon_vma as their ->anon_vma. This assumption is wrong because the ->degree optimization leads to leaf nodes being abandoned on anon_vma_clone() - an existing anon_vma is reused and no new parent-child relationship is created. So it is possible to reuse an anon_vma for one VMA while it is still tied to another VMA. This is an issue because is_mergeable_anon_vma() and its callers assume that if two VMAs have the same ->anon_vma, the list of anon_vmas attached to the VMAs is guaranteed to be the same. When this assumption is violated, vma_merge() can merge pages into a VMA that is not attached to the corresponding anon_vma, leading to dangling page->mapping pointers that will be dereferenced during rmap walks. Fix it by separately tracking the number of child anon_vmas and the number of VMAs using the anon_vma as their ->anon_vma. Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Cc: stable@kernel.org Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-08-29cgroup: Fix build failure when CONFIG_SHRINKER_DEBUGTejun Heo1-1/+1
fa7e439cf90b ("cgroup: Homogenize cgroup_get_from_id() return value") broken build when CONFIG_SHRINKER_DEBUG by trying to return an errno from mem_cgroup_get_from_ino() which returns struct mem_cgroup *. Fix by using ERR_CAST() instead. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Michal Koutný <mkoutny@suse.com>f Fixes: fa7e439cf90b ("cgroup: Homogenize cgroup_get_from_id() return value")
2022-08-29mm/mprotect: only reference swap pfn page if type matchPeter Xu1-1/+2
Yu Zhao reported a bug after the commit "mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry" added a check in swp_offset_pfn() for swap type [1]: kernel BUG at include/linux/swapops.h:117! CPU: 46 PID: 5245 Comm: EventManager_De Tainted: G S O L 6.0.0-dbg-DEV #2 RIP: 0010:pfn_swap_entry_to_page+0x72/0xf0 Code: c6 48 8b 36 48 83 fe ff 74 53 48 01 d1 48 83 c1 08 48 8b 09 f6 c1 01 75 7b 66 90 48 89 c1 48 8b 09 f6 c1 01 74 74 5d c3 eb 9e <0f> 0b 48 ba ff ff ff ff 03 00 00 00 eb ae a9 ff 0f 00 00 75 13 48 RSP: 0018:ffffa59e73fabb80 EFLAGS: 00010282 RAX: 00000000ffffffe8 RBX: 0c00000000000000 RCX: ffffcd5440000000 RDX: 1ffffffffff7a80a RSI: 0000000000000000 RDI: 0c0000000000042b RBP: ffffa59e73fabb80 R08: ffff9965ca6e8bb8 R09: 0000000000000000 R10: ffffffffa5a2f62d R11: 0000030b372e9fff R12: ffff997b79db5738 R13: 000000000000042b R14: 0c0000000000042b R15: 1ffffffffff7a80a FS: 00007f549d1bb700(0000) GS:ffff99d3cf680000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000440d035b3180 CR3: 0000002243176004 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> change_pte_range+0x36e/0x880 change_p4d_range+0x2e8/0x670 change_protection_range+0x14e/0x2c0 mprotect_fixup+0x1ee/0x330 do_mprotect_pkey+0x34c/0x440 __x64_sys_mprotect+0x1d/0x30 It triggers because pfn_swap_entry_to_page() could be called upon e.g. a genuine swap entry. Fix it by only calling it when it's a write migration entry where the page* is used. [1] https://lore.kernel.org/lkml/CAOUHufaVC2Za-p8m0aiHw6YkheDcrO-C3wRGixwDS32VTS+k1w@mail.gmail.com/ Link: https://lkml.kernel.org/r/20220823221138.45602-1-peterx@redhat.com Fixes: 6c287605fd56 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Yu Zhao <yuzhao@google.com> Tested-by: Yu Zhao <yuzhao@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-29mm/damon/dbgfs: avoid duplicate context directory creationBadari Pulavarty1-0/+3
When user tries to create a DAMON context via the DAMON debugfs interface with a name of an already existing context, the context directory creation fails but a new context is created and added in the internal data structure, due to absence of the directory creation success check. As a result, memory could leak and DAMON cannot be turned on. An example test case is as below: # cd /sys/kernel/debug/damon/ # echo "off" > monitor_on # echo paddr > target_ids # echo "abc" > mk_context # echo "abc" > mk_context # echo $$ > abc/target_ids # echo "on" > monitor_on <<< fails Return value of 'debugfs_create_dir()' is expected to be ignored in general, but this is an exceptional case as DAMON feature is depending on the debugfs functionality and it has the potential duplicate name issue. This commit therefore fixes the issue by checking the directory creation failure and immediately return the error in the case. Link: https://lkml.kernel.org/r/20220821180853.2400-1-sj@kernel.org Fixes: 75c1c2b53c78 ("mm/damon/dbgfs: support multiple contexts") Signed-off-by: Badari Pulavarty <badari.pulavarty@intel.com> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [ 5.15.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-29bootmem: remove the vmemmap pages from kmemleak in put_page_bootmemLiu Shixin1-0/+2
The vmemmap pages is marked by kmemleak when allocated from memblock. Remove it from kmemleak when freeing the page. Otherwise, when we reuse the page, kmemleak may report such an error and then stop working. kmemleak: Cannot insert 0xffff98fb6eab3d40 into the object search tree (overlaps existing) kmemleak: Kernel memory leak detector disabled kmemleak: Object 0xffff98fb6be00000 (size 335544320): kmemleak: comm "swapper", pid 0, jiffies 4294892296 kmemleak: min_count = 0 kmemleak: count = 0 kmemleak: flags = 0x1 kmemleak: checksum = 0 kmemleak: backtrace: Link: https://lkml.kernel.org/r/20220819094005.2928241-1-liushixin2@huawei.com Fixes: f41f2ed43ca5 (mm: hugetlb: free the vmemmap pages associated with each HugeTLB page) Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-29mm/zsmalloc: do not attempt to free IS_ERR handleSergey Senozhatsky1-1/+1
zsmalloc() now returns ERR_PTR values as handles, which zram accidentally can pass to zs_free(). Another bad scenario is when zcomp_compress() fails - handle has default -ENOMEM value, and zs_free() will try to free that "pointer value". Add the missing check and make sure that zs_free() bails out when ERR_PTR() is passed to it. Link: https://lkml.kernel.org/r/20220816050906.2583956-1-senozhatsky@chromium.org Fixes: c7e6f17b52e9 ("zsmalloc: zs_malloc: return ERR_PTR on failure") Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org>, Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-29writeback: avoid use-after-free after removing deviceKhazhismel Kumykov2-6/+10
When a disk is removed, bdi_unregister gets called to stop further writeback and wait for associated delayed work to complete. However, wb_inode_writeback_end() may schedule bandwidth estimation dwork after this has completed, which can result in the timer attempting to access the just freed bdi_writeback. Fix this by checking if the bdi_writeback is alive, similar to when scheduling writeback work. Since this requires wb->work_lock, and wb_inode_writeback_end() may get called from interrupt, switch wb->work_lock to an irqsafe lock. Link: https://lkml.kernel.org/r/20220801155034.3772543-1-khazhy@google.com Fixes: 45a2966fd641 ("writeback: fix bandwidth estimate for spiky workload") Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Michael Stapelberg <stapelberg+linux@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-29shmem: update folio if shmem_replace_page() updates the pageMatthew Wilcox (Oracle)1-0/+1
If we allocate a new page, we need to make sure that our folio matches that new page. If we do end up in this code path, we store the wrong page in the shmem inode's page cache, and I would rather imagine that data corruption ensues. This will be solved by changing shmem_replace_page() to shmem_replace_folio(), but this is the minimal fix. Link: https://lkml.kernel.org/r/20220730042518.1264767-1-willy@infradead.org Fixes: da08e9b79323 ("mm/shmem: convert shmem_swapin_page() to shmem_swapin_folio()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-29mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pteMiaohe Lin1-1/+1
In MCOPY_ATOMIC_CONTINUE case with a non-shared VMA, pages in the page cache are installed in the ptes. But hugepage_add_new_anon_rmap is called for them mistakenly because they're not vm_shared. This will corrupt the page->mapping used by page cache code. Link: https://lkml.kernel.org/r/20220712130542.18836-1-linmiaohe@huawei.com Fixes: f619147104c8 ("userfaultfd: add UFFDIO_CONTINUE ioctl") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-26cgroup: Homogenize cgroup_get_from_id() return valueMichal Koutný1-2/+2
Cgroup id is user provided datum hence extend its return domain to include possible error reason (similar to cgroup_get_from_fd()). This change also fixes commit d4ccaf58a847 ("bpf: Introduce cgroup iter") that would use NULL instead of proper error handling in d4ccaf58a847 ("bpf: Introduce cgroup iter"). Additionally, neither of: fc_appid_store, bpf_iter_attach_cgroup, mem_cgroup_get_from_ino (callers of cgroup_get_from_fd) is built without CONFIG_CGROUPS (depends via CONFIG_BLK_CGROUP, direct, transitive CONFIG_MEMCG respectively) transitive, so drop the singular definition not needed with !CONFIG_CGROUPS. Fixes: d4ccaf58a847 ("bpf: Introduce cgroup iter") Signed-off-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-25mm/slub: move free_debug_processing() furtherVlastimil Babka1-57/+57
In the following patch, the function free_debug_processing() will be calling add_partial(), remove_partial() and discard_slab(), se move it below their definitions to avoid forward declarations. To make review easier, separate the move from functional changes. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com>
2022-08-24mm: add NR_SECONDARY_PAGETABLE to count secondary page table uses.Yosry Ahmed3-1/+7
We keep track of several kernel memory stats (total kernel memory, page tables, stack, vmalloc, etc) on multiple levels (global, per-node, per-memcg, etc). These stats give insights to users to how much memory is used by the kernel and for what purposes. Currently, memory used by KVM mmu is not accounted in any of those kernel memory stats. This patch series accounts the memory pages used by KVM for page tables in those stats in a new NR_SECONDARY_PAGETABLE stat. This stat can be later extended to account for other types of secondary pages tables (e.g. iommu page tables). KVM has a decent number of large allocations that aren't for page tables, but for most of them, the number/size of those allocations scales linearly with either the number of vCPUs or the amount of memory assigned to the VM. KVM's secondary page table allocations do not scale linearly, especially when nested virtualization is in use. From a KVM perspective, NR_SECONDARY_PAGETABLE will scale with KVM's per-VM pages_{4k,2m,1g} stats unless the guest is doing something bizarre (e.g. accessing only 4kb chunks of 2mb pages so that KVM is forced to allocate a large number of page tables even though the guest isn't accessing that much memory). However, someone would need to either understand how KVM works to make that connection, or know (or be told) to go look at KVM's stats if they're running VMs to better decipher the stats. Furthermore, having NR_PAGETABLE side-by-side with NR_SECONDARY_PAGETABLE is informative. For example, when backing a VM with THP vs. HugeTLB, NR_SECONDARY_PAGETABLE is roughly the same, but NR_PAGETABLE is an order of magnitude higher with THP. So having this stat will at the very least prove to be useful for understanding tradeoffs between VM backing types, and likely even steer folks towards potential optimizations. The original discussion with more details about the rationale: https://lore.kernel.org/all/87ilqoi77b.wl-maz@kernel.org This stat will be used by subsequent patches to count KVM mmu memory usage. Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20220823004639.2387269-2-yosryahmed@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2022-08-24mm/sl[au]b: introduce common alloc/free functions without tracepointHyeonggon Yoo3-7/+47
To unify kmalloc functions in later patch, introduce common alloc/free functions that does not have tracepoint. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/slab: kmalloc: pass requests larger than order-1 page to page allocatorHyeonggon Yoo4-44/+63
There is not much benefit for serving large objects in kmalloc(). Let's pass large requests to page allocator like SLUB for better maintenance of common code. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/slab_common: cleanup kmalloc_large()Hyeonggon Yoo1-22/+13
Now that kmalloc_large() and kmalloc_large_node() do mostly same job, make kmalloc_large() wrapper of kmalloc_large_node_notrace(). In the meantime, add missing flag fix code in kmalloc_large_node_notrace(). Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/slab_common: kmalloc_node: pass large requests to page allocatorHyeonggon Yoo3-2/+13
Now that kmalloc_large_node() is in common code, pass large requests to page allocator in kmalloc_node() using kmalloc_large_node(). One problem is that currently there is no tracepoint in kmalloc_large_node(). Instead of simply putting tracepoint in it, use kmalloc_large_node{,_notrace} depending on its caller to show useful address for both inlined kmalloc_node() and __kmalloc_node_track_caller() when large objects are allocated. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/slub: move kmalloc_large_node() to slab_common.cHyeonggon Yoo2-25/+22
In later patch SLAB will also pass requests larger than order-1 page to page allocator. Move kmalloc_large_node() to slab_common.c. Fold kmalloc_large_node_hook() into kmalloc_large_node() as there is no other caller. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/slab_common: fold kmalloc_order_trace() into kmalloc_large()Hyeonggon Yoo1-13/+4
There is no caller of kmalloc_order_trace() except kmalloc_large(). Fold it into kmalloc_large() and remove kmalloc_order{,_trace}(). Also add tracepoint in kmalloc_large() that was previously in kmalloc_order_trace(). Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/sl[au]b: factor out __do_kmalloc_node()Hyeonggon Yoo2-81/+20
__kmalloc(), __kmalloc_node(), __kmalloc_node_track_caller() mostly do same job. Factor out common code into __do_kmalloc_node(). Note that this patch also fixes missing kasan_kmalloc() in SLUB's __kmalloc_node_track_caller(). Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/slab_common: cleanup kmalloc_track_caller()Hyeonggon Yoo3-34/+0
Make kmalloc_track_caller() wrapper of kmalloc_node_track_caller(). Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/slab_common: remove CONFIG_NUMA ifdefs for common kmalloc functionsHyeonggon Yoo3-12/+1
Now that slab_alloc_node() is available for SLAB when CONFIG_NUMA=n, remove CONFIG_NUMA ifdefs for common kmalloc functions. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/slab: cleanup slab_alloc() and slab_alloc_node()Hyeonggon Yoo1-36/+13
Make slab_alloc_node() available even when CONFIG_NUMA=n and make slab_alloc() wrapper of slab_alloc_node(). This is necessary for further cleanup. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/slab: move NUMA-related code to __do_cache_alloc()Hyeonggon Yoo1-37/+31
To implement slab_alloc_node() independent of NUMA configuration, move NUMA fallback/alternate allocation code into __do_cache_alloc(). One functional change here is not to check availability of node when allocating from local node. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-23mm/slub: Remove the unneeded result variableye xingchen1-7/+2
Return the value from attribute->store(s, buf, len) and attribute->show(s, buf) directly instead of storing it in another redundant variable. Reported-by: Zeal Robot <zealci@zte.com.cn> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-23mm/slab_common: Remove the unneeded result variableye xingchen1-5/+1
Return the value from __kmem_cache_shrink() directly instead of storing it in another redundant variable. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-21mm/shmem: shmem_replace_page() remember NR_SHMEMHugh Dickins1-0/+2
Elsewhere, NR_SHMEM is updated at the same time as shmem NR_FILE_PAGES; but shmem_replace_page() was forgetting to do that - so NR_SHMEM stats could grow too big or too small, in those unusual cases when it's used. Link: https://lkml.kernel.org/r/cec7c09d-5874-e160-ada6-6e10ee48784@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Radoslaw Burny <rburny@google.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-21mm/shmem: tmpfs fallocate use file_modified()Hugh Dickins1-1/+2
5.18 fixed the btrfs and ext4 fallocates to use file_modified(), as xfs was already doing, to drop privileges: and fstests generic/{683,684,688} expect this. There's no need to argue over keep-size allocation (which could just update ctime): fix shmem_fallocate() to behave the same way. Link: https://lkml.kernel.org/r/39c5e62-4896-7795-c0a0-f79c50d4909@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Radoslaw Burny <rburny@google.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-21mm/shmem: fix chattr fsflags support in tmpfsHugh Dickins1-23/+31
ext[234] have always allowed unimplemented chattr flags to be set, but other filesystems have tended to be stricter. Follow the stricter approach for tmpfs: I don't want to have to explain why csu attributes don't actually work, and we won't need to update the chattr(1) manpage; and it's never wrong to start off strict, relaxing later if persuaded. Allow only a (append only) i (immutable) A (no atime) and d (no dump). Although lsattr showed 'A' inherited, the NOATIME behavior was not being inherited: because nothing sync'ed FS_NOATIME_FL to S_NOATIME. Add shmem_set_inode_flags() to sync the flags, using inode_set_flags() to avoid that instant of lost immutablility during fileattr_set(). But that change switched generic/079 from passing to failing: because FS_IMMUTABLE_FL and FS_APPEND_FL had been unconventionally included in the INHERITED fsflags: remove them and generic/079 is back to passing. Link: https://lkml.kernel.org/r/2961dcb0-ddf3-b9f0-3268-12a4ff996856@google.com Fixes: e408e695f5f1 ("mm/shmem: support FS_IOC_[SG]ETFLAGS in tmpfs") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Radoslaw Burny <rburny@google.com> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-21mm/hugetlb: support write-faults in shared mappingsDavid Hildenbrand1-7/+19
If we ever get a write-fault on a write-protected page in a shared mapping, we'd be in trouble (again). Instead, we can simply map the page writable. And in fact, there is even a way right now to trigger that code via uffd-wp ever since we stared to support it for shmem in 5.19: -------------------------------------------------------------------------- #include <stdio.h> #include <stdlib.h> #include <string.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #include <sys/mman.h> #include <sys/syscall.h> #include <sys/ioctl.h> #include <linux/userfaultfd.h> #define HUGETLB_SIZE (2 * 1024 * 1024u) static char *map; int uffd; static int temp_setup_uffd(void) { struct uffdio_api uffdio_api; struct uffdio_register uffdio_register; struct uffdio_writeprotect uffd_writeprotect; struct uffdio_range uffd_range; uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY); if (uffd < 0) { fprintf(stderr, "syscall() failed: %d\n", errno); return -errno; } uffdio_api.api = UFFD_API; uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP; if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) { fprintf(stderr, "UFFDIO_API failed: %d\n", errno); return -errno; } if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) { fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n"); return -ENOSYS; } /* Register UFFD-WP */ uffdio_register.range.start = (unsigned long) map; uffdio_register.range.len = HUGETLB_SIZE; uffdio_register.mode = UFFDIO_REGISTER_MODE_WP; if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) { fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno); return -errno; } /* Writeprotect a single page. */ uffd_writeprotect.range.start = (unsigned long) map; uffd_writeprotect.range.len = HUGETLB_SIZE; uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP; if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) { fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno); return -errno; } /* Unregister UFFD-WP without prior writeunprotection. */ uffd_range.start = (unsigned long) map; uffd_range.len = HUGETLB_SIZE; if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) { fprintf(stderr, "UFFDIO_UNREGISTER failed: %d\n", errno); return -errno; } return 0; } int main(int argc, char **argv) { int fd; fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT); if (!fd) { fprintf(stderr, "open() failed\n"); return -errno; } if (ftruncate(fd, HUGETLB_SIZE)) { fprintf(stderr, "ftruncate() failed\n"); return -errno; } map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (map == MAP_FAILED) { fprintf(stderr, "mmap() failed\n"); return -errno; } *map = 0; if (temp_setup_uffd()) return 1; *map = 0; return 0; } -------------------------------------------------------------------------- Above test fails with SIGBUS when there is only a single free hugetlb page. # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # ./test Bus error (core dumped) And worse, with sufficient free hugetlb pages it will map an anonymous page into a shared mapping, for example, messing up accounting during unmap and breaking MAP_SHARED semantics: # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # ./test # cat /proc/meminfo | grep HugePages_ HugePages_Total: 2 HugePages_Free: 1 HugePages_Rsvd: 18446744073709551615 HugePages_Surp: 0 Reason is that uffd-wp doesn't clear the uffd-wp PTE bit when unregistering and consequently keeps the PTE writeprotected. Reason for this is to avoid the additional overhead when unregistering. Note that this is the case also for !hugetlb and that we will end up with writable PTEs that still have the uffd-wp PTE bit set once we return from hugetlb_wp(). I'm not touching the uffd-wp PTE bit for now, because it seems to be a generic thing -- wp_page_reuse() also doesn't clear it. VM_MAYSHARE handling in hugetlb_fault() for FAULT_FLAG_WRITE indicates that MAP_SHARED handling was at least envisioned, but could never have worked as expected. While at it, make sure that we never end up in hugetlb_wp() on write faults without VM_WRITE, because we don't support maybe_mkwrite() semantics as commonly used in the !hugetlb case -- for example, in wp_page_reuse(). Note that there is no need to do any kind of reservation in hugetlb_fault() in this case ... because we already have a hugetlb page mapped R/O that we will simply map writable and we are not dealing with COW/unsharing. Link: https://lkml.kernel.org/r/20220811103435.188481-3-david@redhat.com Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Peter Feiner <pfeiner@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> [5.19] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-21mm/hugetlb: fix hugetlb not supporting softdirty trackingDavid Hildenbrand1-2/+5
Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2. I observed that hugetlb does not support/expect write-faults in shared mappings that would have to map the R/O-mapped page writable -- and I found two case where we could currently get such faults and would erroneously map an anon page into a shared mapping. Reproducers part of the patches. I propose to backport both fixes to stable trees. The first fix needs a small adjustment. This patch (of 2): Staring at hugetlb_wp(), one might wonder where all the logic for shared mappings is when stumbling over a write-protected page in a shared mapping. In fact, there is none, and so far we thought we could get away with that because e.g., mprotect() should always do the right thing and map all pages directly writable. Looks like we were wrong: -------------------------------------------------------------------------- #include <stdio.h> #include <stdlib.h> #include <string.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #include <sys/mman.h> #define HUGETLB_SIZE (2 * 1024 * 1024u) static void clear_softdirty(void) { int fd = open("/proc/self/clear_refs", O_WRONLY); const char *ctrl = "4"; int ret; if (fd < 0) { fprintf(stderr, "open(clear_refs) failed\n"); exit(1); } ret = write(fd, ctrl, strlen(ctrl)); if (ret != strlen(ctrl)) { fprintf(stderr, "write(clear_refs) failed\n"); exit(1); } close(fd); } int main(int argc, char **argv) { char *map; int fd; fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT); if (!fd) { fprintf(stderr, "open() failed\n"); return -errno; } if (ftruncate(fd, HUGETLB_SIZE)) { fprintf(stderr, "ftruncate() failed\n"); return -errno; } map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (map == MAP_FAILED) { fprintf(stderr, "mmap() failed\n"); return -errno; } *map = 0; if (mprotect(map, HUGETLB_SIZE, PROT_READ)) { fprintf(stderr, "mmprotect() failed\n"); return -errno; } clear_softdirty(); if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) { fprintf(stderr, "mmprotect() failed\n"); return -errno; } *map = 0; return 0; } -------------------------------------------------------------------------- Above test fails with SIGBUS when there is only a single free hugetlb page. # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # ./test Bus error (core dumped) And worse, with sufficient free hugetlb pages it will map an anonymous page into a shared mapping, for example, messing up accounting during unmap and breaking MAP_SHARED semantics: # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # ./test # cat /proc/meminfo | grep HugePages_ HugePages_Total: 2 HugePages_Free: 1 HugePages_Rsvd: 18446744073709551615 HugePages_Surp: 0 Reason in this particular case is that vma_wants_writenotify() will return "true", removing VM_SHARED in vma_set_page_prot() to map pages write-protected. Let's teach vma_wants_writenotify() that hugetlb does not support softdirty tracking. Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com Fixes: 64e455079e1b ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared") Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Feiner <pfeiner@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Jamie Liu <jamieliu@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> [3.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-21mm/uffd: reset write protection when unregister with wp-modePeter Xu1-11/+18
The motivation of this patch comes from a recent report and patchfix from David Hildenbrand on hugetlb shared handling of wr-protected page [1]. With the reproducer provided in commit message of [1], one can leverage the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which can affect not only the attacker process, but also the whole system. The lazy-reset mechanism of uffd-wp was used to make unregister faster, meanwhile it has an assumption that any leftover pgtable entries should only affect the process on its own, so not only the user should be aware of anything it does, but also it should not affect outside of the process. But it seems that this is not true, and it can also be utilized to make some exploit easier. So far there's no clue showing that the lazy-reset is important to any userfaultfd users because normally the unregister will only happen once for a specific range of memory of the lifecycle of the process. Considering all above, what this patch proposes is to do explicit pte resets when unregister an uffd region with wr-protect mode enabled. It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false) right before ioctl(UFFDIO_UNREGISTER) for the user. So potentially it'll make the unregister slower. From that pov it's a very slight abi change, but hopefully nothing should break with this change either. Regarding to the change itself - core of uffd write [un]protect operation is moved into a separate function (uffd_wp_range()) and it is reused in the unregister code path. Note that the new function will not check for anything, e.g. ranges or memory types, because they should have been checked during the previous UFFDIO_REGISTER or it should have failed already. It also doesn't check mmap_changing because we're with mmap write lock held anyway. I added a Fixes upon introducing of uffd-wp shmem+hugetlbfs because that's the only issue reported so far and that's the commit David's reproducer will start working (v5.19+). But the whole idea actually applies to not only file memories but also anonymous. It's just that we don't need to fix anonymous prior to v5.19- because there's no known way to exploit. IOW, this patch can also fix the issue reported in [1] as the patch 2 does. [1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/ Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com Fixes: b1f9e876862d ("mm/uffd: enable write protection for shmem & hugetlbfs") Signed-off-by: Peter Xu <peterx@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-21mm: add DEVICE_ZONE to FOR_ALL_ZONESHao Lee1-1/+8
FOR_ALL_ZONES should be consistent with enum zone_type. Otherwise, __count_zid_vm_events have the potential to add count to wrong item when zid is ZONE_DEVICE. Link: https://lkml.kernel.org/r/20220807154442.GA18167@haolee.io Signed-off-by: Hao Lee <haolee.swjtu@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-21mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COWDavid Hildenbrand2-43/+89
Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know that FOLL_FORCE can be possibly dangerous, especially if there are races that can be exploited by user space. Right now, it would be sufficient to have some code that sets a PTE of a R/O-mapped shared page dirty, in order for it to erroneously become writable by FOLL_FORCE. The implications of setting a write-protected PTE dirty might not be immediately obvious to everyone. And in fact ever since commit 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map a shmem page R/O while marking the pte dirty. This can be used by unprivileged user space to modify tmpfs/shmem file content even if the user does not have write permissions to the file, and to bypass memfd write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590). To fix such security issues for good, the insight is that we really only need that fancy retry logic (FOLL_COW) for COW mappings that are not writable (!VM_WRITE). And in a COW mapping, we really only broke COW if we have an exclusive anonymous page mapped. If we have something else mapped, or the mapped anonymous page might be shared (!PageAnonExclusive), we have to trigger a write fault to break COW. If we don't find an exclusive anonymous page when we retry, we have to trigger COW breaking once again because something intervened. Let's move away from this mandatory-retry + dirty handling and rely on our PageAnonExclusive() flag for making a similar decision, to use the same COW logic as in other kernel parts here as well. In case we stumble over a PTE in a COW mapping that does not map an exclusive anonymous page, COW was not properly broken and we have to trigger a fake write-fault to break COW. Just like we do in can_change_pte_writable() added via commit 64fe24a3e05e ("mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection") and commit 76aefad628aa ("mm/mprotect: fix soft-dirty check in can_change_pte_writable()"), take care of softdirty and uffd-wp manually. For example, a write() via /proc/self/mem to a uffd-wp-protected range has to fail instead of silently granting write access and bypassing the userspace fault handler. Note that FOLL_FORCE is not only used for debug access, but also triggered by applications without debug intentions, for example, when pinning pages via RDMA. This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR. Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So let's just get rid of it. Thanks to Nadav Amit for pointing out that the pte_dirty() check in FOLL_FORCE code is problematic and might be exploitable. Note 1: We don't check for the PTE being dirty because it doesn't matter for making a "was COWed" decision anymore, and whoever modifies the page has to set the page dirty either way. Note 2: Kernels before extended uffd-wp support and before PageAnonExclusive (< 5.19) can simply revert the problematic commit instead and be safe regarding UFFDIO_CONTINUE. A backport to v5.19 requires minor adjustments due to lack of vma_soft_dirty_enabled(). Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com Fixes: 9ae0f87d009c ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte") Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Xu <peterx@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: <stable@vger.kernel.org> [5.16] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-10Merge tag 'mm-stable-2022-08-09' of ↵Linus Torvalds5-601/+684
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull remaining MM updates from Andrew Morton: "Three patch series - two that perform cleanups and one feature: - hugetlb_vmemmap cleanups from Muchun Song - hardware poisoning support for 1GB hugepages, from Naoya Horiguchi - highmem documentation fixups from Fabio De Francesco" * tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits) Documentation/mm: add details about kmap_local_page() and preemption highmem: delete a sentence from kmap_local_page() kdocs Documentation/mm: rrefer kmap_local_page() and avoid kmap() Documentation/mm: avoid invalid use of addresses from kmap_local_page() Documentation/mm: don't kmap*() pages which can't come from HIGHMEM highmem: specify that kmap_local_page() is callable from interrupts highmem: remove unneeded spaces in kmap_local_page() kdocs mm, hwpoison: enable memory error handling on 1GB hugepage mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage mm, hwpoison: make __page_handle_poison returns int mm, hwpoison: set PG_hwpoison for busy hugetlb pages mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage mm, hwpoison, hugetlb: support saving mechanism of raw error pages mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages() mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE mm: hugetlb_vmemmap: move code comments to vmemmap_dedup.rst mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability mm: hugetlb_vmemmap: replace early_param() with core_param() mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c ...
2022-08-10Merge tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxlLinus Torvalds1-0/+5
Pull cxl updates from Dan Williams: "Compute Express Link (CXL) updates for 6.0: - Introduce a 'struct cxl_region' object with support for provisioning and assembling persistent memory regions. - Introduce alloc_free_mem_region() to accompany the existing request_free_mem_region() as a method to allocate physical memory capacity out of an existing resource. - Export insert_resource_expand_to_fit() for the CXL subsystem to late-publish CXL platform windows in iomem_resource. - Add a polled mode PCI DOE (Data Object Exchange) driver service and use it in cxl_pci to retrieve the CDAT (Coherent Device Attribute Table)" * tag 'cxl-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (74 commits) cxl/hdm: Fix skip allocations vs multiple pmem allocations cxl/region: Disallow region granularity != window granularity cxl/region: Fix x1 interleave to greater than x1 interleave routing cxl/region: Move HPA setup to cxl_region_attach() cxl/region: Fix decoder interleave programming Documentation: cxl: remove dangling kernel-doc reference cxl/region: describe targets and nr_targets members of cxl_region_params cxl/regions: add padding for cxl_rr_ep_add nested lists cxl/region: Fix IS_ERR() vs NULL check cxl/region: Fix region reference target accounting cxl/region: Fix region commit uninitialized variable warning cxl/region: Fix port setup uninitialized variable warnings cxl/region: Stop initializing interleave granularity cxl/hdm: Fix DPA reservation vs cxl_endpoint_decoder lifetime cxl/acpi: Minimize granularity for x1 interleaves cxl/region: Delete 'region' attribute from root decoders cxl/acpi: Autoload driver for 'cxl_acpi' test devices cxl/region: decrement ->nr_targets on error in cxl_region_attach() cxl/region: prevent underflow in ways_to_cxl() cxl/region: uninitialized variable in alloc_hpa() ...
2022-08-09Merge tag 'memblock-v5.20-rc1' of ↵Linus Torvalds1-0/+11
git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock updates from Mike Rapoport: - An optimization in memblock_add_range() to reduce array traversals - Improvements to the memblock test suite * tag 'memblock-v5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: memblock test: Modify the obsolete description in README memblock tests: fix compilation errors memblock tests: change build options to run-time options memblock tests: remove completed TODO items memblock tests: set memblock_debug to enable memblock_dbg() messages memblock tests: add verbose output to memblock tests memblock tests: Makefile: add arguments to control verbosity memblock: avoid some repeat when add new range
2022-08-09Merge tag 'pull-work.iov_iter-rebased' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more iov_iter updates from Al Viro: - more new_sync_{read,write}() speedups - ITER_UBUF introduction - ITER_PIPE cleanups - unification of iov_iter_get_pages/iov_iter_get_pages_alloc and switching them to advancing semantics - making ITER_PIPE take high-order pages without splitting them - handling copy_page_from_iter() for high-order pages properly * tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits) fix copy_page_from_iter() for compound destinations hugetlbfs: copy_page_to_iter() can deal with compound pages copy_page_to_iter(): don't split high-order page in case of ITER_PIPE expand those iov_iter_advance()... pipe_get_pages(): switch to append_pipe() get rid of non-advancing variants ceph: switch the last caller of iov_iter_get_pages_alloc() 9p: convert to advancing variant of iov_iter_get_pages_alloc() af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages() iter_to_pipe(): switch to advancing variant of iov_iter_get_pages() block: convert to advancing variants of iov_iter_get_pages{,_alloc}() iov_iter: advancing variants of iov_iter_get_pages{,_alloc}() iov_iter: saner helper for page array allocation fold __pipe_get_pages() into pipe_get_pages() ITER_XARRAY: don't open-code DIV_ROUND_UP() unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts unify xarray_get_pages() and xarray_get_pages_alloc() unify pipe_get_pages() and pipe_get_pages_alloc() iov_iter_get_pages(): sanity-check arguments iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper ...