kernel/linux.git/mm, branch v6.12.94

mm/hugetlb: avoid false positive lockdep assertion

2026-06-19T11:42:37+00:00

[ Upstream commit b4aea43cd37afad714b5684fe9fdfcb0e78dba26 ] Commit 081056dc00a2 ("mm/hugetlb: unshare page tables during VMA split, not before") changed the locking model around hugetlbfs PMD unsharing on VMA split, but did not update the function which asserts the locks, hugetlb_vma_assert_locked(). This function asserts that either the hugetlb VMA lock is held (if a shared mapping) or that the reservation map lock is held (if private). If you get an unfortunate race between something which results in one of these locks being released and a hugetlb VMA split and you have CONFIG_LOCKDEP enabled, you can therefore see a false positive assertion arise when there is in fact no issue. Since this change introduced a new take_locks parameter to hugetlb_unshare_pmds(), which, when set to false, indicates that locking is sufficient, simply pass this to the unsharing logic and predicate the lock assertions on this. This is safe, as we already asserted the file rmap lock and the VMA write lock prior to this (implying exclusive mmap write lock), so we cannot be raced by either rmap or page fault page table walkers which the asserted locks are intended to protect against (we don't mind GUP-fast). Separate out huge_pmd_unshare() into __huge_pmd_unshare() to add a check_locks parameter, and update hugetlb_unshare_pmds() to pass this parameter to it. This leaves all other callers of huge_pmd_unshare() still correctly asserting the locks. The below reproducer will trigger the assert in a kernel with CONFIG_LOCKDEP enabled by racing process teardown (which will release the hugetlb lock) against a hugetlb split. void execute_one(void) { void *ptr; pid_t pid; /* * Create a hugetlb mapping spanning a PUD entry. * * We force the hugetlb page allocation with populate and * noreserve. * * |---------------------| * | | * |---------------------| * 0 PUD boundary */ ptr = mmap(0, PUD_SIZE, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_SHARED | MAP_ANON | MAP_NORESERVE | MAP_HUGETLB | MAP_POPULATE, -1, 0); if (ptr == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); } /* * Fork but with a bogus stack pointer so we try to execute code in * a non-VM_EXEC VMA, causing segfault + teardown via exit_mmap(). * * The clone will cause PMD page table sharing between the * processes first via: * copy_process() -> ... -> huge_pte_alloc() -> huge_pmd_share() * * Then tear down and release the hugetlb 'VMA' lock via: * exit_mmap() -> ... -> vma_close() -> hugetlb_vma_lock_free() */ pid = syscall(__NR_clone, 0, 2 * PMD_SIZE, 0, 0, 0); if (pid < 0) { perror("clone"); exit(EXIT_FAILURE); } if (pid == 0) { /* Pop stack... */ return; } /* * We are the parent process. * * Race the child process's teardown with a PMD unshare. * * We do this by triggering: * * __split_vma() -> hugetlb_split() -> hugetlb_unshare_pmds() * * Which, importantly, doesn't hold the hugetlb VMA lock (nor can * it), meaning we assert in hugetlb_vma_assert_locked(). * * . * |----------.----------| * | . | * |----------.----------| * 0 . PUD boundary */ mmap(0, PUD_SIZE / 2, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0); } int main(void) { int i; /* Kick off fork children. */ for (i = 0; i < NUM_FORKS; i++) { pid_t pid = fork(); if (pid < 0) { perror("fork"); exit(EXIT_FAILURE); } /* Fork children do their work and exit. */ if (!pid) { int j; for (j = 0; j < NUM_ITERS; j++) execute_one(); return EXIT_SUCCESS; } } /* If we succeeded, wait on children. */ for (i = 0; i < NUM_FORKS; i++) wait(NULL); return EXIT_SUCCESS; } [ljs@kernel.org: account for the !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING case] Link: https://lore.kernel.org/agWZsPGYid08uU6O@lucifer Link: https://lore.kernel.org/20260513085658.45264-1-ljs@kernel.org Fixes: 081056dc00a2 ("mm/hugetlb: unshare page tables during VMA split, not before") Signed-off-by: Lorenzo Stoakes Acked-by: David Hildenbrand (Arm) Acked-by: Oscar Salvador Cc: Jann Horn Cc: Muchun Song Cc: Signed-off-by: Andrew Morton Signed-off-by: Lorenzo Stoakes Signed-off-by: Sasha Levin

mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison

2026-06-19T11:42:37+00:00

[ Upstream commit 3c2d42b8ee345b17a4ba56b0f6492d1ff4c1178e ] Two concurrent madvise(MADV_HWPOISON) calls on the same hugetlb page can trigger a recursive spinlock self-deadlock (AA deadlock) on hugetlb_lock when racing with a concurrent unmap: thread#0 thread#1 -------- -------- madvise(folio, MADV_HWPOISON) -> poisons the folio successfully madvise(folio, MADV_HWPOISON) unmap(folio) try_memory_failure_hugetlb get_huge_page_for_hwpoison spin_lock_irq(&hugetlb_lock) <- held __get_huge_page_for_hwpoison hugetlb_update_hwpoison() -> MF_HUGETLB_FOLIO_PRE_POISONED goto out: folio_put() refcount: 1 -> 0 free_huge_folio() spin_lock_irqsave(&hugetlb_lock) -> AA DEADLOCK! The out: path in __get_huge_page_for_hwpoison() calls folio_put() to drop the GUP reference while the hugetlb_lock is still held by the hugetlb.c wrapper get_huge_page_for_hwpoison(). If concurrent unmap has released the page table mapping reference, folio_put() drops the folio refcount to zero, triggering free_huge_folio() which attempts to re-acquire the non-recursive hugetlb_lock. Fix this by moving hugetlb_lock acquisition from the hugetlb.c wrapper into get_huge_page_for_hwpoison(). Place spin_unlock_irq() before the folio_put() at the out: label so the folio is always released outside the lock. [akpm@linux-foundation.org: fix race, rename label per Miaohe] Link: https://sashiko.dev/#/patchset/20260522010305.4099834-1-mawupeng1@huawei.com Link: https://lore.kernel.org/f39f405e-4b4b-8f79-70fe-a2b5b62114eb@huawei.com Link: https://lore.kernel.org/20260522010305.4099834-1-mawupeng1@huawei.com Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()") Signed-off-by: Wupeng Ma Acked-by: Oscar Salvador (SUSE) Acked-by: Muchun Song Reviewed-by: Kefeng Wang Acked-by: Miaohe Lin Cc: David Hildenbrand Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: Michal Hocko Cc: Mike Rapoport Cc: Naoya Horiguchi Cc: Suren Baghdasaryan Cc: Vlastimil Babka Cc: Signed-off-by: Andrew Morton Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

mm/hugetlb: restore reservation on error in hugetlb folio copy paths

2026-06-19T11:42:33+00:00

commit 40c81856e622a9dc59294a90d169ac07ea25b0b0 upstream. Two sites in mm/hugetlb.c allocate a hugetlb folio via alloc_hugetlb_folio() (consuming a VMA reservation) and then call copy_user_large_folio(), which became int-returning in commit 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults") and can now fail (e.g. -EHWPOISON on a hwpoisoned source page). On the failure path, folio_put() restores the global hugetlb pool count through free_huge_folio(), but the per-VMA reservation map entry is left marked consumed: - hugetlb_mfill_atomic_pte() resubmission path (UFFDIO_COPY) - copy_hugetlb_page_range() fork-time CoW path when hugetlb_try_dup_anon_rmap() fails (rare: pinned hugetlb anon folio under fork) User-visible effect: on UFFDIO_COPY into a private hugetlb VMA where the resubmission copy fails, the reservation for that address is leaked from the VMA's reserve map. A subsequent fault at the same address takes the no-reservation path, and under hugetlb pool pressure the task is SIGBUSed at an address it had previously reserved. The fork-time CoW path leaks the same way in the child VMA's reserve map, though it requires the much rarer combination of pinned hugetlb anon page + hwpoisoned source. Add the missing restore_reserve_on_error() call before folio_put() on both error paths. Link: https://lore.kernel.org/20260520044912.6751-1-devnexen@gmail.com Fixes: 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults") Signed-off-by: David Carlier Reviewed-by: Muchun Song Cc: David Hildenbrand Cc: Mina Almasry Cc: Muchun Song Cc: Oscar Salvador Cc: yuehaibing Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

mm/damon/ops-common: call folio_test_lru() after folio_get()

2026-06-19T11:42:29+00:00

commit d6b8b02a27b3dd09ec12144322b3dac46d9bc9ef upstream. damon_get_folio() speculatively calls folio_test_lru() before folio_try_get(). The folio can get freed and reallocated to a tail page. In the case, VM_BUG_ON_PGFLAGS() in const_folio_flags() can be triggered. Remove the speculative call. Also mark folio_test_lru() check right after folio_try_get() success as no more unlikely. The race should be rare. Also the problem can happen only if the kernel has enabled CONFIG_DEBUG_VM_PGFLAGS. No real world report of this issue has been made so far. This fix is based on only theoretical analysis. That said, a bug is a bug. A similar issue was also fixed via commit 3203b3ab0fcf ("mm/filemap: don't call folio_test_locked() without a reference in next_uptodate_folio()"). I don't expect this change will make a meaningful impact to DAMON performance in the real world, though I will be happy to be corrected from the real world reports. The issue was discovered [1] by Sashiko. Link: https://lore.kernel.org/20260525162256.8317-1-sj@kernel.org Link: https://lore.kernel.org/20260517234112.89245-1-sj@kernel.org [1] Fixes: 3f49584b262c ("mm/damon: implement primitives for the virtual memory address spaces") Signed-off-by: SeongJae Park Cc: Fernand Sieber Cc: Leonard Foerster Cc: Shakeel Butt Cc: # 5.15.x Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

mm/huge_memory: update file PMD counter before folio_put()

2026-06-19T11:42:29+00:00

commit 8d878059924f12c1bc24556a92ec56add74de3c8 upstream. __split_huge_pmd_locked() updates the file/shmem RSS counter after dropping the PMD mapping's folio reference. If folio_put() drops the last reference, mm_counter_file() can later read freed folio state via folio_test_swapbacked(). Move the counter update before folio_put(). Link: https://lore.kernel.org/20260526101337.1984081-1-yintirui@huawei.com Fixes: fadae2953072 ("thp: use mm_file_counter to determine update which rss counter") Signed-off-by: Yin Tirui Reviewed-by: Lorenzo Stoakes Acked-by: David Hildenbrand (arm) Reviewed-by: Lance Yang Reviewed-by: Dev Jain Cc: Baolin Wang Cc: Barry Song Cc: Chen Jun Cc: Kefeng Wang Cc: Liam R. Howlett Cc: Nico Pache Cc: Ryan Roberts Cc: Vlastimil Babka Cc: Yang Shi Cc: Zi Yan Cc: Signed-off-by: Andrew Morton Signed-off-by: Greg Kroah-Hartman

writeback: Avoid contention on wb->list_lock when switching inodes

2026-06-19T11:42:26+00:00

[ Upstream commit e1b849cfa6b61f1c866a908c9e8dd9b5aaab820b ] There can be multiple inode switch works that are trying to switch inodes to / from the same wb. This can happen in particular if some cgroup exits which owns many (thousands) inodes and we need to switch them all. In this case several inode_switch_wbs_work_fn() instances will be just spinning on the same wb->list_lock while only one of them makes forward progress. This wastes CPU cycles and quickly leads to softlockup reports and unusable system. Instead of running several inode_switch_wbs_work_fn() instances in parallel switching to the same wb and contending on wb->list_lock, run just one work item per wb and manage a queue of isw items switching to this wb. Acked-by: Tejun Heo Signed-off-by: Jan Kara Signed-off-by: Sasha Levin

memfd: deny writeable mappings when implying SEAL_WRITE

2026-06-09T10:26:05+00:00

[ Upstream commit 3b041514cb6eae45869b020f743c14d983363222 ] When SEAL_EXEC is added, SEAL_WRITE is implied to make W^X. But the implied seal is set after the check that makes sure the memfd can not have any writable mappings. This means one can use SEAL_EXEC to apply SEAL_WRITE while having writeable mappings. This breaks the contract that SEAL_WRITE provides and can be used by an attacker to pass a memfd that appears to be write sealed but can still be modified arbitrarily. Fix this by adding the implied seals before the call for mapping_deny_writable() is done. Link: https://lore.kernel.org/20260505133922.797635-1-pratyush@kernel.org Fixes: c4f75bc8bd6b ("mm/memfd: add write seals when apply SEAL_EXEC to executable memfd") Signed-off-by: Pratyush Yadav (Google) Reviewed-by: Pasha Tatashin Acked-by: Jeff Xu Cc: Baolin Wang Cc: Brendan Jackman Cc: Greg Thelen Cc: Hugh Dickins Cc: Kees Cook Cc: "David Hildenbrand (Arm)" Cc: Signed-off-by: Andrew Morton Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

mm/memfd: fix spelling and grammatical issues

2026-06-09T10:26:05+00:00

[ Upstream commit 33c9b01ed2fcbc101cdfeb497f4581e981e7c1e7 ] The comment "If a private mapping then writability is irrelevant" contains a typo. It should be "If a private mapping then writability is irrelevant". The comment "SEAL_EXEC implys SEAL_WRITE, making W^X from the start." contains a typo. It should be "SEAL_EXEC implies SEAL_WRITE, making W^X from the start." Link: https://lkml.kernel.org/r/20250206060958.98010-1-liuye@kylinos.cn Signed-off-by: Liu Ye Signed-off-by: Andrew Morton Stable-dep-of: 3b041514cb6e ("memfd: deny writeable mappings when implying SEAL_WRITE") Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

mm: perform all memfd seal checks in a single place

2026-06-09T10:26:05+00:00

[ Upstream commit fa00b8ef1803fe133b4897c25227aa0d298dd093 ] We no longer actually need to perform these checks in the f_op->mmap() hook any longer. We already moved the operation which clears VM_MAYWRITE on a read-only mapping of a write-sealed memfd in order to work around the restrictions imposed by commit 5de195060b2e ("mm: resolve faulty mmap_region() error path behaviour"). There is no reason for us not to simply go ahead and additionally check to see if any pre-existing seals are in place here rather than defer this to the f_op->mmap() hook. By doing this we remove more logic from shmem_mmap() which doesn't belong there, as well as doing the same for hugetlbfs_file_mmap(). We also remove dubious shared logic in mm.h which simply does not belong there either. It makes sense to do these checks at the earliest opportunity, we know these are shmem (or hugetlbfs) mappings whose relevant VMA flags will not change from the invoking do_mmap() so there is simply no need to wait. This also means the implementation of further memfd seal flags can be done within mm/memfd.c and also have the opportunity to modify VMA flags as necessary early in the mapping logic. [lorenzo.stoakes@oracle.com: fix typos in !memfd inline stub] Link: https://lkml.kernel.org/r/7dee6c5d-480b-4c24-b98e-6fa47dbd8a23@lucifer.local Link: https://lkml.kernel.org/r/20241206212846.210835-1-lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes Tested-by: Isaac J. Manjarres Cc: Hugh Dickins Cc: Jann Horn Cc: Kalesh Singh Cc: Liam R. Howlett Cc: Muchun Song Cc: Vlastimil Babka Cc: Jeff Xu Signed-off-by: Andrew Morton Stable-dep-of: 3b041514cb6e ("memfd: deny writeable mappings when implying SEAL_WRITE") Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman

mm/damon/sysfs-schemes: delete tried region in regions_rmdirs()

2026-06-09T10:26:04+00:00

[ Upstream commit 441f92f7d386b85bad16de49db95a307cba048a2 ] DAMON sysfs maintains the DAMOS tried region directory objects via a linked list. When the user requests refresh of the directories, DAMON sysfs removes all the region directories first, and then generate updated regions directory on the empty space. The removal function (damon_sysfs_scheme_regions_rm_dirs()) only puts the kobj objects. Deletion of the container region object from the linked list is done inside the kobj release callback function. If somehow the callback invocation is delayed, the list will contain regions list that gonna be freed. If the updated region directories creation is started in this situation, the list can be corrupted and use-after-free can happen. Because the kobj objects are managed by only DAMON sysfs, the issue cannot happen in normal situation. But, such delays can be made on kernels that built with CONFIG_DEBUG_KOBJECT_RELEASE. On the kernel, the issue can indeed be reproduced like below. # damo start --damos_action stat # cd /sys/kernel/mm/damon/admin/kdamonds/0/ # for i in {1..10}; do echo update_schemes_tried_regions > state; done # dmesg | grep underflow [ 89.296152] refcount_t: underflow; use-after-free. Fix the issue by removing the region object from the list when decrementing the reference count. Also update damos_sysfs_populate_region_dir() to add the region object to the list only after the kobject_init_and_add() is success, so that fail of kobject_init_and_add() is not leaving the deallocated object on the list. The issue was discovered [1] by Sashiko. Link: https://lore.kernel.org/20260518152559.93038-1-sj@kernel.org Link: https://lore.kernel.org/20260513011920.119183-1-sj@kernel.org [1] Fixes: 9277d0367ba1 ("mm/damon/sysfs-schemes: implement scheme region directory") Signed-off-by: SeongJae Park Cc: # 6.2.x Signed-off-by: Andrew Morton Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman