kernel/linux.git/mm/memory.c, branch v6.18.21

mm: Fix a hmm_range_fault() livelock / starvation problem

2026-03-19T15:08:49+00:00

commit b570f37a2ce480be26c665345c5514686a8a0274 upstream.

If hmm_range_fault() fails a folio_trylock() in do_swap_page(), trying to acquire the lock of a device-private folio for migration to RAM, the function will spin until it succeeds in grabbing the lock.

However, if the process holding the lock depends on a work item to be completed, which is scheduled on the same CPU as the spinning hmm_range_fault(), that work item might be starved and we end up in a livelock / starvation situation which is never resolved.

This can happen, for example, if the process holding the device-private folio lock is stuck in migrate_device_unmap()->lru_add_drain_all(), since lru_add_drain_all() requires a short work item to be run on all online CPUs to complete.

The prerequisites for this to happen are:
a) Both zone device and system memory folios are considered in migrate_device_unmap(), so that there is a reason to call lru_add_drain_all() for a system memory folio while a folio lock is held on a zone device folio.
b) The zone device folio has an initial mapcount > 1, which causes at least one migration PTE entry insertion to be deferred to try_to_migrate(), which can happen after the call to lru_add_drain_all().
c) No or voluntary-only preemption.

This all seems pretty unlikely to happen, but it is indeed hit by the "xe_exec_system_allocator" igt test.

Resolve this by waiting for the folio to be unlocked if the folio_trylock() fails in do_swap_page().

Rename migration_entry_wait_on_locked() to softleaf_entry_wait_unlock() and update its documentation to indicate the new use-case.

Future code improvements might consider moving the lru_add_drain_all() call in migrate_device_unmap() to be called *after* all pages have migration entries inserted. That would also eliminate b) above.
v2:
- Instead of a cond_resched() in hmm_range_fault(), eliminate the problem by waiting for the folio to be unlocked in do_swap_page() (Alistair Popple, Andrew Morton)
v3:
- Add a stub migration_entry_wait_on_locked() for the !CONFIG_MIGRATION case. (Kernel Test Robot)
v4:
- Rename migration_entry_wait_on_locked() to softleaf_entry_wait_on_locked() and update docs (Alistair Popple)
v5:
- Add a WARN_ON_ONCE() for the !CONFIG_MIGRATION version of softleaf_entry_wait_on_locked().
- Modify wording around function names in the commit message (Andrew Morton)

Suggested-by: Alistair Popple
Fixes: 1afaeb8293c9 ("mm/migrate: Trylock device page in do_swap_page")
Cc: Ralph Campbell
Cc: Christoph Hellwig
Cc: Jason Gunthorpe
Cc: Leon Romanovsky
Cc: Andrew Morton
Cc: Matthew Brost
Cc: John Hubbard
Cc: Alistair Popple
Cc: linux-mm@kvack.org
Signed-off-by: Thomas Hellström
Cc: # v6.15+
Reviewed-by: John Hubbard #v3
Reviewed-by: Alistair Popple
Link: https://patch.msgid.link/20260210115653.92413-1-thomas.hellstrom@linux.intel.com
(cherry picked from commit a69d1ab971a624c6f112cea61536569d579c3215)
Signed-off-by: Rodrigo Vivi
Signed-off-by: Thomas Hellström
Signed-off-by: Greg Kroah-Hartman
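
The starvation scenario described above can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions: one non-preemptible CPU, a lock holder on another CPU that releases the folio lock as soon as a work item queued on the faulting CPU has run, and hypothetical names (`sk_fault_policy`, `sk_simulate_fault()`) that do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policies for the faulting task; not kernel identifiers. */
enum sk_fault_policy {
	SK_SPIN_ON_TRYLOCK,	/* retry the trylock without ever sleeping */
	SK_WAIT_FOR_UNLOCK,	/* sleep until the holder drops the lock */
};

/*
 * Toy model of one non-preemptible CPU. Returns the step at which the
 * faulting task finally gets the folio lock, or -1 if it made no
 * progress within max_steps.
 */
static int sk_simulate_fault(enum sk_fault_policy policy, int max_steps)
{
	bool folio_locked = true;	/* held by a task on another CPU */
	bool work_done = false;		/* work item queued on the faulting CPU */
	int step;

	assert(max_steps > 0);

	for (step = 0; step < max_steps; step++) {
		/* The faulting task runs and tries the lock. */
		if (!folio_locked)
			return step;

		/*
		 * Only a sleeping task frees the CPU for the work item.
		 * A spinning task keeps the CPU, so the work item starves.
		 */
		if (policy == SK_WAIT_FOR_UNLOCK)
			work_done = true;

		/* The remote holder unlocks once its work item has run. */
		if (work_done)
			folio_locked = false;
	}
	return -1;
}
```

In this model the spinning policy never completes, while the waiting policy gets the lock on the next attempt, which mirrors the reasoning behind waiting for the folio to be unlocked instead of retrying the trylock.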

mm: fix some typos in mm module

2026-01-30T09:32:28+00:00

[ Upstream commit b6c46600bfb28b4be4e9cff7bad4f2cf357e0fb7 ]

Below are some typos in the code comments:

  intevals ==> intervals
  addesses ==> addresses
  unavaliable ==> unavailable
  facor ==> factor
  droping ==> dropping
  exlusive ==> exclusive
  decription ==> description
  confict ==> conflict
  desriptions ==> descriptions
  otherwize ==> otherwise
  vlaue ==> value
  cheching ==> checking
  exisitng ==> existing
  modifed ==> modified
  differenciate ==> differentiate
  refernece ==> reference
  permissons ==> permissions
  indepdenent ==> independent
  spliting ==> splitting

Just fix it.

Link: https://lkml.kernel.org/r/20250929002608.1633825-1-jianyungao89@gmail.com
Signed-off-by: jianyun.gao
Reviewed-by: SeongJae Park
Reviewed-by: Wei Yang
Reviewed-by: Dev Jain
Reviewed-by: Liam R. Howlett
Acked-by: Chris Li
Signed-off-by: Andrew Morton
Stable-dep-of: 3937027caecb ("mm/hugetlb: fix two comments related to huge_pmd_unshare()")
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

mm/memory: do not populate page table entries beyond i_size

2025-11-10T05:19:43+00:00

Patch series "Fix SIGBUS semantics with large folios", v3.

Accessing memory within a VMA, but beyond i_size rounded up to the next page size, is supposed to generate SIGBUS.

Darrick reported[1] an xfstests regression in v6.18-rc1. generic/749 failed due to missing SIGBUS. This was caused by my recent changes that try to fault in the whole folio where possible:

        19773df031bc ("mm/fault: try to map the entire file folio in finish_fault()")
        357b92761d94 ("mm/filemap: map entire large folio faultaround")

These changes did not consider i_size when setting up PTEs, leading to xfstests breakage.

However, the problem has been present in the kernel for a long time - since huge tmpfs was introduced in 2016. The kernel happily maps PMD-sized folios as PMD without checking i_size. And huge=always tmpfs allocates PMD-size folios on any writes.

I considered this corner case when I implemented a large tmpfs, and my conclusion was that no one in their right mind should rely on receiving a SIGBUS signal when accessing beyond i_size. I cannot imagine how it could be useful for the workload.

But apparently filesystem folks care a lot about preserving strict SIGBUS semantics.

Generic/749 was introduced last year with reference to POSIX, but no real workloads were mentioned. It also acknowledged the tmpfs deviation from the test case.

POSIX indeed says[3]:

        References within the address range starting at pa and continuing for len bytes to whole pages following the end of an object shall result in delivery of a SIGBUS signal.

The patchset fixes the regression introduced by recent changes as well as more subtle SIGBUS breakage due to split failure on truncation.

This patch (of 2):

Accesses within a VMA, but beyond i_size rounded up to PAGE_SIZE, are supposed to generate SIGBUS.

Recent changes attempted to fault in the full folio where possible. They did not respect i_size, which led to populating PTEs beyond i_size and breaking SIGBUS semantics.
Darrick reported generic/749 breakage because of this.

However, the problem existed before the recent changes. With huge=always tmpfs, any write to a file leads to PMD-size allocation. The subsequent fault-in of the folio will install a PMD mapping regardless of i_size.

Fix filemap_map_pages() and finish_fault() to not install:
  - PTEs beyond i_size;
  - PMD mappings across i_size;

Make an exception for shmem/tmpfs, which for a long time intentionally mapped with PMDs across i_size.

Link: https://lkml.kernel.org/r/20251027115636.82382-1-kirill@shutemov.name
Link: https://lkml.kernel.org/r/20251027115636.82382-2-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau
Fixes: 6795801366da ("xfs: Support large folios")
Reported-by: "Darrick J. Wong"
Cc: Al Viro
Cc: Baolin Wang
Cc: Christian Brauner
Cc: Dave Chinner
Cc: David Hildenbrand
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Liam Howlett
Cc: Lorenzo Stoakes
Cc: Matthew Wilcox (Oracle)
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: Rik van Riel
Cc: Shakeel Butt
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
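
The i_size rule described above comes down to page-index arithmetic, sketched here in userspace C. The helper names (`sk_file_end_index()`, `sk_clamp_nr_pages()`, `sk_pmd_mapping_allowed()`) are hypothetical, the constants assume 4 KiB pages with 512 pages per PMD, and this is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_SHIFT	 12
#define SK_PAGE_SIZE	 (1ULL << SK_PAGE_SHIFT)
#define SK_PAGES_PER_PMD 512ULL	/* assumption: 4 KiB pages, 2 MiB PMD */

/* Index of the first page lying entirely beyond i_size (rounded up). */
static uint64_t sk_file_end_index(uint64_t i_size)
{
	return (i_size + SK_PAGE_SIZE - 1) >> SK_PAGE_SHIFT;
}

/*
 * How many of the pages [start, start + nr) may get a PTE: everything
 * at or beyond the rounded-up end of file is left unmapped, so that an
 * access there still faults and can raise SIGBUS.
 */
static uint64_t sk_clamp_nr_pages(uint64_t start, uint64_t nr, uint64_t i_size)
{
	uint64_t end = sk_file_end_index(i_size);

	if (start >= end)
		return 0;
	if (start + nr > end)
		return end - start;
	return nr;
}

/*
 * A PMD mapping starting at page index pmd_start is refused when it
 * would extend across the end of file, except for shmem/tmpfs.
 */
static bool sk_pmd_mapping_allowed(uint64_t pmd_start, uint64_t i_size,
				   bool is_shmem)
{
	if (is_shmem)
		return true;
	return pmd_start + SK_PAGES_PER_PMD <= sk_file_end_index(i_size);
}
```

For example, with a file of five pages plus one byte, a 16-page range starting at index 0 would be clamped to 6 pages in this sketch, and nothing would be mapped from index 6 onwards.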

mm/fault: try to map the entire file folio in finish_fault()

2025-09-28T18:51:30+00:00

finish_fault() uses per-page fault for file folios. This only occurs for file folios smaller than PMD_SIZE.

The comment suggests that this approach prevents RSS inflation. However, it only prevents RSS accounting. The folio is still mapped to the process, and the fact that it is mapped by a single PTE does not affect memory pressure. Additionally, the kernel's ability to map large folios as PMD if they are large enough does not support this argument.

When possible, map large folios in one shot. This reduces the number of minor page faults and allows for TLB coalescing.

Mapping large folios at once will allow the rmap code to mlock it on add, as it will recognize that it is fully mapped and mlocking is safe.

Link: https://lkml.kernel.org/r/20250923110711.690639-5-kirill@shutemov.name
Signed-off-by: Kiryl Shutsemau
Reviewed-by: Shakeel Butt
Reviewed-by: Baolin Wang
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Lorenzo Stoakes
Signed-off-by: Andrew Morton
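
Whether a folio can be mapped "in one shot" depends on the whole folio fitting inside both the VMA and a single page table. The following is a minimal sketch of that decision, assuming 4 KiB pages and 512 PTEs per page table; `sk_map_range` and `sk_pick_map_range()` are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT	12
#define SK_PTRS_PER_PTE	512ULL	/* assumed entries per page table */

struct sk_map_range {
	uint64_t addr;	/* first virtual address to map */
	uint64_t nr;	/* number of PTEs to install */
};

/*
 * fault_addr: page-aligned faulting address inside [vma_start, vma_end)
 * idx:        index of the faulting page within its folio
 * nr_pages:   folio size in pages (idx < nr_pages)
 *
 * Returns the whole folio's range if it fits in the VMA and in one
 * page table, otherwise just the single faulting page.
 */
static struct sk_map_range sk_pick_map_range(uint64_t fault_addr, uint64_t idx,
					     uint64_t nr_pages,
					     uint64_t vma_start, uint64_t vma_end)
{
	struct sk_map_range one = { fault_addr, 1 };
	struct sk_map_range all = { fault_addr - (idx << SK_PAGE_SHIFT), nr_pages };
	uint64_t vma_off = (fault_addr - vma_start) >> SK_PAGE_SHIFT;
	uint64_t vma_pages = (vma_end - vma_start) >> SK_PAGE_SHIFT;
	uint64_t pte_off = (fault_addr >> SK_PAGE_SHIFT) & (SK_PTRS_PER_PTE - 1);

	assert(idx < nr_pages);

	/* Folio would start before the VMA or run past its end. */
	if (vma_off < idx || vma_off + (nr_pages - idx) > vma_pages)
		return one;
	/* Folio would straddle a page table boundary. */
	if (pte_off < idx || pte_off + (nr_pages - idx) > SK_PTRS_PER_PTE)
		return one;
	return all;
}
```

Falling back to a single page keeps the sketch safe in the awkward cases; when the checks pass, one fault installs all PTEs of the folio, which is where the reduction in minor faults comes from.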

mm: remove redundant test in validate_page_before_insert()

2025-09-21T21:22:32+00:00

The page_has_type() call would have included slab since commit 46df8e73a4a3 and now we don't even get that far because slab pages have a zero refcount since commit 9aec2fb0fd5e.

Link: https://lkml.kernel.org/r/20250910142923.2465470-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle)
Acked-by: David Hildenbrand
Acked-by: Vlastimil Babka
Signed-off-by: Andrew Morton

mm, swap: cleanup swap cache API and add kerneldoc

2025-09-21T21:22:23+00:00

In preparation for replacing the swap cache backend with the swap table, clean up and add proper kernel doc for all swap cache APIs. Now all swap cache APIs are well-defined with consistent names.

No feature change, only renaming and documenting.

Link: https://lkml.kernel.org/r/20250916160100.31545-9-ryncsn@gmail.com
Signed-off-by: Kairui Song
Acked-by: Chris Li
Reviewed-by: Barry Song
Reviewed-by: Baolin Wang
Acked-by: David Hildenbrand
Suggested-by: Chris Li
Cc: Baoquan He
Cc: "Huang, Ying"
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Kemeng Shi
Cc: kernel test robot
Cc: Lorenzo Stoakes
Cc: Matthew Wilcox (Oracle)
Cc: Nhat Pham
Cc: Yosry Ahmed
Cc: Zi Yan
Cc: SeongJae Park
Signed-off-by: Andrew Morton

mm, swap: always lock and check the swap cache folio before use

2025-09-21T21:22:23+00:00

Swap cache lookup only increases the reference count of the returned folio. That's not enough to ensure a folio is stable in the swap cache, so the folio could be removed from the swap cache at any time. The caller should always lock and check the folio before using it.

We have just documented this in kerneldoc, now introduce a helper for swap cache folio verification with proper sanity checks.

Also, sanitize a few current users to use this convention and the new helper for easier debugging. They were not having observable problems yet, only trivial issues like wasted CPU cycles on swapoff or reclaiming. They would fail in some other way, but it is still better to always follow this convention to make things robust and make later commits easier to do.

Link: https://lkml.kernel.org/r/20250916160100.31545-6-ryncsn@gmail.com
Signed-off-by: Kairui Song
Acked-by: David Hildenbrand
Acked-by: Chris Li
Acked-by: Nhat Pham
Suggested-by: Chris Li
Reviewed-by: Barry Song
Cc: Baolin Wang
Cc: Baoquan He
Cc: "Huang, Ying"
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Kemeng Shi
Cc: kernel test robot
Cc: Lorenzo Stoakes
Cc: Matthew Wilcox (Oracle)
Cc: Yosry Ahmed
Cc: Zi Yan
Cc: SeongJae Park
Signed-off-by: Andrew Morton
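
The "lock, then check" convention can be sketched with a simplified model of a folio. Everything here is an assumption made for illustration: `sk_folio` is not the kernel's struct folio, swap entries are modelled as plain integers, and `sk_folio_matches_entry()` is a hypothetical stand-in for the verification helper the commit introduces.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical model of a folio found via swap cache lookup. */
struct sk_folio {
	bool locked;		/* caller holds the folio lock */
	bool in_swapcache;	/* still present in the swap cache */
	unsigned long swap_val;	/* first swap slot the folio backs */
	unsigned long nr_pages;	/* number of consecutive slots it covers */
};

/*
 * Verify that a locked folio still backs the given swap entry. A
 * reference alone does not keep the folio in the swap cache, so the
 * answer is only stable while the lock is held.
 */
static bool sk_folio_matches_entry(const struct sk_folio *folio,
				   unsigned long entry)
{
	assert(folio->locked);	/* sanity check: must be called under lock */

	if (!folio->in_swapcache)
		return false;
	return entry >= folio->swap_val &&
	       entry - folio->swap_val < folio->nr_pages;
}
```

A caller following the convention would look up the folio, lock it, call a check like this one, and drop the folio and retry if the check fails.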

mm, swap: check page poison flag after locking it

2025-09-21T21:22:22+00:00

Instead of checking the poison flag only in the fast swap cache lookup path, always check the poison flags after locking a swap cache folio.

There are two reasons to do so.

The folio is unstable and could be removed from the swap cache anytime, so it's totally possible that the folio is no longer the backing folio of a swap entry, and could be an irrelevant poisoned folio. We might mistakenly kill a faulting process.

And it's totally possible or even common for the slow swap in path (swapin_readahead) to bring in a cached folio. The cache folio could be poisoned, too. Only checking the poison flag in the fast path will miss such folios.

The race window is tiny, so it's very unlikely to happen, though. While at it, also add an unlikely prefix.

Link: https://lkml.kernel.org/r/20250916160100.31545-5-ryncsn@gmail.com
Signed-off-by: Kairui Song
Acked-by: Chris Li
Acked-by: David Hildenbrand
Acked-by: Nhat Pham
Suggested-by: Chris Li
Cc: Baolin Wang
Cc: Baoquan He
Cc: Barry Song
Cc: "Huang, Ying"
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Kemeng Shi
Cc: kernel test robot
Cc: Lorenzo Stoakes
Cc: Matthew Wilcox (Oracle)
Cc: Yosry Ahmed
Cc: Zi Yan
Cc: SeongJae Park
Signed-off-by: Andrew Morton
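
The ordering argument above (confirm the folio still belongs to the entry, and only then trust its poison flag) can be shown with a small decision function. This is a sketch with hypothetical names (`sk_swapin_result`, `sk_locked_folio`, `sk_swapin_check_locked()`); it models a single-page folio already locked by the caller and is not the kernel's fault path.

```c
#include <assert.h>
#include <stdbool.h>

enum sk_swapin_result {
	SK_SWAPIN_OK,		/* folio is valid for this entry */
	SK_SWAPIN_RETRY,	/* folio is unrelated, look it up again */
	SK_SWAPIN_HWPOISON,	/* the entry's own folio is poisoned */
};

/* Hypothetical single-page folio, assumed to be locked by the caller. */
struct sk_locked_folio {
	bool in_swapcache;
	bool hwpoison;
	unsigned long swap_val;
};

static enum sk_swapin_result
sk_swapin_check_locked(const struct sk_locked_folio *folio, unsigned long entry)
{
	/*
	 * Identity first: a poisoned folio that no longer backs this
	 * entry says nothing about the faulting process.
	 */
	if (!folio->in_swapcache || folio->swap_val != entry)
		return SK_SWAPIN_RETRY;

	/* Checked for every locked folio, whichever path produced it. */
	if (folio->hwpoison)
		return SK_SWAPIN_HWPOISON;

	return SK_SWAPIN_OK;
}
```

With this ordering, a poisoned but unrelated folio leads to a retry rather than to a poison error for the faulting process.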

mm, swap: use unified helper for swap cache look up

2025-09-21T21:22:22+00:00

The swap cache lookup helper swap_cache_get_folio currently does readahead updates as well, so callers that are not doing swapin from any VMA or mapping are forced to reuse filemap helpers instead, and have to access the swap cache space directly.

So decouple readahead update from swap cache lookup. Move the readahead update part into a standalone helper. Let the caller call the readahead update helper if they do readahead. And convert all swap cache lookups to use swap_cache_get_folio.

After this commit, there are only three special cases for accessing swap cache space now: huge memory splitting, migration, and shmem replacing, because they need to lock the XArray. The following commits will wrap their accesses to the swap cache too, with special helpers.

And worth noting, currently dropbehind is not supported for anon folio, and we will never see a dropbehind folio in swap cache. The unified helper can be updated later to handle that.

While at it, add proper kerneldoc for touched helpers.

No functional change.

Link: https://lkml.kernel.org/r/20250916160100.31545-3-ryncsn@gmail.com
Signed-off-by: Kairui Song
Reviewed-by: Baolin Wang
Reviewed-by: Barry Song
Acked-by: David Hildenbrand
Acked-by: Chris Li
Acked-by: Nhat Pham
Suggested-by: Chris Li
Cc: Baoquan He
Cc: "Huang, Ying"
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Kemeng Shi
Cc: kernel test robot
Cc: Lorenzo Stoakes
Cc: Matthew Wilcox (Oracle)
Cc: Yosry Ahmed
Cc: Zi Yan
Cc: SeongJae Park
Signed-off-by: Andrew Morton

mm/huge_memory: respect MADV_COLLAPSE with PR_THP_DISABLE_EXCEPT_ADVISED

2025-09-13T23:55:05+00:00

Let's allow for making MADV_COLLAPSE succeed on areas that neither have VM_HUGEPAGE nor VM_NOHUGEPAGE when we have THP disabled unless explicitly advised (PR_THP_DISABLE_EXCEPT_ADVISED).

MADV_COLLAPSE is a clear advice that we want to collapse.

Note that we still respect the VM_NOHUGEPAGE flag, just like MADV_COLLAPSE always does. So consequently, MADV_COLLAPSE is now only refused on VM_NOHUGEPAGE with PR_THP_DISABLE_EXCEPT_ADVISED, including for shmem.

Link: https://lkml.kernel.org/r/20250815135549.130506-4-usamaarif642@gmail.com
Co-developed-by: Usama Arif
Signed-off-by: Usama Arif
Signed-off-by: David Hildenbrand
Reviewed-by: Baolin Wang
Reviewed-by: Lorenzo Stoakes
Reviewed-by: Zi Yan
Cc: Arnd Bergmann
Cc: Barry Song
Cc: Dev Jain
Cc: Jann Horn
Cc: Johannes Weiner
Cc: Jonathan Corbet
Cc: Liam Howlett
Cc: Mariano Pache
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: Rik van Riel
Cc: Ryan Roberts
Cc: SeongJae Park
Cc: Shakeel Butt
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Cc: Yafang
Signed-off-by: Andrew Morton
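
The resulting policy under PR_THP_DISABLE_EXCEPT_ADVISED can be written as a small predicate. This sketch covers only that mode as described above; the flag values and the name `sk_thp_allowed_except_advised()` are hypothetical and do not match the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for the sketch; not the kernel's VM_* values. */
#define SK_VM_HUGEPAGE   0x1u
#define SK_VM_NOHUGEPAGE 0x2u

/*
 * With THP disabled except when advised, is a huge page allowed for a
 * mapping with the given flags? madv_collapse is true when the request
 * comes from MADV_COLLAPSE.
 */
static bool sk_thp_allowed_except_advised(unsigned int vm_flags,
					  bool madv_collapse)
{
	/* VM_NOHUGEPAGE is always respected, even for MADV_COLLAPSE. */
	if (vm_flags & SK_VM_NOHUGEPAGE)
		return false;

	/* Either form of advice is enough: VM_HUGEPAGE or MADV_COLLAPSE. */
	return (vm_flags & SK_VM_HUGEPAGE) || madv_collapse;
}
```

The case this commit changes is the one with neither flag set: it stays refused for ordinary faults but is accepted when the request is an explicit MADV_COLLAPSE.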