summaryrefslogtreecommitdiff
path: root/mm/internal.h
AgeCommit message (Collapse)AuthorFilesLines
2024-12-09Rename .data.once to .data..once to fix resetting WARN*_ONCEMasahiro Yamada1-1/+1
[ Upstream commit dbefa1f31a91670c9e7dac9b559625336206466f ] Commit b1fca27d384e ("kernel debug: support resetting WARN*_ONCE") added support for clearing the state of once warnings. However, it is not functional when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION or CONFIG_LTO_CLANG is enabled, because .data.once matches the .data.[0-9a-zA-Z_]* pattern in the DATA_MAIN macro. Commit cb87481ee89d ("kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured") was introduced to suppress the issue for the default CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=n case, providing a minimal fix for stable backporting. We were aware this did not address the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y. The plan was to apply correct fixes and then revert cb87481ee89d. [1] Seven years have passed since then, yet the #ifdef workaround remains in place. Meanwhile, commit b1fca27d384e introduced the .data.once section, and commit dc5723b02e52 ("kbuild: add support for Clang LTO") extended the #ifdef. Using a ".." separator in the section name fixes the issue for CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and CONFIG_LTO_CLANG. [1]: https://lore.kernel.org/linux-kbuild/CAK7LNASck6BfdLnESxXUeECYL26yUDm0cwRZuM4gmaWUkxjL5g@mail.gmail.com/ Fixes: b1fca27d384e ("kernel debug: support resetting WARN*_ONCE") Fixes: dc5723b02e52 ("kbuild: add support for Clang LTO") Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-11-22mm: unconditionally close VMAs on errorLorenzo Stoakes1-0/+18
[ Upstream commit 4080ef1579b2413435413988d14ac8c68e4d42c8 ] Incorrect invocation of VMA callbacks when the VMA is no longer in a consistent state is bug prone and risky to perform. With regards to the important vm_ops->close() callback We have gone to great lengths to try to track whether or not we ought to close VMAs. Rather than doing so and risking making a mistake somewhere, instead unconditionally close and reset vma->vm_ops to an empty dummy operations set with a NULL .close operator. We introduce a new function to do so - vma_close() - and simplify existing vms logic which tracked whether we needed to close or not. This simplifies the logic, avoids incorrect double-calling of the .close() callback and allows us to update error paths to simply call vma_close() unconditionally - making VMA closure idempotent. Link: https://lkml.kernel.org/r/28e89dda96f68c505cb6f8e9fc9b57c3e9f74b42.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-22mm: avoid unsafe VMA hook invocation when error arises on mmap hookLorenzo Stoakes1-0/+27
[ Upstream commit 3dd6ed34ce1f2356a77fb88edafb5ec96784e3cf ] Patch series "fix error handling in mmap_region() and refactor (hotfixes)", v4. mmap_region() is somewhat terrifying, with spaghetti-like control flow and numerous means by which issues can arise and incomplete state, memory leaks and other unpleasantness can occur. A large amount of the complexity arises from trying to handle errors late in the process of mapping a VMA, which forms the basis of recently observed issues with resource leaks and observable inconsistent state. This series goes to great lengths to simplify how mmap_region() works and to avoid unwinding errors late on in the process of setting up the VMA for the new mapping, and equally avoids such operations occurring while the VMA is in an inconsistent state. The patches in this series comprise the minimal changes required to resolve existing issues in mmap_region() error handling, in order that they can be hotfixed and backported. There is additionally a follow up series which goes further, separated out from the v1 series and sent and updated separately. This patch (of 5): After an attempted mmap() fails, we are no longer in a situation where we can safely interact with VMA hooks. This is currently not enforced, meaning that we need complicated handling to ensure we do not incorrectly call these hooks. We can avoid the whole issue by treating the VMA as suspect the moment that the file->f_ops->mmap() function reports an error by replacing whatever VMA operations were installed with a dummy empty set of VMA operations. We do so through a new helper function internal to mm - mmap_file() - which is both more logically named than the existing call_mmap() function and correctly isolates handling of the vm_op reassignment to mm. All the existing invocations of call_mmap() outside of mm are ultimately nested within the call_mmap() from mm, which we now replace. It is therefore safe to leave call_mmap() in place as a convenience function (and to avoid churn). The invokers are: ovl_file_operations -> mmap -> ovl_mmap() -> backing_file_mmap() coda_file_operations -> mmap -> coda_file_mmap() shm_file_operations -> shm_mmap() shm_file_operations_huge -> shm_mmap() dma_buf_fops -> dma_buf_mmap_internal -> i915_dmabuf_ops -> i915_gem_dmabuf_mmap() None of these callers interact with vm_ops or mappings in a problematic way on error, quickly exiting out. Link: https://lkml.kernel.org/r/cover.1730224667.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/d41fd763496fd0048a962f3fd9407dc72dd4fd86.1730224667.git.lorenzo.stoakes@oracle.com Fixes: deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reported-by: Jann Horn <jannh@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jann Horn <jannh@google.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Helge Deller <deller@gmx.de> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Brown <broonie@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-17mm/thp: fix deferred split unqueue naming and lockingHugh Dickins1-5/+5
commit f8f931bba0f92052cf842b7e30917b1afcc77d5a upstream. Recent changes are putting more pressure on THP deferred split queues: under load revealing long-standing races, causing list_del corruptions, "Bad page state"s and worse (I keep BUGs in both of those, so usually don't get to see how badly they end up without). The relevant recent changes being 6.8's mTHP, 6.10's mTHP swapout, and 6.12's mTHP swapin, improved swap allocation, and underused THP splitting. Before fixing locking: rename misleading folio_undo_large_rmappable(), which does not undo large_rmappable, to folio_unqueue_deferred_split(), which is what it does. But that and its out-of-line __callee are mm internals of very limited usability: add comment and WARN_ON_ONCEs to check usage; and return a bool to say if a deferred split was unqueued, which can then be used in WARN_ON_ONCEs around safety checks (sparing callers the arcane conditionals in __folio_unqueue_deferred_split()). Just omit the folio_unqueue_deferred_split() from free_unref_folios(), all of whose callers now call it beforehand (and if any forget then bad_page() will tell) - except for its caller put_pages_list(), which itself no longer has any callers (and will be deleted separately). Swapout: mem_cgroup_swapout() has been resetting folio->memcg_data 0 without checking and unqueueing a THP folio from deferred split list; which is unfortunate, since the split_queue_lock depends on the memcg (when memcg is enabled); so swapout has been unqueueing such THPs later, when freeing the folio, using the pgdat's lock instead: potentially corrupting the memcg's list. __remove_mapping() has frozen refcount to 0 here, so no problem with calling folio_unqueue_deferred_split() before resetting memcg_data. That goes back to 5.4 commit 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware"): which included a check on swapcache before adding to deferred queue, but no check on deferred queue before adding THP to swapcache. That worked fine with the usual sequence of events in reclaim (though there were a couple of rare ways in which a THP on deferred queue could have been swapped out), but 6.12 commit dafff3f4c850 ("mm: split underused THPs") avoids splitting underused THPs in reclaim, which makes swapcache THPs on deferred queue commonplace. Keep the check on swapcache before adding to deferred queue? Yes: it is no longer essential, but preserves the existing behaviour, and is likely to be a worthwhile optimization (vmstat showed much more traffic on the queue under swapping load if the check was removed); update its comment. Memcg-v1 move (deprecated): mem_cgroup_move_account() has been changing folio->memcg_data without checking and unqueueing a THP folio from the deferred list, sometimes corrupting "from" memcg's list, like swapout. Refcount is non-zero here, so folio_unqueue_deferred_split() can only be used in a WARN_ON_ONCE to validate the fix, which must be done earlier: mem_cgroup_move_charge_pte_range() first try to split the THP (splitting of course unqueues), or skip it if that fails. Not ideal, but moving charge has been requested, and khugepaged should repair the THP later: nobody wants new custom unqueueing code just for this deprecated case. The 87eaceb3faa5 commit did have the code to move from one deferred list to another (but was not conscious of its unsafety while refcount non-0); but that was removed by 5.6 commit fac0516b5534 ("mm: thp: don't need care deferred split queue in memcg charge move path"), which argued that the existence of a PMD mapping guarantees that the THP cannot be on a deferred list. As above, false in rare cases, and now commonly false. Backport to 6.11 should be straightforward. Earlier backports must take care that other _deferred_list fixes and dependencies are included. There is not a strong case for backports, but they can fix cornercases. Link: https://lkml.kernel.org/r/8dc111ae-f6db-2da7-b25c-7a20b1effe3b@google.com Fixes: 87eaceb3faa5 ("mm: thp: make deferred split shrinker memcg aware") Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Upstream commit itself does not apply cleanly, because there are fewer calls to folio_undo_large_rmappable() in this tree (in particular, folio migration does not migrate memcg charge), and mm/memcontrol-v1.c has not been split out of mm/memcontrol.c. ] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-17mm: refactor folio_undo_large_rmappable()Kefeng Wang1-1/+16
commit 593a10dabe08dcf93259fce2badd8dc2528859a8 upstream. Folios of order <= 1 are not in deferred list, the check of order is added into folio_undo_large_rmappable() from commit 8897277acfef ("mm: support order-1 folios in the page cache"), but there is a repeated check for small folio (order 0) during each call of the folio_undo_large_rmappable(), so only keep folio_order() check inside the function. In addition, move all the checks into header file to save a function call for non-large-rmappable or empty deferred_list folio. Link: https://lkml.kernel.org/r/20240521130315.46072-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Upstream commit itself does not apply cleanly, because there are fewer calls to folio_undo_large_rmappable() in this tree. ] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-17mm: always initialise folio->_deferred_listMatthew Wilcox (Oracle)1-0/+2
commit b7b098cf00a2b65d5654a86dc8edf82f125289c1 upstream. Patch series "Various significant MM patches". These patches all interact in annoying ways which make it tricky to send them out in any way other than a big batch, even though there's not really an overarching theme to connect them. The big effects of this patch series are: - folio_test_hugetlb() becomes reliable, even when called without a page reference - We free up PG_slab, and we could always use more page flags - We no longer need to check PageSlab before calling page_mapcount() This patch (of 9): For compound pages which are at least order-2 (and hence have a deferred_list), initialise it and then we can check at free that the page is not part of a deferred list. We recently found this useful to rule out a source of corruption. [peterx@redhat.com: always initialise folio->_deferred_list] Link: https://lkml.kernel.org/r/20240417211836.2742593-2-peterx@redhat.com Link: https://lkml.kernel.org/r/20240321142448.1645400-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240321142448.1645400-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Include three small changes from the upstream commit, for backport safety: replace list_del() by list_del_init() in split_huge_page_to_list(), like c010d47f107f ("mm: thp: split huge page to any lower order pages"); replace list_del() by list_del_init() in folio_undo_large_rmappable(), like 9bcef5973e31 ("mm: memcg: fix split queue list crash when large folio migration"); keep __free_pages() instead of folio_put() in __update_and_free_hugetlb_folio(). ] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-17mm: support order-1 folios in the page cacheMatthew Wilcox (Oracle)1-2/+1
commit 8897277acfef7f70fdecc054073bea2542fc7a1b upstream. Folios of order 1 have no space to store the deferred list. This is not a problem for the page cache as file-backed folios are never placed on the deferred list. All we need to do is prevent the core MM from touching the deferred list for order 1 folios and remove the code which prevented us from allocating order 1 folios. Link: https://lore.kernel.org/linux-mm/90344ea7-4eec-47ee-5996-0c22f42d6a6a@google.com/ Link: https://lkml.kernel.org/r/20240226205534.1603748-3-zi.yan@sent.com Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-11-17mm: add page_rmappable_folio() wrapperHugh Dickins1-0/+9
commit 23e4883248f0472d806c8b3422ba6257e67bf1a5 upstream. folio_prep_large_rmappable() is being used repeatedly along with a conversion from page to folio, a check non-NULL, a check order > 1: wrap it all up into struct folio *page_rmappable_folio(struct page *). Link: https://lkml.kernel.org/r/8d92c6cf-eebe-748-e29c-c8ab224c741@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-08-19mm: gup: stop abusing try_grab_folioYang Shi1-2/+2
commit f442fa6141379a20b48ae3efabee827a3d260787 upstream. A kernel warning was reported when pinning folio in CMA memory when launching SEV virtual machine. The splat looks like: [ 464.325306] WARNING: CPU: 13 PID: 6734 at mm/gup.c:1313 __get_user_pages+0x423/0x520 [ 464.325464] CPU: 13 PID: 6734 Comm: qemu-kvm Kdump: loaded Not tainted 6.6.33+ #6 [ 464.325477] RIP: 0010:__get_user_pages+0x423/0x520 [ 464.325515] Call Trace: [ 464.325520] <TASK> [ 464.325523] ? __get_user_pages+0x423/0x520 [ 464.325528] ? __warn+0x81/0x130 [ 464.325536] ? __get_user_pages+0x423/0x520 [ 464.325541] ? report_bug+0x171/0x1a0 [ 464.325549] ? handle_bug+0x3c/0x70 [ 464.325554] ? exc_invalid_op+0x17/0x70 [ 464.325558] ? asm_exc_invalid_op+0x1a/0x20 [ 464.325567] ? __get_user_pages+0x423/0x520 [ 464.325575] __gup_longterm_locked+0x212/0x7a0 [ 464.325583] internal_get_user_pages_fast+0xfb/0x190 [ 464.325590] pin_user_pages_fast+0x47/0x60 [ 464.325598] sev_pin_memory+0xca/0x170 [kvm_amd] [ 464.325616] sev_mem_enc_register_region+0x81/0x130 [kvm_amd] Per the analysis done by yangge, when starting the SEV virtual machine, it will call pin_user_pages_fast(..., FOLL_LONGTERM, ...) to pin the memory. But the page is in CMA area, so fast GUP will fail then fallback to the slow path due to the longterm pinnalbe check in try_grab_folio(). The slow path will try to pin the pages then migrate them out of CMA area. But the slow path also uses try_grab_folio() to pin the page, it will also fail due to the same check then the above warning is triggered. In addition, the try_grab_folio() is supposed to be used in fast path and it elevates folio refcount by using add ref unless zero. We are guaranteed to have at least one stable reference in slow path, so the simple atomic add could be used. The performance difference should be trivial, but the misuse may be confusing and misleading. Redefined try_grab_folio() to try_grab_folio_fast(), and try_grab_page() to try_grab_folio(), and use them in the proper paths. This solves both the abuse and the kernel warning. The proper naming makes their usecase more clear and should prevent from abusing in the future. peterx said: : The user will see the pin fails, for gpu-slow it further triggers the WARN : right below that failure (as in the original report): : : folio = try_grab_folio(page, page_increm - 1, : foll_flags); : if (WARN_ON_ONCE(!folio)) { <------------------------ here : /* : * Release the 1st page ref if the : * folio is problematic, fail hard. : */ : gup_put_folio(page_folio(page), 1, : foll_flags); : ret = -EFAULT; : goto out; : } [1] https://lore.kernel.org/linux-mm/1719478388-31917-1-git-send-email-yangge1116@126.com/ [shy828301@gmail.com: fix implicit declaration of function try_grab_folio_fast] Link: https://lkml.kernel.org/r/CAHbLzkowMSso-4Nufc9hcMehQsK9PNz3OSu-+eniU-2Mm-xjhA@mail.gmail.com Link: https://lkml.kernel.org/r/20240628191458.2605553-1-yang@os.amperecomputing.com Fixes: 57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") Signed-off-by: Yang Shi <yang@os.amperecomputing.com> Reported-by: yangge <yangge1116@126.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> [6.6+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-05-02mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properlyDavid Hildenbrand1-4/+6
[ Upstream commit 631426ba1d45a8672b177ee85ad4cabe760dd131 ] Darrick reports that in some cases where pread() would fail with -EIO and mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ / MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT. While the madvise() call can be interrupted by a signal, this is not the desired behavior. MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave like page faults in that case: fail and not retry forever. A reproducer can be found at [1]. The reason is that __get_user_pages(), as called by faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way: it will simply return 0 when VM_FAULT_RETRY happened, making madvise_populate()->faultin_vma_page_range() retry again and again, never setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages(). __get_user_pages_locked() does what we want, but duplicating that logic in faultin_vma_page_range() feels wrong. So let's use __get_user_pages_locked() instead, that will detect VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT from faultin_page() to __get_user_pages(), all the way to madvise_populate(). But, there is an issue: __get_user_pages_locked() will end up re-taking the MM lock and then __get_user_pages() will do another VMA lookup. In the meantime, the VMA layout could have changed and we'd fail with different error codes than we'd want to. As __get_user_pages() will currently do a new VMA lookup either way, let it do the VMA handling in a different way, controlled by a new FOLL_MADV_POPULATE flag, effectively moving these checks from madvise_populate() + faultin_page_range() in there. With this change, Darricks reproducer properly fails with -EFAULT, as documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE. [1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/ Link: https://lkml.kernel.org/r/20240314161300.382526-1-david@redhat.com Link: https://lkml.kernel.org/r/20240314161300.382526-2-david@redhat.com Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Darrick J. Wong <djwong@kernel.org> Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/ Cc: Darrick J. Wong <djwong@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-05-02mm/gup: explicitly define and check internal GUP flags, disallow FOLL_TOUCHLorenzo Stoakes1-0/+3
[ Upstream commit 0f20bba1688bdf3b32df0162511a67d4eda15790 ] Rather than open-coding a list of internal GUP flags in is_valid_gup_args(), define which ones are internal. In addition, explicitly check to see if the user passed in FOLL_TOUCH somehow, as this appears to have been accidentally excluded. Link: https://lkml.kernel.org/r/971e013dfe20915612ea8b704e801d7aef9a66b6.1696288092.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Stable-dep-of: 631426ba1d45 ("mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-31Merge tag 'x86_shstk_for_6.6-rc1' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 shadow stack support from Dave Hansen: "This is the long awaited x86 shadow stack support, part of Intel's Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace. The main use case for shadow stack is providing protection against return oriented programming attacks. It works by maintaining a secondary (shadow) stack using a special memory type that has protections against modification. When executing a CALL instruction, the processor pushes the return address to both the normal stack and to the special permission shadow stack. Upon RET, the processor pops the shadow stack copy and compares it to the normal stack copy. For more information, refer to the links below for the earlier versions of this patch set" Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/ Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/ * tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits) x86/shstk: Change order of __user in type x86/ibt: Convert IBT selftest to asm x86/shstk: Don't retry vm_munmap() on -EINTR x86/kbuild: Fix Documentation/ reference x86/shstk: Move arch detail comment out of core mm x86/shstk: Add ARCH_SHSTK_STATUS x86/shstk: Add ARCH_SHSTK_UNLOCK x86: Add PTRACE interface for shadow stack selftests/x86: Add shadow stack test x86/cpufeatures: Enable CET CR4 bit for shadow stack x86/shstk: Wire in shadow stack interface x86: Expose thread features in /proc/$PID/status x86/shstk: Support WRSS for userspace x86/shstk: Introduce map_shadow_stack syscall x86/shstk: Check that signal frame is shadow stack mem x86/shstk: Check that SSP is aligned on sigreturn x86/shstk: Handle signals for shadow stack x86/shstk: Introduce routines modifying shstk x86/shstk: Handle thread shadow stack x86/shstk: Add user-mode shadow stack support ...
2023-08-22mm: free up a word in the first tail pageMatthew Wilcox (Oracle)1-1/+1
Store the folio order in the low byte of the flags word in the first tail page. This frees up the word that was being used to store the order and dtor bytes previously. Link: https://lkml.kernel.org/r/20230816151201.3655946-11-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-22mm: add large_rmappable page flagMatthew Wilcox (Oracle)1-1/+0
Stored in the first tail page's flags, this flag replaces the destructor. That removes the last of the destructors, so remove all references to folio_dtor and compound_dtor. Link: https://lkml.kernel.org/r/20230816151201.3655946-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-22mm: convert free_transhuge_folio() to folio_undo_large_rmappable()Matthew Wilcox (Oracle)1-0/+2
Indirect calls are expensive, thanks to Spectre. Test for TRANSHUGE_PAGE_DTOR and destroy the folio appropriately. Move the free_compound_page() call into destroy_large_folio() to simplify later patches. Link: https://lkml.kernel.org/r/20230816151201.3655946-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-22merge mm-hotfixes-stable into mm-stable to pick up depended-upon changesAndrew Morton1-0/+17
2023-08-21mm: allow fault_dirty_shared_page() to be called under the VMA lockMatthew Wilcox (Oracle)1-1/+1
By making maybe_unlock_mmap_for_io() handle the VMA lock correctly, we make fault_dirty_shared_page() safe to be called without the mmap lock held. Link: https://lkml.kernel.org/r/20230812002033.1002367-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: David Hildenbrand <david@redhat.com> Tested-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm: disable kernelcore=mirror when no mirror memoryMa Wupeng1-0/+1
For system with kernelcore=mirror enabled while no mirrored memory is reported by efi. This could lead to kernel OOM during startup since all memory beside zone DMA are in the movable zone and this prevents the kernel to use it. Zone DMA/DMA32 initialization is independent of mirrored memory and their max pfn is set in zone_sizes_init(). Since kernel can fallback to zone DMA/DMA32 if there is no memory in zone Normal, these zones are seen as mirrored memory no mather their memory attributes are. To solve this problem, disable kernelcore=mirror when there is no real mirrored memory exists. Link: https://lkml.kernel.org/r/20230802072328.2107981-1-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Suggested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Suggested-by: Mike Rapoport <rppt@kernel.org> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Levi Yun <ppbuk5246@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via ↵David Hildenbrand1-0/+10
GUP-fast In contrast to most other GUP code, GUP-fast common page table walking code like gup_pte_range() also handles hugetlb pages. But in contrast to other hugetlb page table walking code, it does not look at the hugetlb PTE abstraction whereby we have only a single logical hugetlb PTE per hugetlb page, even when using multiple cont-PTEs underneath -- which is for example what huge_ptep_get() abstracts. So when we have a hugetlb page that is mapped via cont-PTEs, GUP-fast might stumble over a PTE that does not map the head page of a hugetlb page -- not the first "head" PTE of such a cont mapping. Logically, the whole hugetlb page is mapped (entire_mapcount == 1), but we might end up calling gup_must_unshare() with a tail page of a hugetlb page. We only maintain a single PageAnonExclusive flag per hugetlb page (as hugetlb pages cannot get partially COW-shared), stored for the head page. That flag is clear for all tail pages. So when gup_must_unshare() ends up calling PageAnonExclusive() with a tail page of a hugetlb page: 1) With CONFIG_DEBUG_VM_PGFLAGS Stumbles over the: VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); For example, when executing the COW selftests with 64k hugetlb pages on arm64: [ 61.082187] page:00000000829819ff refcount:3 mapcount:1 mapping:0000000000000000 index:0x1 pfn:0x11ee11 [ 61.082842] head:0000000080f79bf7 order:4 entire_mapcount:1 nr_pages_mapped:0 pincount:2 [ 61.083384] anon flags: 0x17ffff80003000e(referenced|uptodate|dirty|head|mappedtodisk|node=0|zone=2|lastcpupid=0xfffff) [ 61.084101] page_type: 0xffffffff() [ 61.084332] raw: 017ffff800000000 fffffc00037b8401 0000000000000402 0000000200000000 [ 61.084840] raw: 0000000000000010 0000000000000000 00000000ffffffff 0000000000000000 [ 61.085359] head: 017ffff80003000e ffffd9e95b09b788 ffffd9e95b09b788 ffff0007ff63cf71 [ 61.085885] head: 0000000000000000 0000000000000002 00000003ffffffff 0000000000000000 [ 61.086415] page dumped because: VM_BUG_ON_PAGE(PageHuge(page) && !PageHead(page)) [ 61.086914] ------------[ cut here ]------------ [ 61.087220] kernel BUG at include/linux/page-flags.h:990! [ 61.087591] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 61.087999] Modules linked in: ... [ 61.089404] CPU: 0 PID: 4612 Comm: cow Kdump: loaded Not tainted 6.5.0-rc4+ #3 [ 61.089917] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 61.090409] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 61.090897] pc : gup_must_unshare.part.0+0x64/0x98 [ 61.091242] lr : gup_must_unshare.part.0+0x64/0x98 [ 61.091592] sp : ffff8000825eb940 [ 61.091826] x29: ffff8000825eb940 x28: 0000000000000000 x27: fffffc00037b8440 [ 61.092329] x26: 0400000000000001 x25: 0000000000080101 x24: 0000000000080000 [ 61.092835] x23: 0000000000080100 x22: ffff0000cffb9588 x21: ffff0000c8ec6b58 [ 61.093341] x20: 0000ffffad6b1000 x19: fffffc00037b8440 x18: ffffffffffffffff [ 61.093850] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548 [ 61.094358] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021 [ 61.094858] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffffd9e958b7a1c0 [ 61.095359] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 00000000002bffa8 [ 61.095873] x5 : ffff0008bb19e708 x4 : 0000000000000000 x3 : 0000000000000000 [ 61.096380] x2 : 0000000000000000 x1 : ffff0000cf6636c0 x0 : 0000000000000046 [ 61.096894] Call trace: [ 61.097080] gup_must_unshare.part.0+0x64/0x98 [ 61.097392] gup_pte_range+0x3a8/0x3f0 [ 61.097662] gup_pgd_range+0x1ec/0x280 [ 61.097942] lockless_pages_from_mm+0x64/0x1a0 [ 61.098258] internal_get_user_pages_fast+0xe4/0x1d0 [ 61.098612] pin_user_pages_fast+0x58/0x78 [ 61.098917] pin_longterm_test_start+0xf4/0x2b8 [ 61.099243] gup_test_ioctl+0x170/0x3b0 [ 61.099528] __arm64_sys_ioctl+0xa8/0xf0 [ 61.099822] invoke_syscall.constprop.0+0x7c/0xd0 [ 61.100160] el0_svc_common.constprop.0+0xe8/0x100 [ 61.100500] do_el0_svc+0x38/0xa0 [ 61.100736] el0_svc+0x3c/0x198 [ 61.100971] el0t_64_sync_handler+0x134/0x150 [ 61.101280] el0t_64_sync+0x17c/0x180 [ 61.101543] Code: aa1303e0 f00074c1 912b0021 97fffeb2 (d4210000) 2) Without CONFIG_DEBUG_VM_PGFLAGS Always detects "not exclusive" for passed tail pages and refuses to PIN the tail pages R/O, as gup_must_unshare() == true. GUP-fast will fallback to ordinary GUP. As ordinary GUP properly considers the logical hugetlb PTE abstraction in hugetlb_follow_page_mask(), pinning the page will succeed when looking at the PageAnonExclusive on the head page only. So the only real effect of this is that with cont-PTE hugetlb pages, we'll always fallback from GUP-fast to ordinary GUP when not working on the head page, which ends up checking the head page and do the right thing. Consequently, the cow selftests pass with cont-PTE hugetlb pages as well without CONFIG_DEBUG_VM_PGFLAGS. Note that this only applies to anon hugetlb pages that are mapped using cont-PTEs: for example 64k hugetlb pages on a 4k arm64 kernel. ... and only when R/O-pinning (FOLL_PIN) such pages that are mapped into the page table R/O using GUP-fast. On production kernels (and even most debug kernels, that don't set CONFIG_DEBUG_VM_PGFLAGS) this patch should theoretically not be required to be backported. But of course, it does not hurt. Link: https://lkml.kernel.org/r/20230805101256.87306-1-david@redhat.com Fixes: a7f226604170 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()David Hildenbrand1-0/+7
We shouldn't be using a GUP-internal helper if it can be avoided. Similar to smaps_pte_entry() that uses vm_normal_page(), let's use vm_normal_page_pmd() that similarly refuses to return the huge zeropage. In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd(): (1) Will always return the head page, not a tail page of a THP. If we'd ever call smaps_account with a tail page while setting "compound = true", we could be in trouble, because smaps_account() would look at the memmap of unrelated pages. If we're unlucky, that memmap does not exist at all. Before we removed PG_doublemap, we could have triggered something similar as in commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for migration entry"). This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock"): (a) We're in show_smaps_rollup() and processed a VMA (b) We release the mmap lock in show_smaps_rollup() because it is contended (c) We merged that VMA with another VMA (d) We collapsed a THP in that merged VMA at that position If the end address of the original VMA falls into the middle of a THP area, we would call smap_gather_stats() with a start address that falls into a PMD-mapped THP. It's probably very rare to trigger when not really forced. (2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page() Treat such PMDs here just like smaps_pte_entry() would treat such PTEs. If such pages would be anonymous, we most certainly would want to account them. (3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap() As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE pages. So just like smaps_pte_entry(), we'll now also ignore such PMD entries. Especially, follow_pmd_mask() never ends up calling follow_trans_huge_pmd() on pmd_devmap(). Instead it calls follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN is set. So skipping pmd_devmap() pages seems to be the right thing to do. (4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page() We won't be returning a memmap that should be ignored by core-mm, or worse, a memmap that does not even exist. Note that while walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't. Most probably this case doesn't currently really happen on the PMD level, otherwise we'd already be able to trigger kernel crashes when reading smaps / smaps_rollup. So most probably only (1) is relevant in practice as of now, but could only cause trouble in extreme corner cases. Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future reuse in wrong context. Link: https://lkml.kernel.org/r/20230803143208.383663-3-david@redhat.com Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock") Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: liubo <liubo254@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: set up vma iterator for vma_iter_prealloc() callsLiam R. Howlett1-11/+7
Set the correct limits for vma_iter_prealloc() calls so that the maple tree can be smarter about how many nodes are needed. Link: https://lkml.kernel.org/r/20230724183157.3939892-11-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: use vma_iter_clear_gfp() in nommuLiam R. Howlett1-0/+12
Move the definition of vma_iter_clear_gfp() from mmap.c to internal.h so it can be used in the nommu code. This will reduce node preallocations in nommu. Link: https://lkml.kernel.org/r/20230724183157.3939892-10-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18maple_tree: re-introduce entry to mas_preallocate() argumentsLiam R. Howlett1-1/+1
The current preallocation strategy is to preallocate the absolute worst-case allocation for a tree modification. The entry (or NULL) is needed to know how many nodes are needed to write to the tree. Start by adding the argument to the mas_preallocate() definition. Link: https://lkml.kernel.org/r/20230724183157.3939892-8-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: remove re-walk from mmap_region()Liam R. Howlett1-0/+8
Using vma_iter_set() will reset the tree and cause a re-walk. Use vmi_iter_config() to set the write to a sub-set of the range. Change the file case to also use vmi_iter_config() so that the end is correctly set. Link: https://lkml.kernel.org/r/20230724183157.3939892-7-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: change do_vmi_align_munmap() tracking of VMAs to removeLiam R. Howlett1-1/+1
The majority of the calls to munmap a vm range is within a single vma. The maple tree is able to store a single entry at 0, with a size of 1 as a pointer and avoid any allocations. Change do_vmi_align_munmap() to store the VMAs being munmap()'ed into a tree indexed by the count. This will leverage the ability to store the first entry without a node allocation. Storing the entries into a tree by the count and not the vma start and end means changing the functions which iterate over the entries. Update unmap_vmas() and free_pgtables() to take a maple state and a tree end address to support this functionality. Passing through the same maple state to unmap_vmas() and free_pgtables() means the state needs to be reset between calls. This happens in the static unmap_region() and exit_mmap(). Link: https://lkml.kernel.org/r/20230724183157.3939892-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm, netfs, fscache: stop read optimisation when folio removed from pagecacheDavid Howells1-1/+4
Fscache has an optimisation by which reads from the cache are skipped until we know that (a) there's data there to be read and (b) that data isn't entirely covered by pages resident in the netfs pagecache. This is done with two flags manipulated by fscache_note_page_release(): if (... test_bit(FSCACHE_COOKIE_HAVE_DATA, &cookie->flags) && test_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags)) clear_bit(FSCACHE_COOKIE_NO_DATA_TO_READ, &cookie->flags); where the NO_DATA_TO_READ flag causes cachefiles_prepare_read() to indicate that netfslib should download from the server or clear the page instead. The fscache_note_page_release() function is intended to be called from ->releasepage() - but that only gets called if PG_private or PG_private_2 is set - and currently the former is at the discretion of the network filesystem and the latter is only set whilst a page is being written to the cache, so sometimes we miss clearing the optimisation. Fix this by following Willy's suggestion[1] and adding an address_space flag, AS_RELEASE_ALWAYS, that causes filemap_release_folio() to always call ->release_folio() if it's set, even if PG_private or PG_private_2 aren't set. Note that this would require folio_test_private() and page_has_private() to become more complicated. To avoid that, in the places[*] where these are used to conditionalise calls to filemap_release_folio() and try_to_release_page(), the tests are removed the those functions just jumped to unconditionally and the test is performed there. [*] There are some exceptions in vmscan.c where the check guards more than just a call to the releaser. I've added a function, folio_needs_release() to wrap all the checks for that. AS_RELEASE_ALWAYS should be set if a non-NULL cookie is obtained from fscache and cleared in ->evict_inode() before truncate_inode_pages_final() is called. Additionally, the FSCACHE_COOKIE_NO_DATA_TO_READ flag needs to be cleared and the optimisation cancelled if a cachefiles object already contains data when we open it. [dwysocha@redhat.com: call folio_mapping() inside folio_needs_release()] Link: https://github.com/DaveWysochanskiRH/kernel/commit/902c990e311120179fa5de99d68364b2947b79ec Link: https://lkml.kernel.org/r/20230628104852.3391651-3-dhowells@redhat.com Fixes: 1f67e6d0b188 ("fscache: Provide a function to note the release of a page") Fixes: 047487c947e8 ("cachefiles: Implement the I/O routines") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com> Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Tested-by: SeongJae Park <sj@kernel.org> Cc: Daire Byrne <daire.byrne@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Xiubo Li <xiubli@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: merge folio_has_private()/filemap_release_folio() call pairsDavid Howells1-0/+8
Patch series "mm, netfs, fscache: Stop read optimisation when folio removed from pagecache", v7. This fixes an optimisation in fscache whereby we don't read from the cache for a particular file until we know that there's data there that we don't have in the pagecache. The problem is that I'm no longer using PG_fscache (aka PG_private_2) to indicate that the page is cached and so I don't get a notification when a cached page is dropped from the pagecache. The first patch merges some folio_has_private() and filemap_release_folio() pairs and introduces a helper, folio_needs_release(), to indicate if a release is required. The second patch is the actual fix. Following Willy's suggestions[1], it adds an AS_RELEASE_ALWAYS flag to an address_space that will make filemap_release_folio() always call ->release_folio(), even if PG_private/PG_private_2 aren't set. folio_needs_release() is altered to add a check for this. This patch (of 2): Make filemap_release_folio() check folio_has_private(). Then, in most cases, where a call to folio_has_private() is immediately followed by a call to filemap_release_folio(), we can get rid of the test in the pair. There are a couple of sites in mm/vscan.c that this can't so easily be done. In shrink_folio_list(), there are actually three cases (something different is done for incompletely invalidated buffers), but filemap_release_folio() elides two of them. In shrink_active_list(), we don't have have the folio lock yet, so the check allows us to avoid locking the page unnecessarily. A wrapper function to check if a folio needs release is provided for those places that still need to do it in the mm/ directory. This will acquire additional parts to the condition in a future patch. After this, the only remaining caller of folio_has_private() outside of mm/ is a check in fuse. Link: https://lkml.kernel.org/r/20230628104852.3391651-1-dhowells@redhat.com Link: https://lkml.kernel.org/r/20230628104852.3391651-2-dhowells@redhat.com Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Steve French <sfrench@samba.org> Cc: Shyam Prasad N <nspmangalore@gmail.com> Cc: Rohith Surabattula <rohiths.msft@gmail.com> Cc: Dave Wysochanski <dwysocha@redhat.com> Cc: Dominique Martinet <asmadeus@codewreck.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Xiubo Li <xiubli@redhat.com> Cc: Jingbo Xu <jefflexu@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-18mm: make show_free_areas() staticKefeng Wang1-0/+6
All callers of show_free_areas() pass 0 and NULL, so we can directly use show_mem() instead of show_free_areas(0, NULL), which could make show_free_areas() a static function. Link: https://lkml.kernel.org/r/20230630062253.189440-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-07-12mm/mmap: Add shadow stack pages to memory accountingRick Edgecombe1-2/+2
The x86 Control-flow Enforcement Technology (CET) feature includes a new type of memory called shadow stack. This shadow stack memory has some unusual properties, which requires some core mm changes to function properly. Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Pengfei Xu <pengfei.xu@intel.com> Tested-by: John Allen <john.allen@amd.com> Tested-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/all/20230613001108.3040476-20-rick.p.edgecombe%40intel.com
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of ↵Linus Torvalds1-39/+48
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintanance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups, - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code - Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...
2023-06-24mm: rename invalidate_mapping_pagevec to mapping_try_invalidateMatthew Wilcox (Oracle)1-2/+2
We don't use pagevecs for the LRU cache any more, and we don't know that the failed invalidations were due to the folio being in an LRU cache. So rename it to be more accurate. Link: https://lkml.kernel.org/r/20230621164557.3510324-12-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-20mm/folio: replace set_compound_order with folio_set_orderTarun Sahu1-16/+16
The patch ("mm/folio: Avoid special handling for order value 0 in folio_set_order") [1] removed the need for special handling of order = 0 in folio_set_order. Now, folio_set_order and set_compound_order becomes similar function. This patch removes the set_compound_order and uses folio_set_order instead. [1] https://lore.kernel.org/all/20230609183032.13E08C433D2@smtp.kernel.org/ Link: https://lkml.kernel.org/r/20230612093514.689846-1-tsahu@linux.ibm.com Signed-off-by: Tarun Sahu <tsahu@linux.ibm.com> Reviewed-by Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-20mm: remove set_compound_page_dtor()Sidhartha Kumar1-1/+1
All users can use the folio equivalent so this function can be safely removed. Link: https://lkml.kernel.org/r/20230612163405.99345-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Tarun Sahu <tsahu@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-20mm/folio: avoid special handling for order value 0 in folio_set_orderTarun Sahu1-6/+2
folio_set_order(folio, 0) is used in kernel at two places __destroy_compound_gigantic_folio and __prep_compound_gigantic_folio. Currently, It is called to clear out the folio->_folio_nr_pages and folio->_folio_order. For __destroy_compound_gigantic_folio: In past, folio_set_order(folio, 0) was needed because page->mapping used to overlap with _folio_nr_pages and _folio_order. So if these fields were left uncleared during freeing gigantic hugepages, they were causing "BUG: bad page state" due to non-zero page->mapping. Now, After Commit a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to CMA") page->mapping has explicitly been cleared out for tail pages. Also, _folio_order and _folio_nr_pages no longer overlaps with page->mapping. So, folio_set_order(folio, 0) can be removed from freeing gigantic folio path (__destroy_compound_gigantic_folio). Another place, folio_set_order(folio, 0) is called inside __prep_compound_gigantic_folio during error path. Here, folio_set_order(folio, 0) can also be removed if we move folio_set_order(folio, order) after for loop. The patch also moves _folio_set_head call in __prep_compound_gigantic_folio() such that we avoid clearing them in the error path. Also, as Mike pointed out: "It would actually be better to move the calls _folio_set_head and folio_set_order in __prep_compound_gigantic_folio() as suggested here. Why? In the current code, the ref count on the 'head page' is still 1 (or more) while those calls are made. So, someone could take a speculative ref on the page BEFORE the tail pages are set up." This way, folio_set_order(folio, 0) is no more needed. And it will also helps removing the confusion of folio order being set to 0 (as _folio_order field is part of first tail page). Testing: I have run LTP tests, which all passes. and also I have written the test in LTP which tests the bug caused by compound_nr and page->mapping overlapping. https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/hugetlb/hugemmap/hugemmap32.c Running on older kernel ( < 5.10-rc7) with the above bug this fails while on newer kernel and, also with this patch it passes. Link: https://lkml.kernel.org/r/20230609162907.111756-1-tsahu@linux.ibm.com Signed-off-by: Tarun Sahu <tsahu@linux.ibm.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm/mlock: rename mlock_future_check() to mlock_future_ok()Andrew Morton1-1/+1
It is felt that the name mlock_future_check() is vague - it doesn't particularly convey the function's operation. mlock_future_ok() is a clearer name for a predicate function. Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm/mmap: refactor mlock_future_check()Lorenzo Stoakes1-2/+2
In all but one instance, mlock_future_check() is treated as a boolean function despite returning an error code. In one instance, this error code is ignored and replaced with -ENOMEM. This is confusing, and the inversion of true -> failure, false -> success is not warranted. Convert the function to a bool, lightly refactor and return true if the check passes, false if not. Link: https://lkml.kernel.org/r/20230522082412.56685-1-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm: update vma_iter_store() to use MAS_WARN_ON()Liam R. Howlett1-11/+10
MAS_WARN_ON() will provide more information on the maple state and can be more useful for debugging. Use this version of WARN_ON() in the debugging code when storing to the tree. Update the printk to a pr_warn(), but this will only be printed when maple tree debug is enabled anyways. Making all print statements into one will keep them together on a busy terminal. Link: https://lkml.kernel.org/r/20230518145544.1722059-19-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm: update validate_mm() to use vma iteratorLiam R. Howlett1-1/+2
Use the vma iterator in the validation code and combine the code to check the maple tree into the main validate_mm() function. Introduce a new function vma_iter_dump_tree() to dump the maple tree in hex layout. Replace all calls to validate_mm_mt() with validate_mm(). [Liam.Howlett@oracle.com: update validate_mm() to use vma iterator CONFIG flag] Link: https://lkml.kernel.org/r/20230606183538.588190-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20230518145544.1722059-18-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10maple_tree: add format option to mt_dump()Liam R. Howlett1-2/+2
Allow different formatting strings to be used when dumping the tree. Currently supports hex and decimal. Link: https://lkml.kernel.org/r/20230518145544.1722059-6-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: David Binderman <dcb314@hotmail.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Vernon Yang <vernon2gm@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm: page_alloc: move is_check_pages_enabled() into page_alloc.cKefeng Wang1-5/+0
The is_check_pages_enabled() only used in page_alloc.c, move it into page_alloc.c, also use it in free_tail_page_prepare(). Link: https://lkml.kernel.org/r/20230516063821.121844-14-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm: page_alloc: move sysctls into it own filsKefeng Wang1-0/+11
This moves all page alloc related sysctls to its own file, as part of the kernel/sysctl.c spring cleaning, also move some functions declarations from mm.h into internal.h. Link: https://lkml.kernel.org/r/20230516063821.121844-13-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-10mm: page_alloc: move set_zone_contiguous() into mm_init.cKefeng Wang1-0/+7
set_zone_contiguous() is only used in mm init/hotplug, and clear_zone_contiguous() only used in hotplug, move them from page_alloc.c to the more appropriate file. Link: https://lkml.kernel.org/r/20230516063821.121844-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Len Brown <len.brown@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-05-18x86/mm: Add early_memremap_pgprot_adjust() prototypeArnd Bergmann1-6/+0
early_memremap_pgprot_adjust() is a __weak function with a local prototype, but x86 has a custom implementation that does not see the prototype, causing a W=1 warning: arch/x86/mm/ioremap.c:785:17: error: no previous prototype for 'early_memremap_pgprot_adjust' [-Werror=missing-prototypes] Move the declaration into the global linux/io.h header to avoid this. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Link: https://lore.kernel.org/all/20230516193549.544673-19-arnd%40kernel.org
2023-04-22mm: correct arg in reclaim_pages()/reclaim_clean_pages_from_list()Kefeng Wang1-1/+2
Both of them change the arg from page_list to folio_list when convert them to use a folio, but not the declaration, let's correct it, also move the reclaim_pages() from swap.h to internal.h as it only used in mm. Link: https://lkml.kernel.org/r/20230417114807.186786-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviwed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: SeongJae Park <sj@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-19mm: apply __must_check to vmap_pages_range_noflush()Alexander Potapenko1-5/+5
To prevent errors when vmap_pages_range_noflush() or __vmap_pages_range_noflush() silently fail (see the link below for an example), annotate them with __must_check so that the callers do not unconditionally assume the mapping succeeded. Link: https://lkml.kernel.org/r/20230413131223.4135168-4-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Link: https://lore.kernel.org/linux-mm/CANX2M5ZRrRA64k0hOif02TjmY9kbbO2aCBPyq79es34RXZ=cAw@mail.gmail.com/ Reviewed-by: Marco Elver <elver@google.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-19mm: move free_area_empty() to mm/internal.hMike Rapoport (IBM)1-0/+5
The free_area_empty() helper is only used inside mm/ so move it there to reduce noise in include/linux/mmzone.h Link: https://lkml.kernel.org/r/20230326160215.2674531-1-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-06mm: conditionally write-lock VMA in free_pgtablesSuren Baghdasaryan1-1/+1
Normally free_pgtables needs to lock affected VMAs except for the case when VMAs were isolated under VMA write-lock. munmap() does just that, isolating while holding appropriate locks and then downgrading mmap_lock and dropping per-VMA locks before freeing page tables. Add a parameter to free_pgtables for such scenario. Link: https://lkml.kernel.org/r/20230227173632.3292573-20-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-06mm: move vmalloc_init() declaration to mm/internal.hMike Rapoport (IBM)1-0/+5
vmalloc_init() is called only from mm_core_init(), there is no need to declare it in include/linux/vmalloc.h Move vmalloc_init() declaration to mm/internal.h Link: https://lkml.kernel.org/r/20230321170513.2401534-14-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-06mm: move mem_init_print_info() to mm_init.cMike Rapoport (IBM)1-0/+1
mem_init_print_info() is only called from mm_core_init(). Move it close to the caller and make it static. Link: https://lkml.kernel.org/r/20230321170513.2401534-12-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-04-06mm: move init_mem_debugging_and_hardening() to mm/mm_init.cMike Rapoport (IBM)1-0/+8
init_mem_debugging_and_hardening() is only called from mm_core_init(). Move it close to the caller, make it static and rename it to mem_debugging_and_hardening_init() for consistency with surrounding convention. Link: https://lkml.kernel.org/r/20230321170513.2401534-10-rppt@kernel.org Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Doug Berger <opendmb@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>