summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)AuthorFilesLines
2026-01-27zsmalloc: introduce SG-list based object read APISergey Senozhatsky1-0/+4
Currently, zsmalloc performs address linearization on read (which sometimes requires memcpy() to a local buffer). Not all zsmalloc users need a linear address. For example, Crypto API supports SG-list, performing linearization under the hood, if needed. In addition, some compressors can have native SG-list support, completely avoiding the linearization step. Provide an SG-list based zsmalloc read API: - zs_obj_read_sg_begin() - zs_obj_read_sg_end() This API allows callers to obtain an SG representation of the object (one entry for objects that are contained in a single page and two entries for spanning objects), avoiding the need for a bounce buffer and memcpy. [senozhatsky@chromium.org: make zs_obj_read_sg_begin() return void, per Yosry] Link: https://lkml.kernel.org/r/20260117024900.792237-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20260113034645.2729998-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Yosry Ahmed <yosry.ahmed@linux.dev> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Brian Geffon <bgeffon@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm/damon/core: introduce [in]active memory ratio damos quota goal metricSeongJae Park1-0/+4
Patch series "mm/damon: advance DAMOS-based LRU sorting". DAMOS_LRU_[DE]PRIO actions were added to DAMOS for more access-aware LRU lists sorting. For simple usage, a specialized kernel module, namely DAMON_LRU_SORT, has also been introduced. After the introduction of the module, DAMON got a few important new features, including the aim-based quota auto-tuning, age tracking, young page filter, and monitoring intervals auto-tuning. Meanwhile, DAMOS-based LRU sorting had no direct updates. Now we show some rooms to advance for DAMOS-based LRU sorting. Firstly, the aim-oriented quota auto-tuning can simplify the LRU sorting parameters tuning. But there is no good auto-tuning target metric for LRU sorting use case. Secondly, the behavior of DAMOS_LRU_[DE]PRIO are not very symmetric. DAMOS_LRU_DEPRIO directly moves the pages to inactive LRU list, while DAMOS_LRU_PRIO only marks the page as accessed, so that the page can not directly but only eventually moved to the active LRU list. Finally, DAMON_LRU_SORT users cannot utilize the modern features that can be useful for them, too. Improve the situation with the following changes. First, introduce a new DAMOS quota auto-tuning target metric for active:inactive memory size ratio. Since LRU sorting is a kind of balancing of active and inactive pages, the active:inactive memory size ratio can be intuitively set. Second, update DAMOS_LRU_[DE]PRIO behaviors to be more intuitive and symmetric, by letting them directly move the pages to [in]active LRU list. Third, update the DAMON_LRU_SORT module user interface to be able to fully utilize the modern features including the [in]active memory size ratio-based quota auto-tuning, young page filter, and monitoring intervals auto-tuning. With these changes, for example, users can now ask DAMON to "find hot/cold memory regions with auto-tuned monitoring intervals, do one more page level access check for found hot/cold memory, and move pages of those to active or inactive LRU lists accordingly, aiming X:Y active to inactive memory ratio." For example, if they know 30% of the memory is better to be protected from reclamation, 30:70 can be set as the target ratio. Test Results ------------ I ran DAMON_LRU_SORT with the features introduced by this series, on a real world server workload. For the active:inactive ratio goal, I set 50:50. I confirmed it achieves the target active:inactive ratio, without manual tuning of the monitoring intervals and the hot/coldness thresholds. The baseline system that was not running the DAMON_LRU_SORT was keeping active:inactive ratio of about 1:10. Note that the test didn't show a clear performance difference, though. I believe that was mainly because the workload was not very memory intensive. Also, whether the 50:50 target ratio was optimum is unclear. Nonetheless, the positive performance impact of the basic LRU sorting idea is already confirmed with the initial DAMON_LRU_SORT introduction patch series. The goal of this patch series is simplifying the parameters tuning of DAMOS-based LRU sorting, and the test confirmed the aimed goals are achieved. Patches Sequence ---------------- First three patches extend DAMOS quota auto-tuning to support [in]active memory ratio target metric type. Those (patches 1-3) introduce new metrics, implement DAMON sysfs support, and update the documentation, respectively. Following patch (patch 4) makes DAMOS_LRU_PRIO action to directly move target pages to active LRU list, instead of only marking them accessed. Following seven patches (patches 5-11) updates DAMON_LRU_SORT to support modern DAMON features. Patch 5 makes it uses not only access frequency but also age at under-quota regions prioritization. Patches 6-11 add the support for young page filtering, active:inactive memory ratio based quota auto-tuning, and monitoring intervals auto-tuning, with appropriate document updates. This patch (of 11): DAMOS_LRU_[DE]PRIO are DAMOS actions for making balance of active and inactive memory size. There is no appropriate DAMOS quota auto-tuning target metric for the use case. Add two new DAMOS quota goal metrics for the purpose, namely DAMOS_QUOTA_[IN]ACTIVE_MEM_BP. Those will represent the ratio of [in]active memory to total (inactive + active) memory. Hence, users will be able to ask DAMON to, for example, "find hot and cold memory, and move pages of those to active and inactive LRU lists, adjusting the hot/cold thresholds aiming 50:50 active:inactive memory ratio." Link: https://lkml.kernel.org/r/20260113152717.70459-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260113152717.70459-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm: cma: add cma_alloc_frozen{_compound}()Kefeng Wang1-20/+6
Introduce cma_alloc_frozen{_compound}() helper to alloc pages without incrementing their refcount, then convert hugetlb cma to use the cma_alloc_frozen_compound() and cma_release_frozen() and remove the unused cma_{alloc,free}_folio(), also move the cma_validate_zones() into mm/internal.h since no outside user. The set_pages_refcounted() is only called to set non-compound pages after above changes, so remove the processing about PageHead. Link: https://lkml.kernel.org/r/20260109093136.1491549-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm: page_alloc: add alloc_contig_frozen_{range,pages}()Kefeng Wang1-31/+21
In order to allocate given range of pages or allocate compound pages without incrementing their refcount, adding two new helper alloc_contig_frozen_{range,pages}() which may be beneficial to some users (eg hugetlb). The new alloc_contig_{range,pages} only take !__GFP_COMP gfp now, and the free_contig_range() is refactored to only free non-compound pages, the only caller to free compound pages in cma_free_folio() is changed accordingly, and the free_contig_frozen_range() is provided to match the alloc_contig_frozen_range(), which is used to free frozen pages. Link: https://lkml.kernel.org/r/20260109093136.1491549-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm: cma: kill cma_pages_valid()Kefeng Wang1-1/+0
Kill cma_pages_valid() which only used in cma_release(), also cleanup code duplication between cma pages valid checking and cma memrange finding. Link: https://lkml.kernel.org/r/20260109093136.1491549-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm: page_alloc: add __split_page()Kefeng Wang1-0/+10
Factor out the splitting of non-compound page from make_alloc_exact() and split_page() into a new helper function __split_page(). While at it, convert the VM_BUG_ON_PAGE() into a VM_WARN_ON_PAGE(). Link: https://lkml.kernel.org/r/20260109093136.1491549-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm: debug_vm_pgtable: add debug_vm_pgtable_free_huge_page()Kefeng Wang1-1/+1
Patch series "mm: hugetlb: allocate frozen gigantic folio", v6. Introduce alloc_contig_frozen_pages() and cma_alloc_frozen_compound() which avoid atomic operation about page refcount, and then convert to allocate frozen gigantic folio by the new helpers in hugetlb to cleanup the alloc_gigantic_folio(). This patch (of 6): Add a new helper to free huge page to be consistency to debug_vm_pgtable_alloc_huge_page(), and use HPAGE_PUD_ORDER instead of open-code. Also move the free_contig_range() under CONFIG_ALLOC_CONTIG since all caller are built with CONFIG_ALLOC_CONTIG. Link: https://lkml.kernel.org/r/20260109093136.1491549-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27zsmalloc: use actual object size to detect spansSergey Senozhatsky1-2/+2
Using class->size to detect spanning objects is not entirely correct, because some size classes can hold a range of object sizes of up to class->size bytes in length, due to size-classes merge. Such classes use padding for cases when actually written objects are smaller than class->size. zs_obj_read_begin() can incorrectly hit the slow path and perform memcpy of such objects, basically copying padding bytes. Instead of class->size zs_obj_read_begin() should use the actual compressed object length (both zram and zswap know it) so that it can correctly handle situations when a written object is small enough to fit into the first physical page. Link: https://lkml.kernel.org/r/20260107052145.3586917-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev> [zsmalloc & zswap] Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27memcg: rename mem_cgroup_ino() to mem_cgroup_id()Shakeel Butt1-4/+4
Rename mem_cgroup_ino() to mem_cgroup_id() and mem_cgroup_get_from_ino() to mem_cgroup_get_from_id(). These functions now use cgroup IDs (from cgroup_id()) rather than inode numbers, so the names should reflect that. [shakeel.butt@linux.dev: replace ino with id, per SeongJae] Link: https://lkml.kernel.org/r/flkqanhyettp5uq22bjwg37rtmnpeg3mghznsylxcxxgaafpl4@nov2x7tagma7 [akpm@linux-foundation.org: build fix] Link: https://lkml.kernel.org/r/20251225232116.294540-9-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27memcg: remove unused mem_cgroup_id() and mem_cgroup_from_id()Shakeel Butt1-18/+0
Now that all callers have been converted to use either: - The private ID APIs (mem_cgroup_private_id/mem_cgroup_from_private_id) for internal kernel objects that outlive their cgroup - The public cgroup ID APIs (mem_cgroup_ino/mem_cgroup_get_from_ino) for external interfaces Remove the unused wrapper functions mem_cgroup_id() and mem_cgroup_from_id() along with their !CONFIG_MEMCG stubs. Link: https://lkml.kernel.org/r/20251225232116.294540-8-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm/damon: use cgroup ID instead of private memcg IDShakeel Butt1-2/+2
DAMON was using the internal private memcg ID which is meant for tracking kernel objects that outlive their cgroup. Switch to using the public cgroup ID instead. Link: https://lkml.kernel.org/r/20251225232116.294540-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27memcg: use cgroup_id() instead of cgroup_ino() for memcg IDShakeel Butt1-5/+5
Switch mem_cgroup_ino() from using cgroup_ino() to cgroup_id(). The cgroup_ino() returns the kernfs inode number while cgroup_id() returns the kernfs node ID. For 64-bit systems, they are the same. Also cgroup_get_from_id() expects 64-bit node ID which is called by mem_cgroup_get_from_ino(). Change the type from unsigned long to u64 to match cgroup_id()'s return type, and update the format specifiers accordingly. Note that the names mem_cgroup_ino() and mem_cgroup_get_from_ino() are now misnomers since they deal with cgroup IDs rather than inode numbers. A follow-up patch will rename them. Link: https://lkml.kernel.org/r/20251225232116.294540-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27memcg: expose mem_cgroup_ino() and mem_cgroup_get_from_ino() unconditionallyShakeel Butt1-4/+0
Remove the CONFIG_SHRINKER_DEBUG guards around mem_cgroup_ino() and mem_cgroup_get_from_ino(). These APIs provide a way to get a memcg's cgroup inode number and to look up a memcg from an inode number respectively. Making these functions unconditionally available allows other in-kernel users to leverage them without requiring CONFIG_SHRINKER_DEBUG to be enabled. No functional change for existing users. Link: https://lkml.kernel.org/r/20251225232116.294540-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27memcg: introduce private id API for in-kernel usersShakeel Butt1-3/+21
Patch series "memcg: separate private and public ID namespaces". The memory cgroup subsystem maintains a private ID infrastructure that is decoupled from the cgroup IDs. This private ID system exists because some kernel objects (like swap entries and shadow entries in the workingset code) can outlive the cgroup they were associated with. The motivation is best described in commit 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs"). Unfortunately, some in-kernel users (DAMON, LRU gen debugfs interface, shrinker debugfs) started exposing these private IDs to userspace. This is problematic because: 1. The private IDs are internal implementation details that could change 2. Userspace already has access to cgroup IDs through the cgroup filesystem 3. Using different ID namespaces in different interfaces is confusing This series cleans up the memcg ID infrastructure by: 1. Explicitly marking the private ID APIs with "private" in their names to make it clear they are for internal use only (swap/workingset) 2. Making the public cgroup ID APIs (mem_cgroup_id/mem_cgroup_get_from_id) unconditionally available 3. Converting DAMON, LRU gen, and shrinker debugfs interfaces to use the public cgroup IDs instead of the private IDs 4. Removing the now-unused wrapper functions and renaming the public APIs for clarity After this series: - mem_cgroup_private_id() / mem_cgroup_from_private_id() are used for internal kernel objects that outlive their cgroup (swap, workingset) - mem_cgroup_id() / mem_cgroup_get_from_id() return the public cgroup ID (from cgroup_id()) for use in userspace-facing interfaces This patch (of 8): The memory cgroup maintains a private ID infrastructure decoupled from the cgroup IDs for swapout records and shadow entries. The main motivation of this private ID infra is best described in the commit 73f576c04b941 ("mm: memcontrol: fix cgroup creation failure after many small jobs"). Unfortunately some users have started exposing these private IDs to the userspace where they should have used the cgroup IDs which are already exposed to the userspace. Let's rename the memcg ID APIs to explicitly mark them private. No functional change is intended. Link: https://lkml.kernel.org/r/20251225232116.294540-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20251225232116.294540-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm/page_alloc: refactor the initial compaction handlingVlastimil Babka1-1/+7
The initial direct compaction done in some cases in __alloc_pages_slowpath() stands out from the main retry loop of reclaim + compaction. We can simplify this by instead skipping the initial reclaim attempt via a new local variable compact_first, and handle the compact_prority as necessary to match the original behavior. No functional change intended. Link: https://lkml.kernel.org/r/20260106-thp-thisnode-tweak-v3-2-f5d67c21a193@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David Hildenbrand (Red Hat) <david@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm/mmap_lock: add vma_is_attached() helperLorenzo Stoakes1-2/+7
This makes it easy to explicitly check for VMA detachment, which is useful for things like asserts. Note that we intentionally do not allow this function to be available should CONFIG_PER_VMA_LOCK be set - this is because vma_assert_attached() and vma_assert_detached() are no-ops if !CONFIG_PER_VMA_LOCK, so there is no correct state for vma_is_attached() to be in if this configuration option is not specified. Therefore users elsewhere must invoke this function only after checking for CONFIG_PER_VMA_LOCK. We rework the assert functions to utilise this. Link: https://lkml.kernel.org/r/0172d3bf527ca54ba27d8bce8f8476095b241ac7.1768746221.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chriscli@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm/rmap: make anon_vma functions internalLorenzo Stoakes1-60/+0
The bulk of the anon_vma operations are only used by mm, so formalise this by putting the function prototypes and inlines in mm/internal.h. This allows us to make changes without having to worry about the rest of the kernel. Link: https://lkml.kernel.org/r/79ec933c3a9c8bf1f64dab253bbfdae8a01cb921.1768746221.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chriscli@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm/rmap: remove anon_vma_merge() functionLorenzo Stoakes1-7/+0
This function is confusing, we already have the concept of anon_vma merge to adjacent VMA's anon_vma's to increase probability of anon_vma compatibility and therefore VMA merge (see is_mergeable_anon_vma() etc.), as well as anon_vma reuse, along side the usual VMA merge logic. We can remove the anon_vma check as it is redundant - a merge would not have been permitted with removal if the anon_vma's were not the same (and in the case of an unfaulted/faulted merge, we would have already set the unfaulted VMA's anon_vma to vp->remove->anon_vma in dup_anon_vma()). Avoid overloading this term when we're very simply unlinking anon_vma state from a removed VMA upon merge. Link: https://lkml.kernel.org/r/56bbe45e309f7af197b1c4f94a9a0c8931ff2d29.1768746221.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chriscli@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Harry Yoo <harry.yoo@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27Revert "mm/hugetlb: deal with multiple calls to hugetlb_bootmem_alloc"Mike Rapoport (Microsoft)1-6/+0
This reverts commit d58b2498200724e4f8c12d71a5953da03c8c8bdf. hugetlb_bootmem_alloc() is called only once, no need to check if it was called already at its entry. Other checks performed during HVO initialization are also no longer necessary because sparse_init() that calls hugetlb_vmemmap_init_early() and hugetlb_vmemmap_init_late() is always called after hugetlb_bootmem_alloc(). Link: https://lkml.kernel.org/r/20260111082105.290734-30-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Shi <alexs@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm, arch: consolidate hugetlb CMA reservationMike Rapoport (Microsoft)1-2/+4
Every architecture that supports hugetlb_cma command line parameter reserves CMA areas for hugetlb during setup_arch(). This obfuscates the ordering of hugetlb CMA initialization with respect to the rest initialization of the core MM. Introduce arch_hugetlb_cma_order() callback to allow architectures report the desired order-per-bit of CMA areas and provide a week implementation of arch_hugetlb_cma_order() for architectures that don't support hugetlb with CMA. Use this callback in hugetlb_cma_reserve() instead if passing the order as parameter and call hugetlb_cma_reserve() from mm_core_init_early() rather than have it spread over architecture specific code. Link: https://lkml.kernel.org/r/20260111082105.290734-28-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Shi <alexs@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27arch, mm: consolidate initialization of SPARSE memory modelMike Rapoport (Microsoft)1-2/+0
Every architecture calls sparse_init() during setup_arch() although the data structures created by sparse_init() are not used until the initialization of the core MM. Beside the code duplication, calling sparse_init() from architecture specific code causes ordering differences of vmemmap and HVO initialization on different architectures. Move the call to sparse_init() from architecture specific code to free_area_init() to ensure that vmemmap and HVO initialization order is always the same. Link: https://lkml.kernel.org/r/20260111082105.290734-25-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Shi <alexs@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27arch, mm: consolidate initialization of nodes, zones and memory mapMike Rapoport (Microsoft)1-2/+2
To initialize node, zone and memory map data structures every architecture calls free_area_init() during setup_arch() and passes it an array of zone limits. Beside code duplication it creates "interesting" ordering cases between allocation and initialization of hugetlb and the memory map. Some architectures allocate hugetlb pages very early in setup_arch() in certain cases, some only create hugetlb CMA areas in setup_arch() and sometimes hugetlb allocations happen mm_core_init(). With arch_zone_limits_init() helper available now on all architectures it is no longer necessary to call free_area_init() from architecture setup code. Rather core MM initialization can call arch_zone_limits_init() in a single place. This allows to unify ordering of hugetlb vs memory map allocation and initialization. Remove the call to free_area_init() from architecture specific code and place it in a new mm_core_init_early() function that is called immediately after setup_arch(). After this refactoring it is possible to consolidate hugetlb allocations and eliminate differences in ordering of hugetlb and memory map initialization among different architectures. As the first step of this consolidation move hugetlb_bootmem_alloc() to mm_core_early_init(). Link: https://lkml.kernel.org/r/20260111082105.290734-24-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Shi <alexs@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Magnus Lindholm <linmag7@gmail.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27alpha: introduce arch_zone_limits_init()Mike Rapoport (Microsoft)1-0/+1
Patch series "arch, mm: consolidate hugetlb early reservation", v3. Order in which early memory reservation for hugetlb happens depends on architecture, on configuration options and on command line parameters. Some architectures rely on the core MM to call hugetlb_bootmem_alloc() while others call it very early to allow pre-allocation of HVO-style vmemmap. When hugetlb_cma is supported by an architecture it is initialized during setup_arch() and then later hugetlb_init code needs to understand did it happen or not. To make everything consistent and unified, both reservation of hugetlb memory from bootmem and creation of CMA areas for hugetlb must be called from core MM initialization and it would have been a simple change. However, HVO-style pre-initialization ordering requirements slightly complicate things and for HVO pre-init to work sparse and memory map should be initialized after hugetlb reservations. This required pulling out the call to free_area_init() out of setup_arch() path and moving it MM initialization and this is what the first 23 patches do. These changes are deliberately split into per-arch patches that change how the zone limits are calculated for each architecture and the patches 22 and 23 just remove the calls to free_area_init() and sprase_init() from arch/*. Patch 24 is a simple cleanup for MIPS. Patches 25 and 26 actually consolidate hugetlb reservations and patches 27 and 28 perform some aftermath cleanups. This patch (of 29): Move calculations of zone limits to a dedicated arch_zone_limits_init() function. Later MM core will use this function as an architecture specific callback during nodes and zones initialization and thus there won't be a need to call free_area_init() from every architecture. Link: https://lkml.kernel.org/r/20260111082105.290734-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20260111082105.290734-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Magnus Lindholm <linmag7@gmail.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Shi <alexs@kernel.org> Cc: Andreas Larsson <andreas@gaisler.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm/khugepaged: map dirty/writeback pages failures to EAGAINShivank Garg1-1/+2
Patch series "mm/khugepaged: fix dirty page handling for MADV_COLLAPSE", v5. MADV_COLLAPSE on file-backed mappings fails with -EINVAL when TEXT pages are dirty. This affects scenarios like package/container updates or executing binaries immediately after writing them, etc. The issue is that collapse_file() triggers async writeback and returns SCAN_FAIL (maps to -EINVAL), expecting khugepaged to revisit later. But MADV_COLLAPSE is synchronous and userspace expects immediate success or a clear retry signal. Reproduction: - Compile or copy 2MB-aligned executable to XFS/ext4 FS - Call MADV_COLLAPSE on .text section - First call fails with -EINVAL (text pages dirty from copy) - Second call succeeds (async writeback completed) Issue Report: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com This patch (of 2): When collapse_file encounters dirty or writeback pages in file-backed mappings, it currently returns SCAN_FAIL which maps to -EINVAL. This is misleading as EINVAL suggests invalid arguments, whereas dirty/writeback pages represent transient conditions that may resolve on retry. Introduce SCAN_PAGE_DIRTY_OR_WRITEBACK to cover both dirty and writeback states, mapping it to -EAGAIN. For MADV_COLLAPSE, this provides userspace with a clear signal that retry may succeed after writeback completes. For khugepaged, this is harmless as it will naturally revisit the range during periodic scans after async writeback completes. Link: https://lkml.kernel.org/r/20260118190939.8986-2-shivankg@amd.com Link: https://lkml.kernel.org/r/20260118190939.8986-4-shivankg@amd.com Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") Signed-off-by: Shivank Garg <shivankg@amd.com> Reported-by: Branden Moore <Branden.Moore@amd.com> Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: wang lian <lianux.mm@gmail.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27atomic: add option for weaker alignment checkFinn Thain1-1/+7
Add a new Kconfig symbol to make CONFIG_DEBUG_ATOMIC more useful on those architectures which do not align dynamic allocations to 8-byte boundaries. Without this, CONFIG_DEBUG_ATOMIC produces excessive WARN splats. Link: https://lkml.kernel.org/r/6d25a12934fe9199332f4d65d17c17de450139a8.1768281748.git.fthain@linux-m68k.org Signed-off-by: Finn Thain <fthain@linux-m68k.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Hao Luo <haoluo@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: KP Singh <kpsingh@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Sasha Levin (Microsoft) <sashal@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27atomic: add alignment check to instrumented atomic operationsPeter Zijlstra1-0/+11
Add a Kconfig option for debug builds which logs a warning when an instrumented atomic operation takes place that's misaligned. Some platforms don't trap for this. [fthain@linux-m68k.org: added __DISABLE_EXPORTS conditional and refactored as helper function] Link: https://lkml.kernel.org/r/51ebf844e006ca0de408f5d3a831e7b39d7fc31c.1768281748.git.fthain@linux-m68k.org Link: https://lore.kernel.org/lkml/20250901093600.GF4067720@noisy.programming.kicks-ass.net/ Link: https://lore.kernel.org/linux-next/df9fbd22-a648-ada4-fee0-68fe4325ff82@linux-m68k.org/ Signed-off-by: Finn Thain <fthain@linux-m68k.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Sasha Levin <sashal@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Guo Ren <guoren@kernel.org> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: KP Singh <kpsingh@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Rich Felker <dalias@libc.org> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Will Deacon <will@kernel.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27atomic: specify alignment for atomic_t and atomic64_tFinn Thain2-2/+2
Some recent commits incorrectly assumed 4-byte alignment of locks. That assumption fails on Linux/m68k (and, interestingly, would have failed on Linux/cris also). The jump label implementation makes a similar alignment assumption. The expectation that atomic_t and atomic64_t variables will be naturally aligned seems reasonable, as indeed they are on 64-bit architectures. But atomic64_t isn't naturally aligned on csky, m68k, microblaze, nios2, openrisc and sh. Neither atomic_t nor atomic64_t are naturally aligned on m68k. This patch brings a little uniformity by specifying natural alignment for atomic types. One benefit is that atomic64_t variables do not get split across a page boundary. The cost is that some structs grow which leads to cache misses and wasted memory. See also, commit bbf2a330d92c ("x86: atomic64: The atomic64_t data type should be 8 bytes aligned on 32-bit too"). Link: https://lkml.kernel.org/r/a76bc24a4e7c1d8112d7d5fa8d14e4b694a0e90c.1768281748.git.fthain@linux-m68k.org Link: https://lore.kernel.org/lkml/CAFr9PX=MYUDGJS2kAvPMkkfvH+0-SwQB_kxE4ea0J_wZ_pk=7w@mail.gmail.com Link: https://lore.kernel.org/lkml/CAMuHMdW7Ab13DdGs2acMQcix5ObJK0O2dG_Fxzr8_g58Rc1_0g@mail.gmail.com/ Signed-off-by: Finn Thain <fthain@linux-m68k.org> Acked-by: Guo Ren <guoren@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Guo Ren <guoren@kernel.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Stafford Horne <shorne@gmail.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Hao Luo <haoluo@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sasha Levin (Microsoft) <sashal@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27bpf: explicitly align bpf_res_spin_lockFinn Thain1-1/+1
Patch series "Align atomic storage", v7. This series adds the __aligned attribute to atomic_t and atomic64_t definitions in include/linux and include/asm-generic (respectively) to get natural alignment of both types on csky, m68k, microblaze, nios2, openrisc and sh. This series also adds Kconfig options to enable a new run-time warning to help reveal misaligned atomic accesses on platforms which don't trap that. The performance impact is expected to vary across platforms and workloads. The measurements I made on m68k show that some workloads run faster and others slower. This patch (of 4): Align bpf_res_spin_lock to avoid a BUILD_BUG_ON() when the alignment changes, as it will do on m68k when, in a subsequent patch, the minimum alignment of the atomic_t member of struct rqspinlock gets increased from 2 to 4. Drop the BUILD_BUG_ON() as it becomes redundant. Link: https://lkml.kernel.org/r/cover.1768281748.git.fthain@linux-m68k.org Link: https://lkml.kernel.org/r/8a83876b07d1feacc024521e44059ae89abbb1ea.1768281748.git.fthain@linux-m68k.org Signed-off-by: Finn Thain <fthain@linux-m68k.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: "Borislav Petkov (AMD)" <bp@alien8.de> Cc: Daniel Borkman <daniel@iogearbox.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Gary Guo <gary@garyguo.net> Cc: Guo Ren <guoren@kernel.org> Cc: Hao Luo <haoluo@google.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonas Bonn <jonas@southpole.se> Cc: KP Singh <kpsingh@kernel.org> Cc: Marc Rutland <mark.rutland@arm.com> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Sasha Levin (Microsoft) <sashal@kernel.org> Cc: Song Liu <song@kernel.org> Cc: Stafford Horne <shorne@gmail.com> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27linux/log2.h: reduce instruction count for is_power_of_2()Maciej W. Rozycki1-1/+1
Follow an observation that (n ^ (n - 1)) will only ever retain the most significant bit set in the word operated on if that is the only bit set in the first place, and use it to determine whether a number is a whole power of 2, avoiding the need for an explicit check for nonzero. This reduces the sequence produced to 3 instructions only across Alpha, MIPS, and RISC-V targets, down from 4, 5, and 4 respectively, removing a branch in the two latter cases. And it's 5 instructions on POWER and x86-64 vs 8 and 9 respectively. There are no branches now emitted here for targets that have a suitable conditional set operation, although an inline expansion will often end with one, depending on what code a call to this function is used in. Credit goes to GCC authors for coming up with this optimisation used as the fallback for (__builtin_popcountl(n) == 1), equivalent to this code, for targets where the hardware population count operation is considered expensive. Link: https://lkml.kernel.org/r/alpine.DEB.2.21.2601111836250.30566@angie.orcam.me.uk Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: John Garry <john.g.garry@oracle.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Su Hui <suhui@nfschina.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27kho/abi: add memblock ABI headerMike Rapoport (Microsoft)1-0/+73
Introduce KHO ABI header describing preservation ABI for memblock's reserve_mem regions and link the relevant documentation to KHO docs. [lukas.bulwahn@redhat.com: MAINTAINERS: adjust file entry in MEMBLOCK AND MEMORY MANAGEMENT INITIALIZATION] Link: https://lkml.kernel.org/r/20260107090438.22901-1-lukas.bulwahn@redhat.com [rppt@kernel.org: update reserved_mem node description, per Pratyush] Link: https://lkml.kernel.org/r/aW_M-HYZzx5SkbnZ@kernel.org Link: https://lkml.kernel.org/r/20260105165839.285270-7-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Jason Miu <jasonmiu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27kho: relocate vmalloc preservation structure to KHO ABI headerJason Miu3-26/+81
The `struct kho_vmalloc` defines the in-memory layout for preserving vmalloc regions across kexec. This layout is a contract between kernels and part of the KHO ABI. To reflect this relationship, the related structs and helper macros are relocated to the ABI header, `include/linux/kho/abi/kexec_handover.h`. This move places the structure's definition under the protection of the KHO_FDT_COMPATIBLE version string. The structure and its components are now also documented within the ABI header to describe the contract and prevent ABI breaks. [rppt@kernel.org: update comment, per Pratyush] Link: https://lkml.kernel.org/r/aW_Mqp6HcqLwQImS@kernel.org Link: https://lkml.kernel.org/r/20260105165839.285270-6-rppt@kernel.org Signed-off-by: Jason Miu <jasonmiu@google.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Pratyush Yadav <pratyush@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27kho: introduce KHO FDT ABI headerJason Miu1-0/+85
Introduce the `include/linux/kho/abi/kexec_handover.h` header file, which defines the stable ABI for the KHO mechanism. This header specifies how preserved data is passed between kernels using an FDT. The ABI contract includes the FDT structure, node properties, and the "kho-v1" compatible string. By centralizing these definitions, this header serves as the foundational agreement for inter-kernel communication of preserved states, ensuring forward compatibility and preventing misinterpretation of data across kexec transitions. Since the ABI definitions are now centralized in the header files, the YAML files that previously described the FDT interfaces are redundant. These redundant files have therefore been removed. Link: https://lkml.kernel.org/r/20260105165839.285270-5-rppt@kernel.org Signed-off-by: Jason Miu <jasonmiu@google.com> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27kho/abi: memfd: make generated documentation more coherentMike Rapoport (Microsoft)1-2/+2
memfd preservation ABI description starts with "This header defines" which is fine in the header but reads weird in the generated html documentation. Update it to make the generated documentation coherent. Link: https://lkml.kernel.org/r/20260105165839.285270-3-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Jason Miu <jasonmiu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27kho/abi: luo: make generated documentation more coherentMike Rapoport (Microsoft)1-4/+4
Patch series "kho: ABI headers and Documentation updates". LUO started adding KHO ABI headers to include/linux/kho/abi, but the core parts of KHO and memblock are still using the old way for descriptions on their ABIs. Let's consolidate all things KHO in include/linux/kho/abi. And while on that, make some documentation updates to have more coherent KHO docs. This patch (of 6): LUO ABI description starts with "This header defines" which is fine in the header but reads weird in the generated html documentation. Update it to make the generated documentation coherent. Link: https://lkml.kernel.org/r/20260105165839.285270-1-rppt@kernel.org Link: https://lkml.kernel.org/r/20260105165839.285270-2-rppt@kernel.org Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Jason Miu <jasonmiu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27ima: verify the previous kernel's IMA buffer lies in addressable RAMHarshit Mogalapalli1-0/+1
Patch series "Address page fault in ima_restore_measurement_list()", v3. When the second-stage kernel is booted via kexec with a limiting command line such as "mem=<size>" we observe a pafe fault that happens. BUG: unable to handle page fault for address: ffff97793ff47000 RIP: ima_restore_measurement_list+0xdc/0x45a #PF: error_code(0x0000) not-present page This happens on x86_64 only, as this is already fixed in aarch64 in commit: cbf9c4b9617b ("of: check previous kernel's ima-kexec-buffer against memory bounds") This patch (of 3): When the second-stage kernel is booted with a limiting command line (e.g. "mem=<size>"), the IMA measurement buffer handed over from the previous kernel may fall outside the addressable RAM of the new kernel. Accessing such a buffer can fault during early restore. Introduce a small generic helper, ima_validate_range(), which verifies that a physical [start, end] range for the previous-kernel IMA buffer lies within addressable memory: - On x86, use pfn_range_is_mapped(). - On OF based architectures, use page_is_ram(). Link: https://lkml.kernel.org/r/20251231061609.907170-1-harshit.m.mogalapalli@oracle.com Link: https://lkml.kernel.org/r/20251231061609.907170-2-harshit.m.mogalapalli@oracle.com Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Reviewed-by: Mimi Zohar <zohar@linux.ibm.com> Cc: Alexander Graf <graf@amazon.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Borislav Betkov <bp@alien8.de> Cc: guoweikang <guoweikang.kernel@gmail.com> Cc: Henry Willard <henry.willard@oracle.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Joel Granados <joel.granados@kernel.org> Cc: Jonathan McDowell <noodles@fb.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Paul Webb <paul.x.webb@oracle.com> Cc: Sohil Mehta <sohil.mehta@intel.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Yifei Liu <yifei.l.liu@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27types: drop definition of __EXPORTED_HEADERS__Thomas Weißschuh1-1/+0
This definition disarms the warning in uapi/linux/types.h about including kernel headers from user space. However the warning is already disarmed due to the fact that kernel code is built with -D__KERNEL__. Drop the pointless definition. Link: https://lkml.kernel.org/r/20251230-exported-headers-types-h-v1-1-947fc606f3d8@linutronix.de Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27resource: provide 0args DEFINE_RES variant for unset resource descChristian Marangi1-1/+6
Provide a variant of DEFINE_RES that takes 0 arguments to initialize an "unset" resource descriptor. This should be used for the improper case of struct resource res = {}; where DEFINE_RES() should be used. With this new helper variant, it would result in: struct resource res = DEFINE_RES(); instead of having to define the full 3 arguments: struct resource res = DEFINE_RES(0, 0, IORESOURCE_UNSET); DEFINE_RES() with no args, will set the flags to IORESOURCE_UNSET signaling the resource descriptor is UNSET and doesn't reflect an actual resource currently. Link: https://lkml.kernel.org/r/20251213115314.16700-1-ansuelsmth@gmail.com Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Suggested-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27ipc/shm: uapi: remove dependency on libcThomas Weißschuh1-3/+0
Using libc types and headers from the UAPI headers is problematic as it introduces a dependency on a full C toolchain. shm.h does not even use any symbols from the libc header as the usage of getpagesize() was removed a decade ago in commit 060028bac94b ("ipc/shm.c: increase the defaults for SHMALL, SHMMAX") Drop the unnecessary inclusion. Link: https://lkml.kernel.org/r/20251222-uapi-shm-v1-1-270bb7f75d97@linutronix.de Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm/zone_device: reinitialize large zone device private foliosMatthew Brost1-3/+6
Reinitialize metadata for large zone device private folios in zone_device_page_init prior to creating a higher-order zone device private folio. This step is necessary when the folio's order changes dynamically between zone_device_page_init calls to avoid building a corrupt folio. As part of the metadata reinitialization, the dev_pagemap must be passed in from the caller because the pgmap stored in the folio page may have been overwritten with a compound head. Without this fix, individual pages could have invalid pgmap fields and flags (with PG_locked being notably problematic) due to prior different order allocations, which can, and will, result in kernel crashes. Link: https://lkml.kernel.org/r/20260116111325.1736137-2-francois.dugast@intel.com Fixes: d245f9b4ab80 ("mm/zone_device: support large zone device private folios") Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Francois Dugast <francois.dugast@intel.com> Acked-by: Felix Kuehling <felix.kuehling@amd.com> Reviewed-by: Balbir Singh <balbirs@nvidia.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: David Airlie <airlied@gmail.com> Cc: Simona Vetter <simona@ffwll.ch> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Lyude Paul <lyude@redhat.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27memfd: export alloc_file()Pratyush Yadav (Google)1-0/+6
Patch series "mm: memfd_luo hotfixes". This series contains a couple of fixes for memfd preservation using LUO. This patch (of 3): The Live Update Orchestrator's (LUO) memfd preservation works by preserving all the folios of a memfd, re-creating an empty memfd on the next boot, and then inserting back the preserved folios. Currently it creates the file by directly calling shmem_file_setup(). This leaves out other work done by alloc_file() like setting up the file mode, flags, or calling the security hooks. Export alloc_file() to let memfd_luo use it. Rename it to memfd_alloc_file() since it is no longer private and thus needs a subsystem prefix. Link: https://lkml.kernel.org/r/20260122151842.4069702-1-pratyush@kernel.org Link: https://lkml.kernel.org/r/20260122151842.4069702-2-pratyush@kernel.org Signed-off-by: Pratyush Yadav (Google) <pratyush@kernel.org> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27mm/kasan: fix KASAN poisoning in vrealloc()Andrey Ryabinin1-0/+14
A KASAN warning can be triggered when vrealloc() changes the requested size to a value that is not aligned to KASAN_GRANULE_SIZE. ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1 at mm/kasan/shadow.c:174 kasan_unpoison+0x40/0x48 ... pc : kasan_unpoison+0x40/0x48 lr : __kasan_unpoison_vmalloc+0x40/0x68 Call trace: kasan_unpoison+0x40/0x48 (P) vrealloc_node_align_noprof+0x200/0x320 bpf_patch_insn_data+0x90/0x2f0 convert_ctx_accesses+0x8c0/0x1158 bpf_check+0x1488/0x1900 bpf_prog_load+0xd20/0x1258 __sys_bpf+0x96c/0xdf0 __arm64_sys_bpf+0x50/0xa0 invoke_syscall+0x90/0x160 Introduce a dedicated kasan_vrealloc() helper that centralizes KASAN handling for vmalloc reallocations. The helper accounts for KASAN granule alignment when growing or shrinking an allocation and ensures that partial granules are handled correctly. Use this helper from vrealloc_node_align_noprof() to fix poisoning logic. [ryabinin.a.a@gmail.com: move kasan_enabled() check, fix build] Link: https://lkml.kernel.org/r/20260119144509.32767-1-ryabinin.a.a@gmail.com Link: https://lkml.kernel.org/r/20260113191516.31015-1-ryabinin.a.a@gmail.com Fixes: d699440f58ce ("mm: fix vrealloc()'s KASAN poisoning logic") Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Reported-by: Maciej Żenczykowski <maze@google.com> Reported-by: <joonki.min@samsung-slsi.corp-partner.google.com> Closes: https://lkml.kernel.org/r/CANP3RGeuRW53vukDy7WDO3FiVgu34-xVJYkfpm08oLO3odYFrA@mail.gmail.com Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitriy Vyukov <dvyukov@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-27tracing: Add bitmask-list option for human-readable bitmask displayAaron Tomlin3-7/+17
Add support for displaying bitmasks in human-readable list format (e.g., 0,2-5,7) in addition to the default hexadecimal bitmap representation. This is particularly useful when tracing CPU masks and other large bitmasks where individual bit positions are more meaningful than their hexadecimal encoding. When the "bitmask-list" option is enabled, the printk "%*pbl" format specifier is used to render bitmasks as comma-separated ranges, making trace output easier to interpret for complex CPU configurations and large bitmask values. Link: https://patch.msgid.link/20251226160724.2246493-2-atomlin@atomlin.com Signed-off-by: Aaron Tomlin <atomlin@atomlin.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2026-01-26Merge tag 'vfs-6.19-rc8.fixes' of ↵Linus Torvalds1-1/+5
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix the the buggy conversion of fuse_reverse_inval_entry() introduced during the creation rework - Disallow nfs delegation requests for directories by setting simple_nosetlease() - Require an opt-in for getting readdir flag bits outside of S_DT_MASK set in d_type - Fix scheduling delayed writeback work by only scheduling when the dirty time expiry interval is non-zero and cancel the delayed work if the interval is set to zero - Use rounded_jiffies_interval for dirty time work - Check the return value of sb_set_blocksize() for romfs - Wait for batched folios to be stable in __iomap_get_folio() - Use private naming for fuse hash size - Fix the stale dentry cleanup to prevent a race that causes a UAF * tag 'vfs-6.19-rc8.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: vfs: document d_dispose_if_unused() fuse: shrink once after all buckets have been scanned fuse: clean up fuse_dentry_tree_work() fuse: add need_resched() before unlocking bucket fuse: make sure dentry is evicted if stale fuse: fix race when disposing stale dentries fuse: use private naming for fuse hash size writeback: use round_jiffies_relative for dirtytime_work iomap: wait for batched folios to be stable in __iomap_get_folio romfs: check sb_set_blocksize() return value docs: clarify that dirtytime_expire_seconds=0 disables writeback writeback: fix 100% CPU usage when dirtytime_expire_interval is 0 readdir: require opt-in for d_type flags vboxsf: don't allow delegations to be set on directories ceph: don't allow delegations to be set on directories gfs2: don't allow delegations to be set on directories 9p: don't allow delegations to be set on directories smb/client: properly disallow delegations on directories nfs: properly disallow delegation requests on directories fuse: fix conversion of fuse_reverse_inval_entry() to start_removing()
2026-01-26xdrgen: Add enum value validation to generated decodersChuck Lever1-1/+7
XDR enum decoders generated by xdrgen do not verify that incoming values are valid members of the enum. Incoming out-of-range values from malicious or buggy peers propagate through the system unchecked. Add validation logic to generated enum decoders using a switch statement that explicitly lists valid enumerator values. The compiler optimizes this to a simple range check when enum values are dense (contiguous), while correctly rejecting invalid values for sparse enums with gaps in their value ranges. The --no-enum-validation option on the source subcommand disables this validation when not needed. The minimum and maximum fields in _XdrEnum, which were previously unused placeholders for a range-based validation approach, have been removed since the switch-based validation handles both dense and sparse enums correctly. Because the new mechanism results in substantive changes to generated code, existing .x files are regenerated. Unrelated white space and semicolon changes in the generated code are due to recent commit 1c873a2fd110 ("xdrgen: Don't generate unnecessary semicolon") and commit 38c4df91242b ("xdrgen: Address some checkpatch whitespace complaints"). Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26xdrgen: Initialize data pointer for zero-length itemsChuck Lever1-12/+8
The xdrgen decoders for strings and opaque data had an optimization that skipped calling xdr_inline_decode() when the item length was zero. This left the data pointer uninitialized, which could lead to unpredictable behavior when callers access it. Remove the zero-length check and always call xdr_inline_decode(). When passed a length of zero, xdr_inline_decode() returns the current buffer position, which is valid and matches the behavior of hand-coded XDR decoders throughout the kernel. Fixes: 4b132aacb076 ("tools: Add xdrgen") Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26xdrgen: Implement short (16-bit) integer typesChuck Lever1-0/+60
"short" and "unsigned short" types are not defined in RFC 4506, but are supported by the rpcgen program. An upcoming protocol specification includes at least one "unsigned short" field, so xdrgen needs to implement support for these types. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26NFS: NFSERR_INVAL is not defined by NFSv2Chuck Lever1-1/+1
A documenting comment in include/uapi/linux/nfs.h claims incorrectly that NFSv2 defines NFSERR_INVAL. There is no such definition in either RFC 1094 or https://pubs.opengroup.org/onlinepubs/9629799/chap7.htm NFS3ERR_INVAL is introduced in RFC 1813. NFSD returns NFSERR_INVAL for PROC_GETACL, which has no specification (yet). However, nfsd_map_status() maps nfserr_symlink and nfserr_wrong_type to nfserr_inval, which does not align with RFC 1094. This logic was introduced only recently by commit 438f81e0e92a ("nfsd: move error choice for incorrect object types to version-specific code."). Given that we have no INVAL or SERVERFAULT status in NFSv2, probably the only choice is NFSERR_IO. Fixes: 438f81e0e92a ("nfsd: move error choice for incorrect object types to version-specific code.") Reviewed-by: NeilBrown <neil@brown.name> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2026-01-26Merge tag 'qcom-arm64-for-6.20' of ↵Arnd Bergmann2-0/+4
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into soc/dt Qualcomm Arm64 DeviceTree for v6.20 Introduce the Kaanapali SoC, with the MTP and QRD devices. Introduce support for the Milos SoC (SM7635) and initial support for the Fairphone (Gen 6) device on this platform. Add the QCS6490-based RubikPI3 board, the QRB2210-based Arduino UnoQ, the X Elite-based Medion SPRCHRGD 14 S1 and Surface Pro 11 laptops, and the SDM845-based Pixel 3 and Pixel 3 XL devices. On the Kodiak-based (QCS6490) RB3Gen2 the TC9563 PCIe switch controller is described. On Lemans (SA8775P/QCS9075) the GPU and crypto blocks are added. IO-regions and clocks are added to interconnect nodes to allow QoS configuration. GPU, TPM and USB support are enabled on the evaluation kit (EVK). On Monaco (QCS8300) the two PCIe controllers, the camera subsystem, tsens, display subsystem, crypto, CPUfreq, and coresight are added. On the evaluation kit (EVK) the PCIe busses are enabled, together with an AMC6821-based fan controller and the ST33 TPM chip. On MSM8939 the camera subsystem is described. The Asus ZenFone 2 Laser/Selfie gains battery and hall sensor support. On the Agatti-based RB1 board PM8008 is described and an overlay for the Vision mezzanine is introduced. On SDM630 the compute DSP remoteproc, FastRPC and related entites are described. The LPASS LPI pinctrl node is described. On SDM845-based OnePlus device the bootloader framebuffer and its resources are described, to improve the transition. On the SDM845-based devices from OnePlus, SHIFT, and Xiaomi ath10k calibration variants are specified. The sensor remoteproc is enabled on Xiaomi Pocophone F1. On SM7225-based Fairphone FP4 regulators for the cameras are described, and the camera EEPROM is added. On SM8650 the camera subsystem is described. On the QRD the Samsung S5KJN1 camera sensor is added, and for the HDK an overlay for the "Rear Camera Card" is added. On SM8750 CPUfreq, SDCHCI and Iris (video encode/decode) support are added, and missing - required - properties for the BAM DMA is added. These are then enabled on the MTP. On Talos (SM6150/QCS615) PMU, DisplayPort, and USB/DP combo PHY are added. DisplayPort is enabled on the Talos Ride board. On Hamoa (X Elite) add crypto engine, missing TCSR reference clocks, and random number generator block. The soc bus address width is corrected to match the hardware. On the Lenovo Thinkpad T14s HDMI and audio playback over DisplayPort is introduced. HDMI, Iris (video encode/decode) and PS8830 retimers are described for the ASUS Vivobook S 15. On the Hamoa evaluation kit (EVK) PCIe busses, WiFi, backlight, TPM and RG (red/green) LEDs are described. Enable QSEECOM, and thereby UEFI variable access, on the Medion SPRCHRGD 14 S1 (commit should have been on drivers branch). * tag 'qcom-arm64-for-6.20' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: (155 commits) dt-bindings: mailbox: qcom: Add IPCC support for Kaanapali and Glymur Platforms dt-bindings: mailbox: qcom: Add CPUCP mailbox controller bindings for Kaanapali arm64: dts: qcom: lemans: enable static TPDM arm64: dts: qcom: kodiak: Add memory region for audiopd arm64: dts: qcom: x1e78100-lenovo-thinkpad-t14s: add HDMI nodes arm64: dts: qcom: x1e: bus is 40-bits (fix 64GB models) arm64: dts: qcom: lemans; Add EL2 overlay arm64: dts: qcom: sm8150: add uart13 arm64: dts: qcom: sdm845-db845c: specify power for WiFi CH1 arm64: dts: qcom: sdm845-db845c: drop CS from SPIO0 arm64: dts: qcom: qrb4210-rb2: Fix UART3 wakeup IRQ storm arm64: dts: qcom: sm6125-ginkgo: Fix missing msm-id subtype arm64: dts: qcom: qcs8300: Add GPU cooling arm64: dts: qcom: sa8775p: Add reg and clocks for QoS configuration arm64: dts: qcom: hamoa-iot-evk: Enable TPM (ST33) on SPI11 arm64: dts: qcom: talos: Add PMU support arm64: dts: qcom: talos: switch to interrupt-cells 4 to add PPI partitions arm64: dts: qcom: ipq9574: Complete USB DWC3 wrapper interrupts arm64: dts: qcom: ipq5018: Correct USB DWC3 wrapper interrupts arm64: dts: qcom: monaco: Add CTCU and ETR nodes ... Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2026-01-26Merge tag 'renesas-dts-for-v6.20-tag2' of ↵Arnd Bergmann2-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel into soc/dt Renesas DTS updates for v6.20 (take two) - Add cpufreq, thermal, GPIO IRQ, and CAN-FD support for the RZ/T2H and RZ/N2H SoCs and their EVK boards, - Add more serial (RSCI) and CAN-FD support for the RZ/V2H and RZ/V2N SoCs, - Drop unused .dtsi files, - Add I3C support for the RZ/G3E SMARC SoM, - Add GPIO support for the RZ/N1 SoC, - Miscellaneous fixes and improvements. * tag 'renesas-dts-for-v6.20-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-devel: (27 commits) arm64: dts: renesas: rzt2h-rzn2h-evk: Reorder ADC nodes ARM: dts: r9a06g032: Add support for GPIO interrupts ARM: dts: r9a06g032: Add GPIO controllers arm64: dts: renesas: rzg3e-smarc-som: Enable I3C support arm64: dts: renesas: Use lowercase hex arm64: dts: renesas: Use hyphens in node names arm/arm64: dts: renesas: Drop unused .dtsi arm64: dts: renesas: rzt2h-n2h-evk-common: Use GPIO for SD0 write protect arm64: dts: renesas: r9a09g057: Add CANFD node arm64: dts: renesas: r9a09g056: Add CANFD node arm64: dts: renesas: r9a09g087m44-rzn2h-evk: Enable CANFD arm64: dts: renesas: r9a09g077m44-rzt2h-evk: Enable CANFD arm64: dts: renesas: r9a09g087: Add CANFD node arm64: dts: renesas: r9a09g077: Add CANFD node arm64: dts: renesas: r9a09g057: Add RSCI nodes arm64: dts: renesas: r9a09g056: Add RSCI nodes arm64: dts: renesas: r9a09g087m44-rzn2h-evk: Add GPIO keys arm64: dts: renesas: r9a09g077m44-rzt2h-evk: Add GPIO keys arm64: dts: renesas: r9a09g087: Add GPIO IRQ support arm64: dts: renesas: r9a09g077: Add GPIO IRQ support ... Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2026-01-26Merge branch 'fixes' of into for-nextIlpo Järvinen1-8/+9