Age | Commit message (Collapse) | Author | Files | Lines |
|
The PGDAT_RECLAIM_LOCKED bit is used to provide mutual exclusion of node
reclaim for struct pglist_data using a single bit.
It is "locked" with a test_and_set_bit (similarly to a try lock) which
provides full ordering with respect to loads and stores done within
__node_reclaim().
It is "unlocked" with clear_bit(), which does not provide any ordering
with respect to loads and stores done before clearing the bit.
The lack of clear_bit() memory ordering with respect to stores within
__node_reclaim() can cause a subsequent CPU to fail to observe stores from
a prior node reclaim. This is not an issue in practice on TSO (e.g.
x86), but it is an issue on weakly-ordered architectures (e.g. arm64).
Fix this by using clear_bit_unlock rather than clear_bit to clear
PGDAT_RECLAIM_LOCKED with a release memory ordering semantic.
This provides stronger memory ordering (release rather than relaxed).
Link: https://lkml.kernel.org/r/20250312141014.129725-1-mathieu.desnoyers@efficios.com
Fixes: d773ed6b856a ("mm: test and set zone reclaim lock before starting reclaim")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jade Alglave <j.alglave@ucl.ac.uk>
Cc: Luc Maranget <luc.maranget@inria.fr>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Because madise_should_skip() logic is factored out, making
madvise_do_behavior() calculates 'len' on its own rather then receiving it
as a parameter makes code simpler. Remove the parameter.
Link: https://lkml.kernel.org/r/20250312164750.59215-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam R. Howlett <howlett@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The logic for checking if a given madvise() request for a single memory
range can skip real work, namely madvise_do_behavior(), is duplicated in
do_madvise() and vector_madvise(). Split out the logic to a function and
reuse it.
Link: https://lkml.kernel.org/r/20250312164750.59215-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam R. Howlett <howlett@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
madvise_do_behavior() has a long open-coded 'behavior' check for
MADV_POPULATE_{READ,WRITE}. It adds multiple layers[1] and make the code
arguably take longer time to read. Like is_memory_failure(), split out
the check to a separate function. This is not technically removing the
additional layer but discourage further extending the switch-case. Also
it makes madvise_do_behavior() code shorter and therefore easier to read.
[1] https://lore.kernel.org/bd6d0bf1-c79e-46bd-a810-9791efb9ad73@lucifer.local
Link: https://lkml.kernel.org/r/20250312164750.59215-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam R. Howlett <howlett@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/madvise: cleanup requests validations and classifications".
Cleanup madvise entry level code for cleaner request validations and
classifications.
This patch (of 4):
To reduce redundant open-coded checks of CONFIG_MEMORY_FAILURE and
MADV_{HWPOISON,SOFT_OFFLINE} in madvise_[un]lock(), is_memory_failure() is
introduced. madvise_do_behavior() is still doing the same open-coded
check, though. Use is_memory_failure() instead.
To avoid build failure on !CONFIG_MEMORY_FAILURE case, implement an empty
madvise_inject_error() under the config. Also move the definition of
is_memory_failure() inside #ifdef CONFIG_MEMORY_FAILURE clause for
madvise_inject_error() definition, to reduce duplicated ifdef clauses.
Link: https://lkml.kernel.org/r/20250312164750.59215-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250312164750.59215-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: David Hildenbrand <david@redhat.com>
Cc: Liam R. Howlett <howlett@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This commit introduces a new trace event,
`mm_calculate_totalreserve_pages`, which reports the new reserve value at
the exact time when it takes effect.
The `totalreserve_pages` value represents the total amount of memory
reserved across all zones and nodes in the system. This reserved memory
is crucial for ensuring that critical kernel operations have access to
sufficient memory, even under memory pressure.
By tracing the `totalreserve_pages` value, developers can gain insights
that how the total reserved memory changes over time.
Link: https://lkml.kernel.org/r/20250308034606.2036033-4-liumartin@google.com
Signed-off-by: Martin Liu <liumartin@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This commit introduces the `mm_setup_per_zone_lowmem_reserve` trace
event,which provides detailed insights into the kernel's per-zone lowmem
reserve configuration.
The trace event provides precise timestamps, allowing developers to
1. Correlate lowmem reserve changes with specific kernel events and
able to diagnose unexpected kswapd or direct reclaim behavior triggered
by dynamic changes in lowmem reserve.
2. Know memory allocation failures that occur due to insufficient
lowmem reserve, by precisely correlating allocation attempts with
reserve adjustments.
Link: https://lkml.kernel.org/r/20250308034606.2036033-3-liumartin@google.com
Signed-off-by: Martin Liu <liumartin@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Add tracepoints for lowmem reserves, watermarks and
totalreserve_pages", v2.
This patchset introduces tracepoints to track changes in the lowmem
reserves, watermarks and totalreserve_pages. This helps to track
the exact timing of such changes and understand their relation to
reclaim activities.
The tracepoints added are:
mm_setup_per_zone_lowmem_reserve
mm_setup_per_zone_wmarks
mm_calculate_totalreserve_pagesi
This patch (of 3):
This commit introduces the `mm_setup_per_zone_wmarks` trace event,
which provides detailed insights into the kernel's per-zone watermark
configuration, offering precise timing and the ability to correlate
watermark changes with specific kernel events.
While `/proc/zoneinfo` provides some information about zone watermarks,
this trace event offers:
1. The ability to link watermark changes to specific kernel events and
logic.
2. The ability to capture rapid or short-lived changes in watermarks
that may be missed by user-space polling
3. Diagnosing unexpected kswapd activity or excessive direct reclaim
triggered by rapidly changing watermarks.
Link: https://lkml.kernel.org/r/20250308034606.2036033-1-liumartin@google.com
Link: https://lkml.kernel.org/r/20250308034606.2036033-2-liumartin@google.com
Signed-off-by: Martin Liu <liumartin@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Martin Liu <liumartin@google.com>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add missing parenthesis in @name parameter description.
Link: https://lkml.kernel.org/r/20250310112535.84754-1-enrico.bravi@polito.it
Signed-off-by: Enrico Bravi <enrico.bravi@polito.it>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It is best practice for all pte accesses to go via the arch helpers, to
ensure non-torn values and to allow the arch to intervene where needed
(contpte for arm64 for example). While in this case it was probably safe
to directly dereference, let's tidy it up for consistency.
Link: https://lkml.kernel.org/r/20250310140418.1737409-1-ryan.roberts@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Replace @blocks with @memory_blocks to match with the definition of struct
memory_group.
Link: https://lkml.kernel.org/r/20250311233045.148943-3-gshan@redhat.com
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "drivers/base/memory: Two cleanups", v3.
Two cleanups to drivers/base/memory.
This patch (of 2)L
It's unnecessary to count the present sections for the specified block
since the block will be added if any section in the block is present.
Besides, for_each_present_section_nr() can be reused as Andrew Morton
suggested.
Improve by using for_each_present_section_nr() and dropping the
unnecessary @section_count.
No functional changes intended.
Link: https://lkml.kernel.org/r/20250311233045.148943-1-gshan@redhat.com
Link: https://lkml.kernel.org/r/20250311233045.148943-2-gshan@redhat.com
Signed-off-by: Gavin Shan <gshan@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_sysfs_access_pattern_add_range_dir()
When -Wformat-security is given, compiler warns as a potential security
issue on damon_sysfs_access_pattern_add_range_dir() as below:
mm/damon/sysfs-schemes.c: In function `damon_sysfs_access_pattern_add_range_dir':
mm/damon/sysfs-schemes.c:1503:25: warning: format not a string literal and no format arguments [-Wformat-security]
1503 | &access_pattern->kobj, name);
| ^
Fix it by using "%s" as the format and the name as the argument.
Link: https://lkml.kernel.org/r/20250310165009.652491-1-sj@kernel.org
Fixes: 7e84b1f8212a ("mm/damon/sysfs: support DAMON-based Operation Schemes")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
During shmem_split_large_entry(), large swap entries are covering n slots
and an order-0 folio needs to be inserted.
Instead of splitting all n slots, only the 1 slot covered by the folio
need to be split and the remaining n-1 shadow entries can be retained with
orders ranging from 0 to n-1. This method only requires
(n/XA_CHUNK_SHIFT) new xa_nodes instead of (n % XA_CHUNK_SHIFT) *
(n/XA_CHUNK_SHIFT) new xa_nodes, compared to the original
xas_split_alloc() + xas_split() one.
For example, to split an order-9 large swap entry (assuming XA_CHUNK_SHIFT
is 6), 1 xa_node is needed instead of 8.
xas_try_split_min_order() is used to reduce the number of calls to
xas_try_split() during split.
Link: https://lkml.kernel.org/r/20250314222113.711703-3-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Mattew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Minimize xa_node allocation during xarry split", v3.
When splitting a multi-index entry in XArray from order-n to order-m,
existing xas_split_alloc()+xas_split() approach requires 2^(n %
XA_CHUNK_SHIFT) xa_node allocations. But its callers,
__filemap_add_folio() and shmem_split_large_entry(), use at most 1
xa_node. To minimize xa_node allocation and remove the limitation of no
split from order-12 (or above) to order-0 (or anything between 0 and
5)[1], xas_try_split() was added[2], which allocates (n / XA_CHUNK_SHIFT -
m / XA_CHUNK_SHIFT) xa_node. It is used for non-uniform folio split, but
can be used by __filemap_add_folio() and shmem_split_large_entry().
xas_split_alloc() and xas_split() split an order-9 to order-0:
---------------------------------
| | | | | | | | |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| | | | | | | | |
---------------------------------
| | | |
------- --- --- -------
| | ... | |
V V V V
----------- ----------- ----------- -----------
| xa_node | | xa_node | ... | xa_node | | xa_node |
----------- ----------- ----------- -----------
xas_try_split() splits an order-9 to order-0:
---------------------------------
| | | | | | | | |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| | | | | | | | |
---------------------------------
|
|
V
-----------
| xa_node |
-----------
xas_try_split() is designed to be called iteratively with n = m + 1.
xas_try_split_mini_order() is added to minmize the number of calls to
xas_try_split() by telling the caller the next minimal order to split to
instead of n - 1. Splitting order-n to order-m when m= l * XA_CHUNK_SHIFT
does not require xa_node allocation and requires 1 xa_node when n=l *
XA_CHUNK_SHIFT and m = n - 1, so it is OK to use xas_try_split() with n >
m + 1 when no new xa_node is needed.
xfstests quick group test passed on xfs and tmpfs.
[1] https://lore.kernel.org/linux-mm/Z6YX3RznGLUD07Ao@casper.infradead.org/
[2] https://lore.kernel.org/linux-mm/20250226210032.2044041-1-ziy@nvidia.com/
This patch (of 2):
During __filemap_add_folio(), a shadow entry is covering n slots and a
folio covers m slots with m < n is to be added. Instead of splitting all
n slots, only the m slots covered by the folio need to be split and the
remaining n-m shadow entries can be retained with orders ranging from m to
n-1. This method only requires
(n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT)
new xa_nodes instead of
(n % XA_CHUNK_SHIFT) * ((n/XA_CHUNK_SHIFT) - (m/XA_CHUNK_SHIFT))
new xa_nodes, compared to the original xas_split_alloc() + xas_split()
one. For example, to insert an order-0 folio when an order-9 shadow entry
is present (assuming XA_CHUNK_SHIFT is 6), 1 xa_node is needed instead of
8.
xas_try_split_min_order() is introduced to reduce the number of calls to
xas_try_split() during split.
Link: https://lkml.kernel.org/r/20250314222113.711703-1-ziy@nvidia.com
Link: https://lkml.kernel.org/r/20250314222113.711703-2-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mattew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It splits page cache folios to orders from 0 to 8 at different in-folio
offset.
Link: https://lkml.kernel.org/r/20250307174001.242794-9-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Instead of splitting the large folio uniformly during truncation, try to
use buddy allocator like folio_split() at the start and the end of a
truncation range to minimize the number of resulting folios if it is
supported. try_folio_split() is introduced to use folio_split() if
supported and it falls back to uniform split otherwise.
For example, to truncate a order-4 folio
[0, 1, 2, 3, 4, 5, ..., 15]
between [3, 10] (inclusive), folio_split() splits the folio at 3 to
[0,1], [2], [3], [4..7], [8..15] and [3], [4..7] can be dropped and
[8..15] is kept with zeros in [8..10], then another folio_split() is
done at 10, so [8..10] can be dropped.
One possible optimization is to make folio_split() to split a folio based
on a given range, like [3..10] above. But that complicates folio_split(),
so it will be investigated when necessary.
Link: https://lkml.kernel.org/r/20250226210032.2044041-8-ziy@nvidia.com
Link: https://lkml.kernel.org/r/20250307174001.242794-8-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This allows to test folio_split() by specifying an additional in folio
page offset parameter to split_huge_page debugfs interface.
Link: https://lkml.kernel.org/r/20250307174001.242794-7-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Now split_huge_page_to_list_to_order() uses the new backend split code in
__split_unmapped_folio(), the old __split_huge_page() and
__split_huge_page_tail() can be removed.
Link: https://lkml.kernel.org/r/20250307174001.242794-6-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
folio_split() splits a large folio in the same way as buddy allocator
splits a large free page for allocation. The purpose is to minimize the
number of folios after the split. For example, if user wants to free the
3rd subpage in a order-9 folio, folio_split() will split the order-9 folio
as:
O-0, O-0, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-8 if it is anon,
since anon folio does not support order-1 yet.
-----------------------------------------------------------------
| | | | | | | | |
|O-0|O-0|O-0|O-0| O-2 |...| O-7 | O-8 |
| | | | | | | | |
-----------------------------------------------------------------
O-1, O-0, O-0, O-2, O-3, O-4, O-5, O-6, O-7, O-9 if it is pagecache
---------------------------------------------------------------
| | | | | | | |
| O-1 |O-0|O-0| O-2 |...| O-7 | O-8 |
| | | | | | | |
---------------------------------------------------------------
It generates fewer folios (i.e., 11 or 10) than existing page split
approach, which splits the order-9 to 512 order-0 folios. It also reduces
the number of new xa_node needed during a pagecache folio split from 8 to
1, potentially decreasing the folio split failure rate due to memory
constraints.
folio_split() and existing split_huge_page_to_list_to_order() share the
folio unmapping and remapping code in __folio_split() and the common
backend split code in __split_unmapped_folio() using uniform_split
variable to distinguish their operations.
uniform_split_supported() and non_uniform_split_supported() are added to
factor out check code and will be used outside __folio_split() in the
following commit.
Link: https://lkml.kernel.org/r/20250307174001.242794-5-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This is a preparation patch for folio_split().
In the upcoming patch folio_split() will share folio unmapping and
remapping code with split_huge_page_to_list_to_order(), so move the code
to a common function __folio_split() first.
Add a TODO for splitting large shmem folio in swap cache.
Link: https://lkml.kernel.org/r/20250307174001.242794-4-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This is a preparation patch, both added functions are not used yet.
The added __split_unmapped_folio() is able to split a folio with its
mapping removed in two manners: 1) uniform split (the existing way), and
2) buddy allocator like (or non-uniform) split.
The added __split_folio_to_order() can split a folio into any lower order.
For uniform split, __split_unmapped_folio() calls it once to split the
given folio to the new order. For buddy allocator like (non-uniform)
split, __split_unmapped_folio() calls it (folio_order - new_order) times
and each time splits the folio containing the given page to one lower
order.
[ziy@nvidia.com: unfreeze head folio after page cache entries are updated]
Link: https://lkml.kernel.org/r/0F15DA7F-1977-412F-9A3E-F06B515D4BD2@nvidia.com
[ziy@nvidia.com: use NULL instead of 0 for folio->private assignment]
Link: https://lkml.kernel.org/r/1E11B9DD-3A87-4C9C-8FB4-E1324FB6A21A@nvidia.com
Link: https://lkml.kernel.org/r/20250307174001.242794-3-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Buddy allocator like (or non-uniform) folio split", v10.
This patchset adds a new buddy allocator like (or non-uniform) large folio
split from a order-n folio to order-m with m < n. It reduces
1. the total number of after-split folios from 2^(n-m) to n-m+1;
2. the amount of memory needed for multi-index xarray split from 2^(n/6-m/6) to
n/6-m/6, assuming XA_CHUNK_SHIFT=6;
3. keep more large folios after a split from all order-m folios to
order-(n-1) to order-m folios.
For example, to split an order-9 to order-0, folio split generates 10 (or
11 for anonymous memory) folios instead of 512, allocates 1 xa_node
instead of 8, and leaves 1 order-8, 1 order-7, ..., 1 order-1 and 2
order-0 folios (or 4 order-0 for anonymous memory) instead of 512 order-0
folios.
Instead of duplicating existing split_huge_page*() code, __folio_split()
is introduced as the shared backend code for both
split_huge_page_to_list_to_order() and folio_split(). __folio_split() can
support both uniform split and buddy allocator like (or non-uniform)
split. All existing split_huge_page*() users can be gradually converted
to use folio_split() if possible. In this patchset, I converted
truncate_inode_partial_folio() to use folio_split().
xfstests quick group passed for both tmpfs and xfs. I also
semi-replicated Hugh's test[12] and ran it without any issue for almost 24
hours.
This patch (of 8):
A preparation patch for non-uniform folio split, which always split a
folio into half iteratively, and minimal xarray entry split.
Currently, xas_split_alloc() and xas_split() always split all slots from a
multi-index entry. They cost the same number of xa_node as the
to-be-split slots. For example, to split an order-9 entry, which takes
2^(9-6)=8 slots, assuming XA_CHUNK_SHIFT is 6 (!CONFIG_BASE_SMALL), 8
xa_node are needed. Instead xas_try_split() is intended to be used
iteratively to split the order-9 entry into 2 order-8 entries, then split
one order-8 entry, based on the given index, to 2 order-7 entries, ...,
and split one order-1 entry to 2 order-0 entries. When splitting the
order-6 entry and a new xa_node is needed, xas_try_split() will try to
allocate one if possible. As a result, xas_try_split() would only need 1
xa_node instead of 8.
When a new xa_node is needed during the split, xas_try_split() can try to
allocate one but no more. -ENOMEM will be return if a node cannot be
allocated. -EINVAL will be return if a sibling node is split or cascade
split happens, where two or more new nodes are needed, and these are not
supported by xas_try_split().
xas_split_alloc() and xas_split() split an order-9 to order-0:
---------------------------------
| | | | | | | | |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| | | | | | | | |
---------------------------------
| | | |
------- --- --- -------
| | ... | |
V V V V
----------- ----------- ----------- -----------
| xa_node | | xa_node | ... | xa_node | | xa_node |
----------- ----------- ----------- -----------
xas_try_split() splits an order-9 to order-0:
---------------------------------
| | | | | | | | |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
| | | | | | | | |
---------------------------------
|
|
V
-----------
| xa_node |
-----------
Link: https://lkml.kernel.org/r/20250307174001.242794-1-ziy@nvidia.com
Link: https://lkml.kernel.org/r/20250307174001.242794-2-ziy@nvidia.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kairui Song <kasong@tencent.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Remove a use of folio->page by passing the folio into
adjust_range_hwpoison(). We need to convert to a page eventually, but
that can happen inside adjust_range_hwpoison().
Link: https://lkml.kernel.org/r/20250226163131.3795869-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
pte_page() is more expensive than pte_pfn() (often it's defined as
pfn_to_page(pte_pfn())), so it makes sense to do the conversion to pfn
once (by calling folio_pfn()) rather than convert the pfn to a page each
time.
While this is a very small advantage, the main motivation is removing a
reference to folio->page.
Link: https://lkml.kernel.org/r/20250226163131.3795869-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Fixes: 6769183166b3 ("mm/swap_cgroup: decouple swap cgroup recording and clearing")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
According to the code logic, the first parameter of the sub-function
__get_vm_area_node() should be size instead of real_size.
Then in __get_vm_area_node(), the size will be aligned, so the redundant
alignment operation is deleted.
The use of the real_size variable causes code redundancy, so it is removed
to simplify the code.
The real prefix is generally used to indicate the adjusted value of a
parameter, but according to the code logic, it should indicate the
original value, so it is recommended to rename it to original_align.
Link: https://lkml.kernel.org/r/20250306072131.800499-1-liuye@kylinos.cn
Signed-off-by: Liu Ye <liuye@kylinos.cn>
Reviewed-by: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Christop Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The page_ext_next() function assumes that page extension objects for a
page order allocation always reside in the same memory section, which may
not be true and could lead to crashes. Use the new page_ext iteration API
instead.
Link: https://lkml.kernel.org/r/93c80b040960fa2ebab4a9729073f77a30649862.1741301089.git.luizcap@redhat.com
Fixes: cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios")
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The page_ext_next() function assumes that page extension objects for a
page order allocation always reside in the same memory section, which may
not be true and could lead to crashes. Use the new page_ext iteration API
instead.
Link: https://lkml.kernel.org/r/ca2d53a020fe1cd65c442627ff6c0c40d591cbd8.1741301089.git.luizcap@redhat.com
Fixes: cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios")
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm: page_ext: Introduce new iteration API", v3.
Introduction
============
[ Thanks to David Hildenbrand for identifying the root cause of this
issue and proving guidance on how to fix it. The new API idea, bugs
and misconceptions are all mine though ]
Currently, trying to reserve 1G pages with page_owner=on and sparsemem
causes a crash. The reproducer is very simple:
1. Build the kernel with CONFIG_SPARSEMEM=y and the table extensions
2. Pass 'default_hugepagesz=1 page_owner=on' in the kernel command-line
3. Reserve one 1G page at run-time, this should crash (see patch 1 for
backtrace)
[ A crash with page_table_check is also possible, but harder to trigger ]
Apparently, starting with commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP
for gigantic folios") we now pass the full allocation order to page
extension clients and the page extension implementation assumes that all
PFNs of an allocation range will be stored in the same memory section (which
is not true for 1G pages).
To fix this, this series introduces a new iteration API for page extension
objects. The API checks if the next page extension object can be retrieved
from the current section or if it needs to look up for it in another
section.
Please, find all details in patch 1.
I tested this series on arm64 and x86 by reserving 1G pages at run-time
and doing kernel builds (always with page_owner=on and page_table_check=on).
This patch (of 3):
The page extension implementation assumes that all page extensions of a
given page order are stored in the same memory section. The function
page_ext_next() relies on this assumption by adding an offset to the
current object to return the next adjacent page extension.
This behavior works as expected for flatmem but fails for sparsemem when
using 1G pages. The commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for
gigantic folios") exposes this issue, making it possible for a crash when
using page_owner or page_table_check page extensions.
The problem is that for 1G pages, the page extensions may span memory
section boundaries and be stored in different memory sections. This issue
was not visible before commit cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP
for gigantic folios") because alloc_contig_pages() never passed more than
MAX_PAGE_ORDER to post_alloc_hook(). However, the series introducing
mentioned commit changed this behavior allowing the full 1G page order to
be passed.
Reproducer:
1. Build the kernel with CONFIG_SPARSEMEM=y and table extensions
support
2. Pass 'default_hugepagesz=1 page_owner=on' in the kernel command-line
3. Reserve one 1G page at run-time, this should crash (backtrace below)
To address this issue, this commit introduces a new API for iterating
through page extensions. The main iteration macro is for_each_page_ext()
and it must be called with the RCU read lock taken. Here's an usage
example:
"""
struct page_ext_iter iter;
struct page_ext *page_ext;
...
rcu_read_lock();
for_each_page_ext(page, 1 << order, page_ext, iter) {
struct my_page_ext *obj = get_my_page_ext_obj(page_ext);
...
}
rcu_read_unlock();
"""
The loop construct uses page_ext_iter_next() which checks to see if we
have crossed sections in the iteration. In this case,
page_ext_iter_next() retrieves the next page_ext object from another
section.
Thanks to David Hildenbrand for helping identify the root cause and
providing suggestions on how to fix and optmize the solution (final
implementation and bugs are all mine through).
Lastly, here's the backtrace, without kasan you can get random crashes:
[ 76.052526] BUG: KASAN: slab-out-of-bounds in __update_page_owner_handle+0x238/0x298
[ 76.060283] Write of size 4 at addr ffff07ff96240038 by task tee/3598
[ 76.066714]
[ 76.068203] CPU: 88 UID: 0 PID: 3598 Comm: tee Kdump: loaded Not tainted 6.13.0-rep1 #3
[ 76.076202] Hardware name: WIWYNN Mt.Jade Server System B81.030Z1.0007/Mt.Jade Motherboard, BIOS 2.10.20220810 (SCP: 2.10.20220810) 2022/08/10
[ 76.088972] Call trace:
[ 76.091411] show_stack+0x20/0x38 (C)
[ 76.095073] dump_stack_lvl+0x80/0xf8
[ 76.098733] print_address_description.constprop.0+0x88/0x398
[ 76.104476] print_report+0xa8/0x278
[ 76.108041] kasan_report+0xa8/0xf8
[ 76.111520] __asan_report_store4_noabort+0x20/0x30
[ 76.116391] __update_page_owner_handle+0x238/0x298
[ 76.121259] __set_page_owner+0xdc/0x140
[ 76.125173] post_alloc_hook+0x190/0x1d8
[ 76.129090] alloc_contig_range_noprof+0x54c/0x890
[ 76.133874] alloc_contig_pages_noprof+0x35c/0x4a8
[ 76.138656] alloc_gigantic_folio.isra.0+0x2c0/0x368
[ 76.143616] only_alloc_fresh_hugetlb_folio.isra.0+0x24/0x150
[ 76.149353] alloc_pool_huge_folio+0x11c/0x1f8
[ 76.153787] set_max_huge_pages+0x364/0xca8
[ 76.157961] __nr_hugepages_store_common+0xb0/0x1a0
[ 76.162829] nr_hugepages_store+0x108/0x118
[ 76.167003] kobj_attr_store+0x3c/0x70
[ 76.170745] sysfs_kf_write+0xfc/0x188
[ 76.174492] kernfs_fop_write_iter+0x274/0x3e0
[ 76.178927] vfs_write+0x64c/0x8e0
[ 76.182323] ksys_write+0xf8/0x1f0
[ 76.185716] __arm64_sys_write+0x74/0xb0
[ 76.189630] invoke_syscall.constprop.0+0xd8/0x1e0
[ 76.194412] do_el0_svc+0x164/0x1e0
[ 76.197891] el0_svc+0x40/0xe0
[ 76.200939] el0t_64_sync_handler+0x144/0x168
[ 76.205287] el0t_64_sync+0x1ac/0x1b0
Link: https://lkml.kernel.org/r/cover.1741301089.git.luizcap@redhat.com
Link: https://lkml.kernel.org/r/a45893880b7e1601082d39d2c5c8b50bcc096305.1741301089.git.luizcap@redhat.com
Fixes: cf54f310d0d3 ("mm/hugetlb: use __GFP_COMP for gigantic folios")
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Luiz Capitulino <luizcap@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
It is the responsibility of the caller to check pmd_none(); in any case,
we are not achieving anything by returning since there is no return value
to tell the caller that we succeeded or not. So remove this check.
Link: https://lkml.kernel.org/r/20250306144315.21907-1-dev.jain@arm.com
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The operations layer hook was introduced to let operations set do any
aggregation data reset if needed. But it is not really be used now.
Remove it.
Link: https://lkml.kernel.org/r/20250306175908.66300-14-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The hook was introduced to let DAMON kernel API users access DAMOS
schemes-eligible regions in a safe way. Now it is no more used by anyone,
and the functionality is provided in a better way by damos_walk(). Remove
it.
Link: https://lkml.kernel.org/r/20250306175908.66300-13-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The callback was used by DAMON sysfs interface for reading DAMON internal
data. But it is no more being used, and damon_call() can do similar works
in a better way. Remove it.
Link: https://lkml.kernel.org/r/20250306175908.66300-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The function pointer field was added to be used as a place to do some
initialization works just before DAMON starts working. However, nobody is
using it now. Remove it.
Link: https://lkml.kernel.org/r/20250306175908.66300-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The field was added to let users keep their personal data to use inside of
the callbacks. However, no one is actively using that now. Remove it.
Link: https://lkml.kernel.org/r/20250306175908.66300-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_sysfs_schemes_clear_regions()
The comment on damon_sysfs_schemes_clear_regions() function is obsolete,
since it has updated to directly called from DAMON sysfs interface code.
Remove the outdated comment.
Link: https://lkml.kernel.org/r/20250306175908.66300-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_sysfs_cmd_request is DAMON sysfs interface's own synchronization
mechanism for accessing DAMON internal data via damon_callback hooks. All
the users are now migrated to damon_call() and damos_walk(), so nobody
really uses it. No one writes to the data structure but reading code is
still remained. Remove the reading code and the entire data structure.
Link: https://lkml.kernel.org/r/20250306175908.66300-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_sysfs_cmd_request_callback() is the damon_callback hook functions
that were used to handle user requests that need to read and/or write
DAMON internal data. All the usages are now updated to use damon_call()
or damos_walk(), though. Remove it and its callers.
Link: https://lkml.kernel.org/r/20250306175908.66300-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
damon_sysfs_handle_cmd()
damon_sysfs_handle_cmd() handles user requests that it can directly handle
on its own. For requests that need to be handled from damon_callback
hooks, it uses DAMON sysfs interface's own synchronous damon_callback
hooks management mechanism, namely damon_sysfs_cmd_request. Now all user
requests are handled without damon_callback hooks, so
damon_sysfs_cmd_request client code in damon_sysfs_andle_cmd() does
nothing in real. Remove the unnecessary code.
Link: https://lkml.kernel.org/r/20250306175908.66300-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
DAMON sysfs interface is using damon_callback->after_aggregation hook with
its self-implemented synchronization mechanism for the hook. It is
inefficient, complicated, and take up to one aggregation interval to
complete, which can be long on some configs.
Use damon_call() instead. It provides a synchronization mechanism that
built inside DAMON's core layer, so more efficient than DAMON sysfs
interface's own one. Also it isolates the implementation inside the core
layer, and hence it makes the code easier to maintain. Finally, it takes
up to one sampling interval, which is much shorter than the aggregation
interval in common setups.
Link: https://lkml.kernel.org/r/20250306175908.66300-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently all DAMON kernel API callers do online DAMON parameters commit
from damon_callback->after_aggregation because only those are safe place
to call the DAMON monitoring attributes update function, namely
damon_set_attrs().
Because damon_callback hooks provide no synchronization, the callers work
in asynchronous ways or implement their own inefficient and complicated
synchronization mechanisms. It also means online DAMON parameters commit
can take up to one aggregation interval. On large systems having long
aggregation intervals, that can be too slow. The synchronization can be
done in more efficient and simple way while removing the latency
constraint if it can be done using damon_call().
The fact that damon_call() can be executed in the middle of the
aggregation makes damon_set_attrs() unsafe to be called from it, though.
Two real problems can occur in the case. First, converting the not yet
completely aggregated nr_accesses for new user-set intervals can arguably
degrade the accuracy or at least make the logic complicated. Second,
kdamond_reset_aggregated() will not be called after the monitoring results
update, so next aggregation starts from unclean state. This can result in
inconsistent and unexpected nr_accesses_bp.
Make it safe as follows. Catch the middle-of-the-aggregation case from
damon_set_attrs() by checking the passed_sample_intervals and
next_aggregationsis of the context. And pass the information to
nr_accesses conversion logic. The logic works as before if it is not the
case (called after the current aggregation is completed). If it is the
case (committing parameters in the middle of the aggregation), it drops
the nr_accesses information that so far aggregated, and make the status
same to the beginning of this aggregation, but as if the last aggregation
was started with the updated sampling/aggregation intervals.
The middle-of-aggregastion check introduce yet another edge case, though.
This happens because kdamond_tune_intervals() can also call
damon_set_attrs() with the middle-of-aggregation check. Consider
damon_call() for parameters commit and kdamond_tune_intervals() are called
in same iteration of kdamond main loop. Because kdamond_tune_interval()
is called for aggregation intervals, it should be the end of the
aggregation. The first damon_set_attrs() call from kdamond_call()
understands it is the end of the aggregation and correctly handle it.
But, because the damon_set_attrs() updated next_aggregation_sis of the
context. Hence, the second damon_set_attrs() invocation from
kdamond_tune_interval() believes it is called in the middle of the
aggregation. It therefore resets aggregated information so far. After
that, kdamond_reset_interval() is called and double-reset the aggregated
information. Avoid this case, too, by setting the next_aggregation_sis
before kdamond_tune_intervals() is invoked.
Link: https://lkml.kernel.org/r/20250306175908.66300-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
kdamond_call() callers may iterate the regions, so better to call it when
the number of regions is as small as possible. It is when
kdamond_merge_regions() is finished. Invoke it on the point.
This change is also aimed to make future changes for carrying online
parameters commit with damon_call() easier. The commit operation should
be able to make sequence between other aggregation interval based
operations including regioins merging and aggregation reset. Placing
damon_call() invocation after the regions merging makes the sequence
handling simpler.
Link: https://lkml.kernel.org/r/20250306175908.66300-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "mm/damon/sysfs: commit parameters online via damon_call()".
Due to the lack of ways to synchronously access DAMON internal data, DAMON
sysfs interface is using damon_callback hooks with its own synchronization
mechanism. The mechanism is built on top of damon_callback hooks in an
ineifficient and complicated way.
Patch series "mm/damon: replace most damon_callback usages in sysfs with
new core functions", which starts with commit e035320fd38e
("mm/damon/sysfs-schemes: remove unnecessary schemes existence check in
damon_sysfs_schemes_clear_regions()") introduced two new DAMON kernel API
functions that providing the synchronous access, replaced most
damon_callback hooks usage in DAMON sysfs interface, and cleaned up
unnecessary code.
Continue the replacement and cleanup works. Update the last DAMON sysfs'
usage of its own synchronization mechanism, namely online DAMON parameters
commit, to use damon_call() instead of the damon_callback hooks and the
hard-to-maintain core-external synchronization mechanism. Then remove the
no more be used code due to the change, and more unused code that just not
yet cleaned up.
The first four patches (patches 1-4) of this series makes DAMON sysfs
interface's online parameters commit to use damon_call(). Then, following
three patches (patches 5-7) remove the DAMON sysfs interface's own
synchronization mechanism and its usages, which is no more be used by
anyone due to the first four patches. Finally, six patches (8-13) do more
cleanup of outdated comment and unused code.
This patch (of 13):
Online DAMON parameters commit via DAMON sysfs interface can make kdamond
stop. This behavior was made because it can make the implementation
simpler. The implementation tries committing the parameter without
validation. If it finds something wrong in the middle of the parameters
update, it returns error without reverting the partially committed
parameters back. It is safe though, since it immediately breaks kdamond
main loop in the case of the error return.
Users can make the wrong parameters by mistake, though. Stopping kdamond
in the case is not very useful behavior. Also this makes it difficult to
utilize damon_call() instead of damon_callback hook for online parameters
update, since damon_call() cannot immediately break kdamond main loop in
the middle.
Validate the input parameters and return error when it fails before
starting parameters updates. In case of mistakenly wrong parameters,
kdamond can continue running with the old and valid parameters.
Link: https://lkml.kernel.org/r/20250306175908.66300-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20250306175908.66300-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The point where the memory is released from memblock to the buddy
allocator is hidden inside arch-specific mem_init()s and the call to
memblock_free_all() is needlessly duplicated in every artiste cure and
after introduction of arch_mm_preinit() hook, mem_init() implementation on
many architecture only contains the call to memblock_free_all().
Pull memblock_free_all() call into mm_core_init() and drop mem_init() on
relevant architectures to make it more explicit where the free memory is
released from memblock to the buddy allocator and to reduce code
duplication in architecture specific code.
Link: https://lkml.kernel.org/r/20250313135003.836600-14-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Currently, implementation of mem_init() in every architecture consists of
one or more of the following:
* initializations that must run before page allocator is active, for
instance swiotlb_init()
* a call to memblock_free_all() to release all the memory to the buddy
allocator
* initializations that must run after page allocator is ready and there is
no arch-specific hook other than mem_init() for that, like for example
register_page_bootmem_info() in x86 and sparc64 or simple setting of
mem_init_done = 1 in several architectures
* a bunch of semi-related stuff that apparently had no better place to
live, for example a ton of BUILD_BUG_ON()s in parisc.
Introduce arch_mm_preinit() that will be the first thing called from
mm_core_init(). On architectures that have initializations that must happen
before the page allocator is ready, move those into arch_mm_preinit() along
with the code that does not depend on ordering with page allocator setup.
On several architectures this results in reduction of mem_init() to a
single call to memblock_free_all() that allows its consolidation next.
Link: https://lkml.kernel.org/r/20250313135003.836600-13-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
All architectures that support HIGHMEM have their code that frees high
memory pages to the buddy allocator while __free_memory_core() is limited
to freeing only low memory.
There is no actual reason for that. The memory map is completely ready by
the time memblock_free_all() is called and high pages can be released to
the buddy allocator along with low memory.
Remove low memory limit from __free_memory_core() and drop per-architecture
code that frees high memory pages.
Link: https://lkml.kernel.org/r/20250313135003.836600-12-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
high_memory defines upper bound on the directly mapped memory. This bound
is defined by the beginning of ZONE_HIGHMEM when a system has high memory
and by the end of memory otherwise.
All this is known to generic memory management initialization code that
can set high_memory while initializing core mm structures.
Add a generic calculation of high_memory to free_area_init() and remove
per-architecture calculation except for the architectures that set and use
high_memory earlier than that.
Link: https://lkml.kernel.org/r/20250313135003.836600-11-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
max_mapnr is essentially the size of the memory map for systems that use
FLATMEM. There is no reason to calculate it in each and every architecture
when it's anyway calculated in alloc_node_mem_map().
Drop setting of max_mapnr from architecture code and set it once in
alloc_node_mem_map().
While on it, move definition of mem_map and max_mapnr to mm/mm_init.c so
there won't be two copies for MMU and !MMU variants.
Link: https://lkml.kernel.org/r/20250313135003.836600-10-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com> [x86]
Tested-by: Mark Brown <broonie@kernel.org>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
This will help with pulling out memblock_free_all() to the generic
code and reducing code duplication in arch::mem_init().
Link: https://lkml.kernel.org/r/20250313135003.836600-9-rppt@kernel.org
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren (csky) <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russel King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|