kernel/linux.git/mm/page_alloc.c, branch v7.2-rc1

mm/page_alloc: only update NUMA min ratios on sysctl write

2026-06-21T18:31:29+00:00

The sysctl handlers for min_unmapped_ratio and min_slab_ratio invoke setup_min_unmapped_ratio() and setup_min_slab_ratio() unconditionally after proc_dointvec_minmax(), even for read operations. These setup functions first zero all per-NUMA node thresholds (min_unmapped_pages and min_slab_pages) before recalculating them. Reading /proc sysctl entries therefore temporarily resets node reclaim thresholds to zero, which may disturb the behavior of __node_reclaim() and node_reclaim() during the recomputation.

Fix this by only calling the setup functions when the sysctl is actually written (write == 1), matching the behavior of existing sysctl handlers like min_free_kbytes and watermark_scale_factor.

This only affects systems with CONFIG_NUMA.

Link: https://lore.kernel.org/tencent_5891052AF9A4C2D490A62F478D446F74AB09@qq.com
Signed-off-by: Jianlin Shi
Cc: Brendan Jackman
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Cc: Zi Yan
Signed-off-by: Andrew Morton
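
The write-only recomputation described in this commit can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the kernel code: the `_model` function names, the two-node layout, the `zeroing_passes` counter and the percentage arithmetic are all hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 2	/* hypothetical two-node system */

/* Hypothetical stand-ins for per-node state; not the kernel's structures. */
static unsigned long node_pages[NR_NODES] = { 1000, 2000 };
static unsigned long min_unmapped_pages[NR_NODES];
static int sysctl_min_unmapped_ratio = 1;	/* percent */
static int zeroing_passes;			/* how often thresholds were reset */

/* Models the setup step: zero every threshold, then recompute it. */
static void setup_min_unmapped_ratio_model(void)
{
	assert(sysctl_min_unmapped_ratio >= 0 && sysctl_min_unmapped_ratio <= 100);

	for (int nid = 0; nid < NR_NODES; nid++)
		min_unmapped_pages[nid] = 0;
	zeroing_passes++;

	for (int nid = 0; nid < NR_NODES; nid++)
		min_unmapped_pages[nid] =
			node_pages[nid] * (unsigned long)sysctl_min_unmapped_ratio / 100;
}

/* Models the handler: parse and validate first, then recompute on write only. */
static int min_unmapped_ratio_handler_model(bool write, int new_value)
{
	if (write) {
		if (new_value < 0 || new_value > 100)
			return -1;	/* rejected write: nothing changes */
		sysctl_min_unmapped_ratio = new_value;
	}

	/* The point of the fix: a read must not touch the thresholds. */
	if (write)
		setup_min_unmapped_ratio_model();

	return 0;
}
```

In this model a read leaves `zeroing_passes` and the per-node thresholds untouched, while a successful write resets and recomputes them exactly once.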

mm/alloc_tag: replace fixed-size early PFN array with dynamic linked list

2026-06-21T18:31:28+00:00

Pages allocated before page_ext is available have their codetag left uninitialized. Track these early PFNs and clear their codetag in clear_early_alloc_pfn_tag_refs() to avoid "alloc_tag was not set" warnings when they are freed later.

Currently a fixed-size array of 8192 entries is used, with a warning if the limit is exceeded. However, the number of early allocations depends on the number of CPUs and can be larger than 8192.

Replace the fixed-size array with a dynamically allocated linked list of pfn_pool structs. Each node is allocated via alloc_page() and mapped to a pfn_pool containing a next pointer, an atomic slot counter, and a PFN array that fills the remainder of the page.

The tracking pages themselves are allocated via alloc_page(), which would trigger __pgalloc_tag_add() -> alloc_tag_add_early_pfn() and recurse indefinitely. Introduce __GFP_NO_CODETAG (reuses the %__GFP_NO_OBJ_EXT bit) and pass gfp_flags through pgalloc_tag_add() so that the early path can skip recording allocations that carry this flag.

Link: https://lore.kernel.org/20260604024008.46592-1-hao.ge@linux.dev
Signed-off-by: Hao Ge
Suggested-by: Suren Baghdasaryan
Acked-by: Suren Baghdasaryan
Cc: Kent Overstreet
Signed-off-by: Andrew Morton
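
The page-sized pool node and the recursion guard described in this commit can be sketched in user space roughly as follows. This is a single-threaded illustration only: `MODEL_PAGE_SIZE`, `MODEL_GFP_NO_CODETAG`, `model_alloc_page()`, `record_early_pfn()` and `drain_early_pfns()` are hypothetical names, `calloc()` stands in for alloc_page(), and the real patch has to handle concurrent callers, which this sketch does not.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096UL
#define MODEL_GFP_NO_CODETAG 0x1u	/* hypothetical bit value */

/* One tracking node occupies exactly one model page. */
struct pfn_pool {
	struct pfn_pool *next;
	atomic_int count;		/* number of slots handed out */
	unsigned long pfns[];		/* fills the remainder of the page */
};

#define PFNS_PER_POOL \
	((MODEL_PAGE_SIZE - offsetof(struct pfn_pool, pfns)) / sizeof(unsigned long))

static struct pfn_pool *pool_head;

static void *model_alloc_page(unsigned int gfp);

/* Record one early PFN, growing the list by one page when the head is full. */
static int record_early_pfn(unsigned long pfn, unsigned int gfp)
{
	struct pfn_pool *pool = pool_head;

	/* Tracking pages carry the flag and are not tracked themselves. */
	if (gfp & MODEL_GFP_NO_CODETAG)
		return 0;

	if (!pool || atomic_load(&pool->count) >= (int)PFNS_PER_POOL) {
		pool = model_alloc_page(MODEL_GFP_NO_CODETAG);
		if (!pool)
			return -1;
		pool->next = pool_head;
		atomic_init(&pool->count, 0);
		pool_head = pool;
	}

	int slot = atomic_fetch_add(&pool->count, 1);
	assert(slot >= 0 && (size_t)slot < PFNS_PER_POOL);
	pool->pfns[slot] = pfn;
	return 0;
}

/* Stand-in for alloc_page(): every allocation passes through the recorder. */
static void *model_alloc_page(unsigned int gfp)
{
	void *page = calloc(1, MODEL_PAGE_SIZE);

	if (page)
		record_early_pfn((unsigned long)((uintptr_t)page / MODEL_PAGE_SIZE), gfp);
	return page;
}

/* Walk and free the list, returning how many PFNs were recorded. */
static size_t drain_early_pfns(void)
{
	size_t seen = 0;

	while (pool_head) {
		struct pfn_pool *pool = pool_head;

		seen += (size_t)atomic_load(&pool->count);
		pool_head = pool->next;
		free(pool);
	}
	return seen;
}
```

Without the flag check at the top of `record_early_pfn()`, allocating a tracking page would call back into the recorder, which would need another tracking page, and so on. The flag breaks that cycle at the first nested call.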

mm/page_alloc: fix deferred compaction accounting

2026-06-09T01:21:26+00:00

COMPACT_DEFERRED means compaction did not start because past failures caused the zone to be deferred. try_to_compact_pages() returns the maximum result seen while walking the zonelist, so a final COMPACT_DEFERRED result means no later zone reported that compaction actually ran.

__alloc_pages_direct_compact() skips COMPACTSTALL and COMPACTFAIL accounting when try_to_compact_pages() returns COMPACT_SKIPPED, but not when it returns COMPACT_DEFERRED. A deferred-only direct compaction attempt can therefore look like a stall, and then a failure if the allocation still cannot be satisfied.

Treat COMPACT_DEFERRED like COMPACT_SKIPPED in this accounting path. If a later zone runs compaction and returns a result above COMPACT_DEFERRED, or compact_zone_order() reports COMPACT_SUCCESS for a captured page, the final result is not COMPACT_DEFERRED and the existing accounting still runs.

Link: https://lore.kernel.org/tencent_368AF1F3821E46232637BE16D65C45CF3308@qq.com
Fixes: 06dac2f467fe ("mm: compaction: update the COMPACT[STALL|FAIL] events properly")
Signed-off-by: fujunjie
Reviewed-by: Vlastimil Babka (SUSE)
Cc: Brendan Jackman
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Suren Baghdasaryan
Cc: Zi Yan
Signed-off-by: Andrew Morton
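
The accounting rule in this commit can be modelled in a few lines of user-space C. This is a sketch, assuming a trimmed-down result enum that keeps only the ordering the commit message relies on (skipped below deferred below everything that means compaction ran). `struct compact_stats`, `max_zone_result()` and `direct_compact_model()` are hypothetical names, and the per-zone results are passed in as an array instead of being produced by real compaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Trimmed model of the result ordering; the kernel enum has more values. */
enum compact_result {
	COMPACT_SKIPPED,
	COMPACT_DEFERRED,
	COMPACT_COMPLETE,
	COMPACT_SUCCESS,
};

/* Hypothetical counters standing in for the COMPACTSTALL/COMPACTFAIL events. */
struct compact_stats {
	unsigned long stall;
	unsigned long fail;
};

/* Models the zonelist walk: the final result is the maximum seen. */
static enum compact_result max_zone_result(const enum compact_result *zones,
					   size_t nr)
{
	enum compact_result rc = COMPACT_SKIPPED;

	for (size_t i = 0; i < nr; i++)
		if (zones[i] > rc)
			rc = zones[i];
	return rc;
}

/* Returns true if the allocation succeeded after compaction. */
static bool direct_compact_model(const enum compact_result *zones, size_t nr,
				 bool alloc_succeeds,
				 struct compact_stats *stats)
{
	enum compact_result rc;

	assert(stats != NULL);
	rc = max_zone_result(zones, nr);

	/* Compaction never ran: count neither a stall nor a failure. */
	if (rc == COMPACT_SKIPPED || rc == COMPACT_DEFERRED)
		return false;

	stats->stall++;
	if (alloc_succeeds)
		return true;

	stats->fail++;
	return false;
}
```

A zonelist whose results are all skipped or deferred leaves both counters unchanged, while a single zone that actually ran compaction lifts the maximum above COMPACT_DEFERRED and restores the usual accounting.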

mm/compaction: respect cpusets when checking retry suitability

2026-06-09T01:21:25+00:00

should_compact_retry() handles COMPACT_SKIPPED by asking compaction_zonelist_suitable() whether reclaim can make a later compaction attempt worthwhile. That answer is used for the current allocation, so it should follow the same zone eligibility rules as the allocation itself.

When cpusets are enabled, allocator slowpath decisions are marked with ALLOC_CPUSET. The allocation path, direct compaction and reclaim retry all skip zones rejected by __cpuset_zone_allowed().

compaction_zonelist_suitable() does not apply that filter. It only walks ac->zonelist/ac->nodemask, so it can return true because a zone that is not usable for the current allocation would pass __compaction_suitable().

That does not let the allocation use the disallowed zone. Later allocation and direct compaction paths still apply cpuset filtering. However, it can make should_compact_retry() retry based on memory that this allocation cannot use.

Pass gfp_mask down and apply the same ALLOC_CPUSET check in compaction_zonelist_suitable(). This keeps the retry decision aligned with the zones that the allocation is allowed to use.

A temporary debugfs probe was also used to call the old and new compaction_zonelist_suitable() predicates in the same two-node NUMA guest. The task was restricted to mems=0 while ac->nodemask covered nodes 0-1. After putting pressure on node0, node0 failed __compaction_suitable() for order-10 and node1 passed it, but node1 was rejected by __cpuset_zone_allowed(). In that state the old predicate returned true and the patched predicate returned false.

Link: https://lore.kernel.org/tencent_F59F2BA2CC5779308E10DF54593C736D3E0A@qq.com
Fixes: 435b3894e742 ("mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath()")
Signed-off-by: fujunjie
Reviewed-by: Vlastimil Babka (SUSE)
Cc: Brendan Jackman
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Suren Baghdasaryan
Cc: Zi Yan
Signed-off-by: Andrew Morton
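
The two-node scenario from the debugfs probe can be reproduced with a small user-space model of the predicate. This is a sketch under stated assumptions: `struct model_zone` and `zonelist_suitable_model()` are hypothetical, node sets are plain bitmasks, a boolean field replaces the real __compaction_suitable() check, and a boolean parameter replaces the ALLOC_CPUSET decision that the patch derives from gfp_mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical zone description: owning node and a precomputed verdict. */
struct model_zone {
	int nid;
	bool compaction_suitable;	/* stands in for __compaction_suitable() */
};

/*
 * Returns true if some zone usable by this allocation could make a later
 * compaction attempt worthwhile. With apply_cpuset == false this behaves
 * like the old predicate, which only honoured the nodemask.
 */
static bool zonelist_suitable_model(const struct model_zone *zones, size_t nr,
				    unsigned long nodemask,
				    unsigned long mems_allowed,
				    bool apply_cpuset)
{
	for (size_t i = 0; i < nr; i++) {
		int nid = zones[i].nid;

		assert(nid >= 0 && nid < (int)(sizeof(unsigned long) * 8));

		/* Zone is outside the allocation's nodemask. */
		if (!(nodemask & (1UL << nid)))
			continue;

		/* Zone is rejected by the task's cpuset. */
		if (apply_cpuset && !(mems_allowed & (1UL << nid)))
			continue;

		if (zones[i].compaction_suitable)
			return true;
	}
	return false;
}
```

With node0 unsuitable, node1 suitable, a nodemask covering both nodes and a cpuset allowing only node0, the model returns true without the cpuset filter and false with it, matching the old and patched results reported above.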

mm/page_alloc: remove VM_BUG_ON()s from pindex helpers

2026-06-04T21:45:03+00:00

Vlastimil pointed out that the VM_BUG_ON()s have fallen out of favour, so remove them.

Link: https://lore.kernel.org/20260526-page_alloc-unmapped-prep-v2-1-412f4d486115@google.com
Signed-off-by: Brendan Jackman
Suggested-by: Vlastimil Babka (SUSE)
Link: https://lore.kernel.org/all/4074a816-9e75-45a6-8141-25459bcc106b@kernel.org/
Reviewed-by: Vlastimil Babka (SUSE)
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Suren Baghdasaryan
Cc: Zi Yan
Signed-off-by: Andrew Morton

mm/page_alloc: fix defrag_mode for non-reclaimable allocations

2026-06-04T21:44:59+00:00

When defrag_mode is enabled, ALLOC_NOFRAGMENT is enforced to prevent migratetype fallbacks and keep pageblocks clean. The allocator relies on reclaim and compaction to free pages of the correct type before allowing fallback as a last resort.

However, non-reclaimable allocations such as GFP_ATOMIC cannot invoke direct reclaim or compaction. With defrag_mode=1, these allocations hit the !can_direct_reclaim bailout in __alloc_pages_slowpath() with ALLOC_NOFRAGMENT still set, and fail without ever attempting a fallback. This causes a large number of SLUB allocation failures for skbuff_head_cache under network-heavy workloads, despite free memory being available in other migratetype freelists.

We observed this on a few of the Meta workloads that adopted defrag_mode=1. For the service under load there were 85509 SLUB allocation failure messages in dmesg within 2 hours. All of them were GFP_ATOMIC allocations for skbuff_head_cache, despite free pages being available in other migratetype freelists (~13 GB free). Since this is the networking path, in practice this means dropped packets, failed RPC requests, tail latency spikes and overall service degradation.

Clear ALLOC_NOFRAGMENT and retry for allocations that request kswapd reclaim but cannot do direct reclaim themselves (GFP_ATOMIC). Purely speculative allocations like GFP_TRANSHUGE_LIGHT that don't set __GFP_KSWAPD_RECLAIM are left to fail, since they have reasonable fallbacks and should not cause fragmentation.

Link: https://lore.kernel.org/20260520122228.201550-1-d@ilvokhin.com
Fixes: e3aa7df331bc ("mm: page_alloc: defrag_mode")
Signed-off-by: Dmitry Ilvokhin
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka (SUSE)
Cc: Brendan Jackman
Cc: Michal Hocko
Cc: Suren Baghdasaryan
Cc: Zi Yan
Signed-off-by: Andrew Morton

mm/page_alloc: document that alloc_pages_nolock() uses RCU

2026-06-02T22:22:20+00:00

The allocator interacts with cgroups, which rely on RCU. RCU does not work everywhere, so the "any context" claim is slightly overstated here.

This should already be enforced by objtool: since this function is not marked noinstr, the x86 build should fail if you call it from a place where RCU is not watching. But expecting readers to make that connection for themselves seems a bit cruel (I don't think there is even any documentation of what noinstr means at all, let alone the connection with RCU).

Note this is not claiming that any cgroup code called from the allocator would actually break if this restriction was violated; it could very well be that there's no real way for the allocator to act on a cgroup that can disappear concurrently. But since it's likely nobody has verified this one way or another, better to just be safe and declare that RCU is required. Allocating from an RCU-unsafe context seems a bit crazy anyway.

Link: https://lore.kernel.org/20260519-nolock-rcu-comment-v1-1-4a630c8794e5@google.com
Signed-off-by: Brendan Jackman
Suggested-by: Junaid Shahid
Acked-by: Harry Yoo (Oracle)
Acked-by: Vlastimil Babka (SUSE)
Cc: Alexei Starovoitov
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Suren Baghdasaryan
Cc: Zi Yan
Signed-off-by: Andrew Morton

mm/page_alloc: drop a misleading __always_inline

2026-06-02T22:22:20+00:00

get_pfnblock_migratetype() is called from outside page_alloc.c, so it cannot always be inlined. Remove the annotation to avoid misleading readers.

At least in my minimal config, with GCC, this doesn't change mm/page_alloc.o at all.

Link: https://lore.kernel.org/all/20260517-b4-drop-always-inline-v1-1-97b90930e8b8@google.com/
Signed-off-by: Brendan Jackman
Suggested-by: Vlastimil Babka
Link: https://lore.kernel.org/all/016c8bef-57ef-44ef-bf60-86dbfd368dcd@kernel.org/
Acked-by: Johannes Weiner
Reviewed-by: SeongJae Park
Reviewed-by: Vishal Moola
Reviewed-by: Vlastimil Babka (SUSE)
Cc: Michal Hocko
Cc: Suren Baghdasaryan
Cc: Zi Yan
Signed-off-by: Andrew Morton

mm/page_alloc: remove ifdefs from pindex helpers

2026-06-02T22:22:19+00:00

The ifdefs are not technically needed here: everything used here is always defined. Switching to IS_ENABLED() makes the code a bit less tiresome to read.

Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-4-dacdf5402be8@google.com
Signed-off-by: Brendan Jackman
Reviewed-by: Vlastimil Babka (SUSE)
Cc: Axel Rasmussen
Cc: Barry Song
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Kairui Song
Cc: Len Brown
Cc: Liam R. Howlett
Cc: Lorenzo Stoakes
Cc: Michal Hocko
Cc: Mike Rapoport (Microsoft)
Cc: "Rafael J. Wysocki"
Cc: Shakeel Butt
Cc: Suren Baghdasaryan
Cc: Wei Xu
Cc: Yuanchu Xie
Cc: Zi Yan
Signed-off-by: Andrew Morton

mm: rejig pageblock mask definitions

2026-06-02T22:22:19+00:00

- Add a PAGEBLOCK_ prefix to the names to avoid polluting the "global namespace" too much.

- This new prefix makes MIGRATETYPE_AND_ISO_MASK look pretty long. Well, that global mask only exists for quite a specific purpose, and is quite a weird thing to have a name for anyway. So drop it and take advantage of the newly-defined PAGEBLOCK_ISO_MASK.

Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-3-dacdf5402be8@google.com
Signed-off-by: Brendan Jackman
Reviewed-by: Vlastimil Babka (SUSE)
Cc: Axel Rasmussen
Cc: Barry Song
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Kairui Song
Cc: Len Brown
Cc: Liam R. Howlett
Cc: Lorenzo Stoakes
Cc: Michal Hocko
Cc: Mike Rapoport (Microsoft)
Cc: "Rafael J. Wysocki"
Cc: Shakeel Butt
Cc: Suren Baghdasaryan
Cc: Wei Xu
Cc: Yuanchu Xie
Cc: Zi Yan
Signed-off-by: Andrew Morton