kernel/linux.git/mm/memcontrol.c, branch v7.2-rc1

Merge tag 'slab-for-7.2-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

2026-06-22T15:28:48+00:00

Pull more slab updates from Vlastimil Babka:

 - Introduce and wire up a new alloc_flags parameter for modifying slab-specific behavior without adding or reusing gfp flags. Also introduce slab_alloc_context to keep function parameter bloat in check. Both are similar to what the page allocator does. kmalloc_flags() exposes alloc_flags for mm-internal users.

    - SLAB_ALLOC_NOLOCK flag is used to implement kmalloc_nolock() behavior without relying on lack of __GFP_RECLAIM, which caused false positives with workarounds like fd3634312a04 ("debugobject: Make it work with deferred page initialization - again").

    - SLAB_ALLOC_NO_RECURSE replaces __GFP_NO_OBJ_EXT, which could have been removed, but pending memory allocation profiling changes in mm tree have grown a new user - there is however a work ongoing to replace that too, so __GFP_NO_OBJ_EXT should eventually be removed. (Vlastimil Babka)

 - Add kmem_buckets_alloc_track_caller() with a user to be added in the net tree (Pedro Falcato)

 - Fixes for kernel-doc and slabinfo (Randy Dunlap, Yichong Chen)

* tag 'slab-for-7.2-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  tools/mm/slabinfo: fix total_objects attribute name
  slab: recognize @GFP parameter as optional in kernel-doc
  mm/slab: add a node-track-caller variant for kmem buckets allocation
  mm/slab: replace __GFP_NO_OBJ_EXT with SLAB_ALLOC_NO_RECURSE for sheaves
  mm/slab: remove __GFP_NO_OBJ_EXT usage from alloc_slab_obj_exts()
  mm/slab: introduce kmalloc_flags()
  mm/slab: allow __GFP_NOMEMALLOC and __GFP_NOWARN for kmalloc_nolock()
  mm/slab: pass slab_alloc_context to __do_kmalloc_node()
  mm/slab: allow kmem_cache_alloc_bulk() with any gfp flags
  mm/slab: replace slab_alloc_node() parameters with slab_alloc_context
  mm/slab: pass alloc_flags through slab_post_alloc_hook() chain
  mm/slab: pass alloc_flags to new slab allocation
  mm/slab: add alloc_flags to slab_alloc_context
  mm/slab: replace struct partial_context with slab_alloc_context
  mm/slab: introduce alloc_flags and SLAB_ALLOC_NOLOCK
  mm/slab: introduce slab_alloc_context
  mm/slab: stop inlining __slab_alloc_node()
  mm/slab: do not init any kfence objects on allocation

mm/slab: pass alloc_flags through slab_post_alloc_hook() chain

2026-06-15T11:23:19+00:00

Convert the whole following call stack to pass either slab_alloc_context (thus including alloc_flags) or just alloc_flags as necessary:

  slab_post_alloc_hook()
    alloc_tagging_slab_alloc_hook()
      __alloc_tagging_slab_alloc_hook()
        prepare_slab_obj_exts_hook()
          alloc_slab_obj_exts()
    memcg_slab_post_alloc_hook()
      __memcg_slab_post_alloc_hook()
        alloc_slab_obj_exts()

Converting all these at once avoids unnecessary churn and is mostly mechanical.

This ultimately allows to decide if spinning is allowed using alloc_flags in alloc_slab_obj_exts(), as well as slab_post_alloc_hook(). Aside from alloc_from_pcs_bulk() (to be handled next) there is nothing else in slab itself relying on gfpflags_allow_spinning() which can be false even if not called from kmalloc_nolock().

A followup change will also use the alloc_flags availability in the call stack above to remove the __GFP_NO_OBJ_EXT flag.

For alloc_slab_obj_exts(), also replace the suboptimal "bool new_slab" parameter with a SLAB_ALLOC_NEW_SLAB flag with identical functionality.

To further reduce the number of parameters of slab_post_alloc_hook(), also make 'struct list_lru *lru' (which is NULL for most callers) a new field of slab_alloc_context.

Link: https://patch.msgid.link/20260610-slab_alloc_flags-v2-9-7190909db118@kernel.org
Reviewed-by: Harry Yoo (Oracle)
Reviewed-by: Suren Baghdasaryan
Reviewed-by: Hao Li
Signed-off-by: Vlastimil Babka (SUSE)
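
The parameter-object idea described in this commit can be sketched in userspace C. This is only an illustration, not the kernel's definition: the flag names, `alloc_flags` and the `lru` field come from the commit message, while the bit values, the struct layout and both helper names (`ctx_allow_spinning`, `ctx_is_new_slab`) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Flag names are taken from the commit messages; the bit values are assumed. */
#define SLAB_ALLOC_NOLOCK   (1u << 0)
#define SLAB_ALLOC_NEW_SLAB (1u << 1)

struct list_lru; /* opaque in this sketch */

/* Hypothetical layout: only alloc_flags and lru are named in the source. */
struct slab_alloc_context {
	unsigned int alloc_flags;
	struct list_lru *lru; /* NULL for most callers */
};

/* Hypothetical helper: spinning is decided from alloc_flags, not inferred from gfp flags. */
static int ctx_allow_spinning(const struct slab_alloc_context *ac)
{
	return !(ac->alloc_flags & SLAB_ALLOC_NOLOCK);
}

/* Hypothetical helper: a flag test takes the place of a "bool new_slab" parameter. */
static int ctx_is_new_slab(unsigned int alloc_flags)
{
	return !!(alloc_flags & SLAB_ALLOC_NEW_SLAB);
}
```

Passing one context pointer down the hook chain means a new per-allocation property becomes a new flag bit or field rather than another argument on every function in the stack.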

mm: switch deferred split shrinker to list_lru

2026-06-09T01:21:25+00:00

The deferred split queue handles cgroups in a suboptimal fashion. The queue is per-NUMA node or per-cgroup, not the intersection. That means on a cgrouped system, a node-restricted allocation entering reclaim can end up splitting large pages on other nodes:

  alloc/unmap
    deferred_split_folio()
      list_add_tail(memcg->split_queue)
      set_shrinker_bit(memcg, node, deferred_shrinker_id)

  for_each_zone_zonelist_nodemask(restricted_nodes)
    mem_cgroup_iter()
      shrink_slab(node, memcg)
        shrink_slab_memcg(node, memcg)
          if test_shrinker_bit(memcg, node, deferred_shrinker_id)
            deferred_split_scan()
              walks memcg->split_queue

The shrinker bit adds an imperfect guard rail. As soon as the cgroup has a single large page on the node of interest, all large pages owned by that memcg, including those on other nodes, will be split.

list_lru properly sets up per-node, per-cgroup lists. As a bonus, it streamlines a lot of the list operations and reclaim walks. It's used widely by other major shrinkers already. Convert the deferred split queue as well.

The list_lru per-memcg heads are instantiated on demand when the first object of interest is allocated for a cgroup, by calling folio_memcg_alloc_deferred(). Add calls to where splittable pages are created: anon faults, swapin faults, khugepaged collapse.

These calls create all possible node heads for the cgroup at once, so the migration code (between nodes) doesn't need any special care.

[akpm@linux-foundation.org: fix build with CONFIG_TRANSPARENT_HUGEPAGE=n]
Link: https://lore.kernel.org/202605281620.lc3rtkBm-lkp@intel.com
[hannes@cmpxchg.org: fix cgroup.memory=nokmem handling]
Link: https://lore.kernel.org/ah9PGv12mqai84ES@cmpxchg.org
Link: https://lore.kernel.org/20260527204757.2544958-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner
Reported-by: Mikhail Zaslonko
Tested-by: Mikhail Zaslonko
Acked-by: Shakeel Butt
Reviewed-by: Lorenzo Stoakes (Oracle)
Acked-by: Usama Arif
Reviewed-by: Kairui Song
Cc: Baolin Wang
Cc: Barry Song
Cc: Dave Chinner
Cc: David Hildenbrand (Arm)
Cc: Dev Jain
Cc: Lance Yang
Cc: Liam R. Howlett
Cc: Michal Hocko
Cc: Muchun Song
Cc: Nico Pache
Cc: Roman Gushchin
Cc: Ryan Roberts
Cc: Vasily Gorbik
Cc: Vlastimil Babka
Cc: Zi Yan
Cc: kernel test robot
Signed-off-by: Andrew Morton
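
A small userspace model can make the over-splitting described above concrete. It is a sketch under stated assumptions, not kernel code: `struct folio_stub`, `old_scan()` and `new_scan()` are hypothetical names, and the flat array stands in for the real queues. `old_scan()` models a per-memcg queue guarded by a per-(memcg, node) shrinker bit; `new_scan()` models per-node, per-cgroup lists as list_lru provides them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: a large folio tagged with its node and owning cgroup. */
struct folio_stub {
	int nid;
	int memcg;
};

/*
 * Old scheme: one queue per memcg plus a per-(memcg, node) shrinker bit.
 * If the memcg has any queued folio on @nid, the scan walks the whole
 * memcg queue, including folios that live on other nodes.
 */
static size_t old_scan(const struct folio_stub *q, size_t n, int nid, int memcg)
{
	size_t total = 0;
	int bit_set = 0;

	for (size_t i = 0; i < n; i++) {
		if (q[i].memcg != memcg)
			continue;
		total++;
		if (q[i].nid == nid)
			bit_set = 1;
	}
	return bit_set ? total : 0;
}

/* Per-node, per-cgroup lists: only folios matching both keys are walked. */
static size_t new_scan(const struct folio_stub *q, size_t n, int nid, int memcg)
{
	size_t total = 0;

	for (size_t i = 0; i < n; i++)
		if (q[i].memcg == memcg && q[i].nid == nid)
			total++;
	return total;
}
```

With one folio of a cgroup on node 0 and two on node 1, a scan restricted to node 0 visits three folios in the old model and one in the new model.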

memcg: multi objcg charge support

2026-06-04T21:45:07+00:00

Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type") split a memcg's single obj_cgroup into one per NUMA node so that reparenting LRU folios can take per-node lru locks. As a side effect, the per-CPU obj_stock_pcp -- which caches exactly one cached_objcg -- thrashes on workloads where threads of the same memcg run on different NUMA nodes. The kernel test robot reported a 67.7% regression on stress-ng.switch.ops_per_sec from this pattern.

Mirror the multi-slot pattern already used by memcg_stock_pcp: turn nr_bytes and cached_objcg into NR_OBJ_STOCK-element arrays, scan all slots on consume/refill/account, prefer empty slots when inserting, and evict a slot round-robin only when full. With multiple slots a CPU can hold the per-node objcg variants of one memcg plus a few siblings without ever forcing a drain.

A single int8_t index records which slot the cached slab stats belong to; the stats are flushed on slot or pgdat change.

With NR_OBJ_STOCK = 5 the layout (verified with pahole) is:

  offset  0 : lock(1) + index(1) + node_id(2) + slab stats(4) =  8B
  offset  8 : nr_bytes[5]                                     = 10B
  offset 18 : padding                                         =  6B
  offset 24 : cached[5]                                       = 40B
  offset 64 : (line 2) work_struct + flags (cold)

so consume_obj_stock, refill_obj_stock and the slab account path each touch exactly one 64-byte cache line on non-debug 64-bit builds.

Link: https://lore.kernel.org/20260526033931.1760588-5-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt
Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-lkp/202605121641.b6a60cb0-lkp@intel.com
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Tested-by: kernel test robot
Reviewed-by: Harry Yoo (Oracle)
Cc: Alexandre Ghiti
Cc: Johannes Weiner
Cc: Joshua Hahn
Cc: Michal Hocko
Cc: Muchun Song
Cc: Qi Zheng
Cc: Roman Gushchin
Signed-off-by: Andrew Morton
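
The slot policy described above (scan all slots, prefer an empty slot, evict round-robin only when full) can be sketched as a userspace model. This is an illustration under assumptions, not the kernel implementation: `struct objcg`, `struct obj_stock`, `next_victim`, `stock_consume()` and `stock_refill()` are hypothetical names, and locking, stats and uint16_t overflow handling are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_OBJ_STOCK 5 /* slot count named in the commit message */

/* Hypothetical stand-in for struct obj_cgroup; only pointer identity matters. */
struct objcg { int id; };

/* Simplified model of the per-CPU stock: hot fields only, no locking. */
struct obj_stock {
	uint8_t next_victim;                /* round-robin eviction cursor (assumed name) */
	uint16_t nr_bytes[NR_OBJ_STOCK];    /* cached bytes per slot */
	struct objcg *cached[NR_OBJ_STOCK]; /* NULL marks an empty slot */
};

/* Serve a charge from whichever slot caches @objcg. Returns 1 on a hit. */
static int stock_consume(struct obj_stock *s, struct objcg *objcg, uint16_t bytes)
{
	for (int i = 0; i < NR_OBJ_STOCK; i++) {
		if (s->cached[i] == objcg && s->nr_bytes[i] >= bytes) {
			s->nr_bytes[i] -= bytes;
			return 1;
		}
	}
	return 0;
}

/*
 * Cache @bytes for @objcg: reuse its slot, else take an empty slot, else
 * evict round-robin. Returns the bytes drained from an evicted slot.
 * Overflow of the uint16_t counter is not handled in this sketch.
 */
static unsigned int stock_refill(struct obj_stock *s, struct objcg *objcg, uint16_t bytes)
{
	int slot = -1;
	unsigned int drained = 0;

	for (int i = 0; i < NR_OBJ_STOCK; i++) {
		if (s->cached[i] == objcg) {
			s->nr_bytes[i] += bytes;
			return 0;
		}
		if (!s->cached[i] && slot < 0)
			slot = i;
	}
	if (slot < 0) {
		slot = s->next_victim;
		s->next_victim = (uint8_t)((s->next_victim + 1) % NR_OBJ_STOCK);
		drained = s->nr_bytes[slot];
	}
	s->cached[slot] = objcg;
	s->nr_bytes[slot] = bytes;
	return drained;
}
```

In this model five distinct objcgs coexist without any drain; only a sixth forces an eviction, which is the property the commit relies on for the per-node objcg variants of one memcg.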

memcg: int16_t for cached slab stats

2026-06-04T21:45:07+00:00

Currently struct obj_stock_pcp stores cached slab stats in 'int' which is 4 bytes per counter on 64-bit machines. Switch them to int16_t to shrink the cached metadata.

The existing PAGE_SIZE flush in __account_obj_stock() bounds *bytes at PAGE_SIZE on 4KiB and 16KiB page archs, well within int16_t. On 64KiB pages PAGE_SIZE is well above S16_MAX so that flush never fires, and a sufficiently long run of accumulations would overflow the cache.

Add an explicit S16_MAX guard before each add: when the next add would push abs(*bytes) past S16_MAX, fold the cached value into @nr and flush directly via mod_objcg_mlstate() before the accumulation.

Link: https://lore.kernel.org/20260526033931.1760588-4-shakeel.butt@linux.dev
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Signed-off-by: Shakeel Butt
Tested-by: kernel test robot
Reviewed-by: Harry Yoo (Oracle)
Acked-by: Qi Zheng
Acked-by: Muchun Song
Cc: Alexandre Ghiti
Cc: Johannes Weiner
Cc: Joshua Hahn
Cc: Michal Hocko
Cc: Roman Gushchin
Signed-off-by: Andrew Morton
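
The S16_MAX guard can be illustrated with a userspace sketch. It assumes a simplified flush target: `flushed_total` and `flush_stat()` are hypothetical stand-ins for the call to mod_objcg_mlstate(), `account_cached()` is a hypothetical name, and the kernel's separate PAGE_SIZE flush is not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define S16_MAX INT16_MAX

/* Hypothetical stand-in for the counter that mod_objcg_mlstate() would update. */
static long flushed_total;

static void flush_stat(int nr)
{
	flushed_total += nr;
}

/*
 * Add @nr to the int16_t cache @bytes. If the result would leave the
 * int16_t range, fold the cached value into @nr and flush the sum
 * directly instead of caching it.
 */
static void account_cached(int16_t *bytes, int nr)
{
	int sum = *bytes + nr; /* computed in int, so no int16_t wraparound */

	if (abs(sum) > S16_MAX) {
		flush_stat(sum);
		*bytes = 0;
		return;
	}
	*bytes = (int16_t)sum;
}
```

Starting from an empty cache, adding 30000 stays cached, and a further 5000 flushes 35000 in one step and resets the cache to zero.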

memcg: uint16_t for nr_bytes in obj_stock_pcp

2026-06-04T21:45:07+00:00

Currently struct obj_stock_pcp stores nr_bytes in an 'unsigned int' which is 4 bytes on 64-bit machines. Switch the field to uint16_t to shrink the per-CPU cache.

The kernel supports PAGE_SIZE_4KB, _8KB, _16KB, _32KB, _64KB and _256KB (see HAVE_PAGE_SIZE_* in arch/Kconfig). After the PAGE_SIZE-aligned flush in __refill_obj_stock(), the sub-page remainder fits in uint16_t up through 64KiB pages where PAGE_SIZE - 1 == U16_MAX, but on 256KiB pages PAGE_SIZE - 1 == 0x3FFFF exceeds U16_MAX.

The accumulator also needs to stay within uint16_t between page-aligned flushes on 64KiB pages where PAGE_SIZE itself is U16_MAX + 1. Accumulate the new total in an 'unsigned int' local, then on PAGE_SHIFT <= 16 flush whenever the accumulator would hit U16_MAX; together with the existing allow_uncharge flush at PAGE_SIZE this keeps the uint16_t safe.

On configs with PAGE_SHIFT > 16 (PAGE_SIZE_256KB on hexagon and powerpc 44x, both 32-bit), uint16_t cannot represent the sub-page remainder. Define obj_stock_bytes_t as 'unsigned int' on those archs so nr_bytes can hold the full remainder and the normal page-boundary flush in __refill_obj_stock() and the page extraction in drain_obj_stock() both work correctly. The single-cache-line layout target only applies to PAGE_SHIFT <= 16; those archs are 32-bit embedded and not the optimization target.

Link: https://lore.kernel.org/20260526033931.1760588-3-shakeel.butt@linux.dev
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Signed-off-by: Shakeel Butt
Tested-by: kernel test robot
Reviewed-by: Harry Yoo (Oracle)
Acked-by: Qi Zheng
Acked-by: Muchun Song
Cc: Alexandre Ghiti
Cc: Johannes Weiner
Cc: Joshua Hahn
Cc: Michal Hocko
Cc: Roman Gushchin
Signed-off-by: Andrew Morton
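
The wide-accumulator approach can be sketched in userspace C. The sketch assumes 64KiB pages (PAGE_SHIFT fixed at 16). The type name `obj_stock_bytes_t` comes from the commit message, while `NEEDS_U16_GUARD` and `refill_bytes()` are hypothetical names. It models only the arithmetic: whole pages are returned to the caller rather than uncharged.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 16 /* assumed for this sketch: 64KiB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define U16_MAX    UINT16_MAX

#if PAGE_SHIFT <= 16
typedef uint16_t obj_stock_bytes_t;     /* sub-page remainder fits in 16 bits */
#define NEEDS_U16_GUARD 1
#else
typedef unsigned int obj_stock_bytes_t; /* e.g. 256KiB pages need the wider type */
#define NEEDS_U16_GUARD 0
#endif

/*
 * Add @bytes to the cached remainder. The sum is built in an unsigned int
 * local so it cannot wrap the narrow field. Whole pages are extracted
 * either on the regular allow_uncharge path or, on PAGE_SHIFT <= 16, as
 * soon as the total no longer fits in uint16_t. Returns the page count.
 */
static unsigned int refill_bytes(obj_stock_bytes_t *nr_bytes, unsigned int bytes,
				 int allow_uncharge)
{
	unsigned int total = *nr_bytes + bytes;
	unsigned int pages = 0;

	if ((allow_uncharge && total > PAGE_SIZE) ||
	    (NEEDS_U16_GUARD && total > U16_MAX)) {
		pages = total >> PAGE_SHIFT;
		total &= PAGE_SIZE - 1;
	}
	*nr_bytes = (obj_stock_bytes_t)total;
	return pages;
}
```

On 64KiB pages the guard fires exactly when the total reaches PAGE_SIZE, so the remainder written back is at most PAGE_SIZE - 1, which equals U16_MAX.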

memcg: store node_id instead of pglist_data pointer

2026-06-04T21:45:06+00:00

Patch series "memcg: shrink obj_stock_pcp and cache multiple objcgs", v3.

Commit 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type") split a memcg's single obj_cgroup into one per NUMA node so that reparenting LRU folios can take per-node lru locks. As a side effect, the per-CPU obj_stock_pcp -- which caches a single cached_objcg pointer -- thrashes on workloads where threads of the same memcg run on different NUMA nodes. The kernel test robot reported a 67.7% regression on stress-ng.switch.ops_per_sec from this pattern.

Commit d0211878ce06 ("memcg: cache obj_stock by memcg, not by objcg pointer") landed as a temporary fix by treating sibling per-node objcgs as equivalent for the cache lookup, intended to be reverted once per-node kmem accounting is introduced.

This series takes a more general approach: cache multiple objcgs per CPU using the multi-slot pattern memcg_stock_pcp already uses, so the per-node objcg variants of one memcg can all coexist in the stock without ever forcing a drain. The temporary fix can then be reverted.

To avoid increasing the per-CPU cache footprint, the first three patches shrink the existing single-slot obj_stock_pcp fields. The final patch converts cached_objcg and nr_bytes into NR_OBJ_STOCK=5 slot arrays and reorders the struct so the entire consume/refill/account hot path fits within a single 64-byte cache line on non-debug 64-bit builds (verified with pahole).

This patch (of 4):

The struct obj_stock_pcp stores a pointer to pglist_data for the slab stats cached on the cpu. On 64-bit machines, this costs 8 bytes. The pointer is not strictly required: NODE_DATA() can recover it from the node id.

Replace cached_pgdat with int16_t node_id and use NUMA_NO_NODE as the "no stats cached" sentinel. At the moment all the archs limit MAX_NUMNODES to 1024 so int16_t is plenty; a BUILD_BUG_ON() makes sure we notice if that ever changes.

Link: https://lore.kernel.org/20260526033931.1760588-1-shakeel.butt@linux.dev
Link: https://lore.kernel.org/20260526033931.1760588-2-shakeel.butt@linux.dev
Fixes: 01b9da291c49 ("mm: memcontrol: convert objcg to be per-memcg per-node type")
Signed-off-by: Shakeel Butt
Tested-by: kernel test robot
Acked-by: Muchun Song
Reviewed-by: Harry Yoo (Oracle)
Acked-by: Qi Zheng
Cc: Alexandre Ghiti
Cc: Johannes Weiner
Cc: Joshua Hahn
Cc: Michal Hocko
Cc: Roman Gushchin
Signed-off-by: Andrew Morton
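
The pointer-to-index substitution can be shown with a short userspace sketch. The names below are assumptions chosen to mirror the commit message: `struct pgdat` and `node_data[]` stand in for pglist_data and its per-node table, `node_to_pgdat()` stands in for NODE_DATA(), and `_Static_assert` plays the role of the kernel's BUILD_BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUMA_NO_NODE (-1)
#define MAX_NUMNODES 1024 /* upper bound cited in the commit message */

/* Compile-time check that every valid node id fits in int16_t. */
_Static_assert(MAX_NUMNODES <= INT16_MAX, "node id must fit in int16_t");

/* Hypothetical stand-in for struct pglist_data. */
struct pgdat { int node_id; };

static struct pgdat node_data[MAX_NUMNODES];

/* Stand-in for NODE_DATA(): recover the pointer from the 2-byte id. */
static struct pgdat *node_to_pgdat(int16_t nid)
{
	return nid == NUMA_NO_NODE ? NULL : &node_data[nid];
}
```

Storing the 2-byte id instead of the pointer costs one table lookup when the stats are flushed, in exchange for 6 bytes of the hot per-CPU structure on 64-bit builds.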

mm/memcg: remove no longer used swap cgroup array

2026-06-02T22:22:23+00:00

Now all swap cgroup records are stored in the swap cluster directly, the static array is no longer needed.

Link: https://lore.kernel.org/20260517-swap-table-p4-v5-11-88ae43e064c7@tencent.com
Signed-off-by: Kairui Song
Acked-by: Chris Li
Cc: Baolin Wang
Cc: Baoquan He
Cc: Barry Song
Cc: Chengming Zhou
Cc: David Hildenbrand
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Kemeng Shi
Cc: Lorenzo Stoakes
Cc: Muchun Song
Cc: Nhat Pham
Cc: Roman Gushchin
Cc: Shakeel Butt
Cc: Youngjun Park
Cc: Zi Yan
Signed-off-by: Andrew Morton

mm/memcg, swap: store cgroup id in cluster table directly

2026-06-02T22:22:23+00:00

Drop the usage of the swap_cgroup_ctrl, and use the dynamic cluster table instead.

The per-cluster memcg table is 1024 / 512 bytes on most archs, and does not need RCU protection: the cgroup data is only read and written under the cluster lock. That keeps things simple, lets the allocation use plain kmalloc with immediate kfree (no deferred free), and keeps fragmentation acceptable.

[akpm@linux-foundation.org: memcgv1: don't compile swap functions when CONFIG_SWAP=n]
Link: https://lore.kernel.org/202605281711.bSeZlErK-lkp@intel.com
[akpm@linux-foundation.org: fix CONFIG_SWAP=n build]
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-10-88ae43e064c7@tencent.com
Signed-off-by: Kairui Song
Acked-by: Chris Li
Cc: Baolin Wang
Cc: Baoquan He
Cc: Barry Song
Cc: Chengming Zhou
Cc: David Hildenbrand
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Kemeng Shi
Cc: Lorenzo Stoakes
Cc: Muchun Song
Cc: Nhat Pham
Cc: Roman Gushchin
Cc: Shakeel Butt
Cc: Youngjun Park
Cc: Zi Yan
Signed-off-by: Andrew Morton

mm, swap: delay and unify memcg lookup and charging for swapin

2026-06-02T22:22:22+00:00

Instead of checking the cgroup private ID during page table walk in swap_pte_batch(), move the memcg lookup into __swap_cache_add_check() under the cluster lock.

The first pre-alloc check is speculative and skips the memcg check since the post-alloc stable check ensures all slots covered by the folio belong to the same memcg. It is very rare for contiguous and aligned entries across a contiguous region of a page table of the same process or shmem mapping to belong to different memcgs.

This also prepares for recording the memcg info in the cluster's table. Also make the order check and fallback more compact.

There should be no user-observable behavior change.

Link: https://lore.kernel.org/20260517-swap-table-p4-v5-8-88ae43e064c7@tencent.com
Signed-off-by: Kairui Song
Acked-by: Chris Li
Cc: Baolin Wang
Cc: Baoquan He
Cc: Barry Song
Cc: Chengming Zhou
Cc: David Hildenbrand
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Kemeng Shi
Cc: Lorenzo Stoakes
Cc: Muchun Song
Cc: Nhat Pham
Cc: Roman Gushchin
Cc: Shakeel Butt
Cc: Youngjun Park
Cc: Zi Yan
Signed-off-by: Andrew Morton