kernel/linux.git/mm/slub.c, branch linux-7.1.y

Merge tag 'slab-for-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

2026-05-22T13:23:56+00:00

Pull slab fix from Vlastimil Babka:

 - Stable fix for a missing cpus_read_lock in one of the cpu sheaves flushing paths (Qing Wang)

* tag 'slab-for-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slub: hold cpus_read_lock around flush_rcu_sheaves_on_cache()

mm/slub: hold cpus_read_lock around flush_rcu_sheaves_on_cache()

2026-05-14T12:56:58+00:00

flush_rcu_sheaves_on_cache() calls queue_work_on() in a for_each_online_cpu() loop, which requires the cpu to stay online. But cpus_read_lock() is not held in kvfree_rcu_barrier_on_cache() and the set of "online cpus" is subject to change.

There are two paths that call flush_rcu_sheaves_on_cache():

  // has cpus_read_lock()
  flush_all_rcu_sheaves()
    -> flush_rcu_sheaves_on_cache()

  // no cpus_read_lock()
  kvfree_rcu_barrier_on_cache()
    -> flush_rcu_sheaves_on_cache()

Fix this by holding cpus_read_lock() in kvfree_rcu_barrier_on_cache().

Why not move cpus_read_lock() from flush_all_rcu_sheaves() into flush_rcu_sheaves_on_cache()? The reason is it would introduce a new lock order (slab_mutex -> cpu_hotplug_lock). The reverse order (cpu_hotplug_lock -> slab_mutex) is established by

 - cpuhp_setup_state_nocalls(..., slub_cpu_setup, ...)
 - kmem_cache_destroy()

The two orders together would form an AB-BA deadlock.

Finally, add lockdep_assert_cpus_held() in flush_rcu_sheaves_on_cache() to catch the same problem in the future.

Fixes: 0f35040de593 ("mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction")
Cc:
Signed-off-by: Qing Wang
Link: https://patch.msgid.link/20260512035035.762317-1-wangqing7171@gmail.com
Signed-off-by: Vlastimil Babka (SUSE)
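The caller-holds-the-lock arrangement described above can be sketched in userspace C. This is only an illustrative model: every `model_*` name is hypothetical, the hotplug lock is reduced to a nesting counter, and the lockdep assertion is reduced to a return code so the failure case can be observed without aborting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of this is kernel code. */
static int cpus_read_depth; /* models nesting of cpus_read_lock() */
static int flush_calls;     /* counts flushes that were allowed to run */

static void model_cpus_read_lock(void)
{
    cpus_read_depth++;
}

static void model_cpus_read_unlock(void)
{
    assert(cpus_read_depth > 0);
    cpus_read_depth--;
}

/* Plays the role of lockdep_assert_cpus_held(), but reports instead of warning. */
static bool model_cpus_held(void)
{
    return cpus_read_depth > 0;
}

/* Callee: iterating "online cpus" is only valid while the lock is held. */
static int model_flush_rcu_sheaves_on_cache(void)
{
    if (!model_cpus_held())
        return -1; /* the unfixed call path would land here */
    flush_calls++;
    return 0;
}

/* Caller after the fix: takes the lock itself, outside of any slab_mutex-like lock. */
static int model_kvfree_rcu_barrier_on_cache(void)
{
    int ret;

    model_cpus_read_lock();
    ret = model_flush_rcu_sheaves_on_cache();
    model_cpus_read_unlock();
    return ret;
}
```

Keeping the lock acquisition in the callers, rather than in the callee, is what lets each caller preserve the existing cpu_hotplug_lock -> slab_mutex order in this model.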

mm/slab: Add kvfree_atomic() helper

2026-05-05T08:12:07+00:00

kvmalloc() now supports non-sleeping GFP flags, including the vmalloc fallback path. This means it may return vmalloc memory even for GFP_ATOMIC and GFP_NOWAIT allocations. Freeing such memory with kvfree() may then end up calling vfree(), which is not safe for non-sleeping contexts.

Introduce kvfree_atomic() helper for such cases. It mirrors kvfree(), but uses vfree_atomic() for vmalloced memory.

Signed-off-by: Uladzislau Rezki (Sony)
Acked-by: Vlastimil Babka (SUSE)
Acked-by: Harry Yoo (Oracle)
Signed-off-by: Herbert Xu
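The difference between the two free helpers comes down to which path is chosen for vmalloc-backed memory. The following is a minimal userspace sketch of that dispatch, not the kernel implementation: the `model_*` types and functions are hypothetical, and the backing type is stored in the object instead of being derived from the address as is_vmalloc_addr() would do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; the kernel decides by address, not by a tag. */
enum model_backing { MODEL_SLAB, MODEL_VMALLOC };
enum model_free_path { MODEL_NONE, MODEL_KFREE, MODEL_VFREE, MODEL_VFREE_ATOMIC };

struct model_obj {
    enum model_backing backing;
};

/* Stand-in for is_vmalloc_addr(). */
static bool model_is_vmalloc(const struct model_obj *obj)
{
    return obj->backing == MODEL_VMALLOC;
}

/* Models kvfree(): vmalloc memory goes to the sleeping-capable vfree() path. */
static enum model_free_path model_kvfree(const struct model_obj *obj)
{
    if (!obj)
        return MODEL_NONE;
    return model_is_vmalloc(obj) ? MODEL_VFREE : MODEL_KFREE;
}

/* Models kvfree_atomic(): same shape, but vmalloc memory takes the atomic-safe path. */
static enum model_free_path model_kvfree_atomic(const struct model_obj *obj)
{
    if (!obj)
        return MODEL_NONE;
    return model_is_vmalloc(obj) ? MODEL_VFREE_ATOMIC : MODEL_KFREE;
}
```

In this model the two helpers only diverge for vmalloc-backed objects, which matches the commit's description that kvfree_atomic() mirrors kvfree().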

mm/slab: return NULL early from kmalloc_nolock() in NMI on UP

2026-04-27T07:14:36+00:00

On UP kernels (!CONFIG_SMP), spin_trylock() is a no-op that unconditionally succeeds even when the lock is already held. As a result, kmalloc_nolock() called from NMI context can re-enter the slab allocator and acquire n->list_lock that the interrupted context is already holding, corrupting slab state.

With CONFIG_DEBUG_SPINLOCK on UP, the following BUG is triggered with the slub_kunit test module:

  BUG: spinlock trylock failure on UP on CPU#0, kunit_try_catch/243
  [...]
  Call Trace:
   dump_stack_lvl+0x3f/0x60
   do_raw_spin_trylock+0x41/0x50
   _raw_spin_trylock+0x24/0x50
   get_from_partial_node+0x120/0x4d0
   ___slab_alloc+0x8a/0x4c0
   kmalloc_nolock_noprof+0x164/0x310
  [...]

Fix this by returning NULL early when invoked from NMI on a UP kernel.

Link: https://lore.kernel.org/linux-mm/ad_cqe51pvr1WaDg@hyeyoo
Cc: stable@vger.kernel.org
Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().")
Signed-off-by: Harry Yoo (Oracle)
Link: https://patch.msgid.link/20260427-nolock-api-fix-v2-2-a6b83a92d9a4@kernel.org
Signed-off-by: Vlastimil Babka (SUSE)
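Why a trylock cannot protect against re-entry on UP can be shown with a small userspace model. Everything here is an assumption made for illustration: the `model_*` names are hypothetical, the lock is a plain flag, and the SMP/NMI state is passed in as booleans rather than taken from the kernel configuration or in_nmi().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock model: a flag instead of a real spinlock. */
struct model_lock {
    bool held;
};

/* SMP-style trylock: fails when the lock is already held. */
static bool model_trylock_smp(struct model_lock *lock)
{
    if (lock->held)
        return false;
    lock->held = true;
    return true;
}

/* UP-style trylock without debugging: reports success unconditionally. */
static bool model_trylock_up(struct model_lock *lock)
{
    lock->held = true;
    return true;
}

static int model_slot; /* dummy object handed out on success */

/* Models the fixed allocation path: bail out early for NMI on UP. */
static void *model_kmalloc_nolock(struct model_lock *list_lock, bool smp, bool in_nmi)
{
    bool locked;

    if (!smp && in_nmi)
        return NULL; /* the trylock result below could not be trusted */

    locked = smp ? model_trylock_smp(list_lock) : model_trylock_up(list_lock);
    if (!locked)
        return NULL;

    list_lock->held = false; /* unlock after the (elided) list manipulation */
    return &model_slot;
}
```

Without the early return, the UP branch of this model would "acquire" a lock that the interrupted context still holds and then release it on the way out, which is the kind of state corruption the commit describes.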

slub: fix data loss and overflow in krealloc()

2026-04-17T09:07:48+00:00

Commit 2cd8231796b5 ("mm/slub: allow to set node and align in k[v]realloc") introduced the ability to force a reallocation if the original object does not satisfy new alignment or NUMA node, even when the object is being shrunk.

This introduced two bugs in the reallocation fallback path:

1. Data loss during NUMA migration: The jump to 'alloc_new' happens before 'ks' and 'orig_size' are initialized. As a result, the memcpy() in the 'alloc_new' block would copy 0 bytes into the new allocation.

2. Buffer overflow during shrinking: When shrinking an object while forcing a new alignment, 'new_size' is smaller than the old size. However, the memcpy() used the old size ('orig_size ?: ks'), leading to an out-of-bounds write.

The same overflow bug exists in the kvrealloc() fallback path, where the old bucket size ksize(p) is copied into the new buffer without being bounded by the new size.

A simple reproducer:

  // e.g. add to lkdtm as KREALLOC_SHRINK_OVERFLOW
  while (1) {
          void *p = kmalloc(128, GFP_KERNEL);
          p = krealloc_node_align(p, 64, 256, GFP_KERNEL, NUMA_NO_NODE);
          kfree(p);
  }

demonstrates the issue:

  ==================================================================
  BUG: KFENCE: out-of-bounds write in memcpy_orig+0x68/0x130

  Out-of-bounds write at 0xffff8883ad757038 (120B right of kfence-#47):
   memcpy_orig+0x68/0x130
   krealloc_node_align_noprof+0x1c8/0x340
   lkdtm_KREALLOC_SHRINK_OVERFLOW+0x8c/0xc0 [lkdtm]
   lkdtm_do_action+0x3a/0x60 [lkdtm]
   ...

  kfence-#47: 0xffff8883ad756fc0-0xffff8883ad756fff, size=64, cache=kmalloc-64

  allocated by task 316 on cpu 7 at 97.680481s (0.021813s ago):
   krealloc_node_align_noprof+0x19c/0x340
   lkdtm_KREALLOC_SHRINK_OVERFLOW+0x8c/0xc0 [lkdtm]
   lkdtm_do_action+0x3a/0x60 [lkdtm]
   ...
  ==================================================================

Fix it by moving the old size calculation to the top of __do_krealloc() and bounding all copy lengths by the new allocation size.

Fixes: 2cd8231796b5 ("mm/slub: allow to set node and align in k[v]realloc")
Cc: stable@vger.kernel.org
Reported-by: https://sashiko.dev/#/patchset/20260415143735.2974230-1-elver%40google.com
Signed-off-by: Marco Elver
Link: https://patch.msgid.link/20260416132837.3787694-1-elver@google.com
Reviewed-by: Harry Yoo (Oracle)
Signed-off-by: Vlastimil Babka (SUSE)
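The rule "know the old size before any jump, and never copy more than the new buffer holds" can be sketched with plain malloc() and memcpy(). This is a userspace illustration only: `model_realloc_move()` and `model_min()` are hypothetical helpers, the old size is passed in explicitly instead of being looked up, and a nonzero new size is assumed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static size_t model_min(size_t a, size_t b)
{
    return a < b ? a : b;
}

/*
 * Hypothetical forced-move reallocation: always allocates a fresh buffer,
 * as the fallback path does when alignment or node no longer match.
 * old_size is known up front, so it is never zero by accident, and the
 * copy length is bounded by new_size, so a shrink cannot overflow.
 */
static void *model_realloc_move(void *old, size_t old_size, size_t new_size)
{
    void *ret = malloc(new_size);

    if (!ret)
        return NULL; /* old buffer is left untouched on failure */

    if (old) {
        memcpy(ret, old, model_min(old_size, new_size));
        free(old);
    }
    return ret;
}
```

Shrinking a 128-byte buffer to 64 bytes through this helper copies exactly 64 bytes, while growing it copies all of the old contents; both of the bugs listed above correspond to getting that length wrong.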

Merge branch 'slab/for-7.1/misc' into slab/for-next

2026-04-13T11:23:36+00:00

Merge misc slab changes that are not related to sheaves. Various improvements for sysfs, debugging and testing.

slub: clarify kmem_cache_refill_sheaf() comments

2026-04-07T12:39:11+00:00

In the in-place refill case, some objects may already have been added before the function returns -ENOMEM. Clarify this behavior and polish the rest of the comment for readability.

Acked-by: Harry Yoo (Oracle)
Signed-off-by: Hao Li
Link: https://patch.msgid.link/20260407120018.42692-1-hao.li@linux.dev
Signed-off-by: Vlastimil Babka (SUSE)

slub: use N_NORMAL_MEMORY in can_free_to_pcs to handle remote frees

2026-04-07T09:10:52+00:00

Memory hotplug now keeps N_NORMAL_MEMORY up to date correctly, so make can_free_to_pcs() use it.

As a result, when freeing objects on memoryless nodes, or on nodes that have memory but only in ZONE_MOVABLE, the objects can be freed to the sheaf instead of going through the slow path.

Signed-off-by: Hao Li
Acked-by: Harry Yoo (Oracle)
Acked-by: David Rientjes
Link: https://patch.msgid.link/20260403073958.8722-1-hao.li@linux.dev
Signed-off-by: Vlastimil Babka (SUSE)

slab: free remote objects to sheaves on memoryless nodes

2026-03-19T12:22:49+00:00

On memoryless nodes we can now allocate from cpu sheaves and refill them normally. But when a node is memoryless on a system without actual CONFIG_HAVE_MEMORYLESS_NODES support, freeing always uses the slowpath because all objects appear as remote. We could instead benefit from the freeing fastpath, because the allocations can't obtain local objects anyway if the node is memoryless.

Thus adapt the locality checks when freeing, and move them to an inline function can_free_to_pcs() for a single shared implementation.

On configurations with CONFIG_HAVE_MEMORYLESS_NODES=y continue using numa_mem_id() so the percpu sheaves and barn on a memoryless node will contain mostly objects from the closest memory node (returned by numa_mem_id()). No change is thus intended for such configurations.

On systems with CONFIG_HAVE_MEMORYLESS_NODES=n use numa_node_id() (the cpu's node) since numa_mem_id() just aliases it anyway. But if we are freeing on a memoryless node, allow the freeing to use percpu sheaves for objects from any node, since they are all remote anyway. This way we avoid the slowpath and get more performant freeing. The potential downside is that allocations will obtain objects with a larger average distance. If we kept bypassing the sheaves on freeing, a refill of sheaves from slabs would tend to get closer objects thanks to the ordering of the zonelist.

Architectures that allow de-facto memoryless nodes without proper CONFIG_HAVE_MEMORYLESS_NODES support should perhaps consider adding such support.

Link: https://patch.msgid.link/20260311-b4-slab-memoryless-barns-v1-3-70ab850be4ce@kernel.org
Signed-off-by: Vlastimil Babka (SUSE)
Reviewed-by: Harry Yoo
Reviewed-by: Hao Li
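The locality decision described above can be written down as a small pure function. This is a sketch under stated assumptions, not the actual can_free_to_pcs(): the `model_cfg` struct and its fields are hypothetical, and the values that the kernel obtains from CONFIG_HAVE_MEMORYLESS_NODES, numa_node_id(), numa_mem_id() and the node state masks are passed in as plain data.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of what the freeing cpu would see. */
struct model_cfg {
    bool have_memoryless_nodes; /* models CONFIG_HAVE_MEMORYLESS_NODES=y */
    int cpu_node;               /* models numa_node_id() */
    int mem_node;               /* models numa_mem_id() */
    bool cpu_node_has_memory;   /* models the node-state check for cpu_node */
};

/* Returns true when an object from obj_node may be freed to the percpu sheaves. */
static bool model_can_free_to_pcs(const struct model_cfg *cfg, int obj_node)
{
    /* Proper memoryless support: objects from the nearest memory node count as local. */
    if (cfg->have_memoryless_nodes)
        return obj_node == cfg->mem_node;

    /* De-facto memoryless node: every object is remote, so accept all of them. */
    if (!cfg->cpu_node_has_memory)
        return true;

    /* Ordinary node with memory: only node-local objects use the fastpath. */
    return obj_node == cfg->cpu_node;
}
```

The middle branch is the behavioral change in this model: it trades a larger average object distance for avoiding the slowpath on every free.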

slab: create barns for online memoryless nodes

2026-03-19T12:22:44+00:00

Ming Lei has reported [1] a performance regression due to replacing cpu (partial) slabs with sheaves. With slub stats enabled, a large amount of slowpath allocations were observed. The affected system has 8 online NUMA nodes but only 2 have memory.

For sheaves to work effectively on a given cpu, its NUMA node has to have struct node_barn allocated. Those are currently only allocated on nodes with memory (N_MEMORY) where kmem_cache_node also exists, as the goal is to cache only node-local objects. But in order to have good performance on a memoryless node, we need its barn to exist and use sheaves to cache non-local objects (as no local objects can exist anyway).

Therefore change the implementation to allocate barns on all online nodes, tracked in a new nodemask slab_barn_nodes. Also add a cpu hotplug callback as that's when a memoryless node can become online.

Change both get_barn() and rcu_sheaf->node assignment to numa_node_id() so it's returned to the barn of the local cpu's (potentially memoryless) node, and not to the nearest node with memory anymore.

On systems with CONFIG_HAVE_MEMORYLESS_NODES=y (which are not the main target of this change) barns did not exist on memoryless nodes, but get_barn() using numa_mem_id() meant a barn was returned from the nearest node with memory. This works, but the barn lock contention increases with every such memoryless node. With this change, a barn will be allocated also on the memoryless node, reducing this contention in exchange for increased memory consumption.

Reported-by: Ming Lei
Link: https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/ [1]
Link: https://patch.msgid.link/20260311-b4-slab-memoryless-barns-v1-2-70ab850be4ce@kernel.org
Signed-off-by: Vlastimil Babka (SUSE)
Reviewed-by: Harry Yoo
Reviewed-by: Hao Li
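The effect of tracking barns for all online nodes, rather than only nodes with memory, can be illustrated with bitmasks. This is a toy model: the `model_*` names are hypothetical, nodemasks are reduced to an unsigned integer, and the choice of which two of the eight nodes have memory in the usage below is an assumption, since the report only gives the counts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical topology: bit n set means node n is in the mask. */
struct model_topology {
    unsigned online_mask; /* models the set of online nodes */
    unsigned memory_mask; /* models N_MEMORY */
};

/* Which nodes get a barn, before and after the change. */
static unsigned model_barn_nodes(const struct model_topology *t, bool all_online)
{
    return all_online ? t->online_mask : (t->online_mask & t->memory_mask);
}

/* Does a cpu on cpu_node find a barn on its own node? */
static bool model_cpu_has_local_barn(const struct model_topology *t, int cpu_node,
                                     bool all_online)
{
    return ((model_barn_nodes(t, all_online) >> cpu_node) & 1u) != 0;
}

/* Counts set bits, i.e. how many barns would be allocated per cache. */
static int model_count_nodes(unsigned mask)
{
    int count = 0;

    for (; mask; mask &= mask - 1)
        count++;
    return count;
}
```

With eight online nodes of which two have memory, the model yields two barns before the change and eight after it, which makes the trade-off between lock contention and memory consumption concrete.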