kernel/linux.git/mm/slub.c, branch v7.0.10

mm/slab: return NULL early from kmalloc_nolock() in NMI on UP

2026-05-07T04:14:17+00:00

commit 5b31044e649e3e54c2caef135c09b371c2fbcd08 upstream. On UP kernels (!CONFIG_SMP), spin_trylock() is a no-op that unconditionally succeeds even when the lock is already held. As a result, kmalloc_nolock() called from NMI context can re-enter the slab allocator and acquire n->list_lock that the interrupted context is already holding, corrupting slab state. With CONFIG_DEBUG_SPINLOCK on UP, the following BUG is triggered with the slub_kunit test module: BUG: spinlock trylock failure on UP on CPU#0, kunit_try_catch/243 [...] Call Trace: dump_stack_lvl+0x3f/0x60 do_raw_spin_trylock+0x41/0x50 _raw_spin_trylock+0x24/0x50 get_from_partial_node+0x120/0x4d0 ___slab_alloc+0x8a/0x4c0 kmalloc_nolock_noprof+0x164/0x310 [...] Fix this by returning NULL early when invoked from NMI on a UP kernel. Link: https://lore.kernel.org/linux-mm/ad_cqe51pvr1WaDg@hyeyoo Cc: stable@vger.kernel.org Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Signed-off-by: Harry Yoo (Oracle) Link: https://patch.msgid.link/20260427-nolock-api-fix-v2-2-a6b83a92d9a4@kernel.org Signed-off-by: Vlastimil Babka (SUSE) Signed-off-by: Greg Kroah-Hartman

slub: fix data loss and overflow in krealloc()

2026-05-07T04:13:58+00:00

commit 082a6d03a2d685a83a332666b500ad3966349588 upstream. Commit 2cd8231796b5 ("mm/slub: allow to set node and align in k[v]realloc") introduced the ability to force a reallocation if the original object does not satisfy new alignment or NUMA node, even when the object is being shrunk. This introduced two bugs in the reallocation fallback path: 1. Data loss during NUMA migration: The jump to 'alloc_new' happens before 'ks' and 'orig_size' are initialized. As a result, the memcpy() in the 'alloc_new' block would copy 0 bytes into the new allocation. 2. Buffer overflow during shrinking: When shrinking an object while forcing a new alignment, 'new_size' is smaller than the old size. However, the memcpy() used the old size ('orig_size ?: ks'), leading to an out-of-bounds write. The same overflow bug exists in the kvrealloc() fallback path, where the old bucket size ksize(p) is copied into the new buffer without being bounded by the new size. A simple reproducer: // e.g. add to lkdtm as KREALLOC_SHRINK_OVERFLOW while (1) { void *p = kmalloc(128, GFP_KERNEL); p = krealloc_node_align(p, 64, 256, GFP_KERNEL, NUMA_NO_NODE); kfree(p); } demonstrates the issue: ================================================================== BUG: KFENCE: out-of-bounds write in memcpy_orig+0x68/0x130 Out-of-bounds write at 0xffff8883ad757038 (120B right of kfence-#47): memcpy_orig+0x68/0x130 krealloc_node_align_noprof+0x1c8/0x340 lkdtm_KREALLOC_SHRINK_OVERFLOW+0x8c/0xc0 [lkdtm] lkdtm_do_action+0x3a/0x60 [lkdtm] ... kfence-#47: 0xffff8883ad756fc0-0xffff8883ad756fff, size=64, cache=kmalloc-64 allocated by task 316 on cpu 7 at 97.680481s (0.021813s ago): krealloc_node_align_noprof+0x19c/0x340 lkdtm_KREALLOC_SHRINK_OVERFLOW+0x8c/0xc0 [lkdtm] lkdtm_do_action+0x3a/0x60 [lkdtm] ... ================================================================== Fix it by moving the old size calculation to the top of __do_krealloc() and bounding all copy lengths by the new allocation size. Fixes: 2cd8231796b5 ("mm/slub: allow to set node and align in k[v]realloc") Cc: stable@vger.kernel.org Reported-by: https://sashiko.dev/#/patchset/20260415143735.2974230-1-elver%40google.com Signed-off-by: Marco Elver Link: https://patch.msgid.link/20260416132837.3787694-1-elver@google.com Reviewed-by: Harry Yoo (Oracle) Signed-off-by: Vlastimil Babka (SUSE) Signed-off-by: Greg Kroah-Hartman

slab: fix memory leak when refill_sheaf() fails

2026-03-11T16:55:26+00:00

When refill_sheaf() partially fills one sheaf (e.g., fills 5 objects but need to fill 10), it will update sheaf->size and return -ENOMEM. However, the callers (alloc_full_sheaf() and __pcs_replace_empty_main()) directly call free_empty_sheaf() on failure, which only does kfree(sheaf), causing the partially allocated objects memory in sheaf->objects[] leaked. Fix this by calling sheaf_flush_unused() before free_empty_sheaf() to free objects of sheaf->objects[]. And also add a WARN_ON() in free_empty_sheaf() to catch any future cases where a non-empty sheaf is being freed. Fixes: ed30c4adfc2b ("slab: add optimized sheaf refill from partial list") Signed-off-by: Qing Wang Link: https://patch.msgid.link/20260311093617.4155965-1-wangqing7171@gmail.com Reviewed-by: Harry Yoo Reviewed-by: Hao Li Signed-off-by: Vlastimil Babka (SUSE)

mm/slab: fix an incorrect check in obj_exts_alloc_size()

2026-03-10T10:02:54+00:00

obj_exts_alloc_size() prevents recursive allocation of slabobj_ext array from the same cache, to avoid creating slabs that are never freed. There is one mistake that returns the original size when memory allocation profiling is disabled. The assumption was that memcg-triggered slabobj_ext allocation is always served from KMALLOC_CGROUP type. But this is wrong [1]: when the caller specifies both __GFP_RECLAIMABLE and __GFP_ACCOUNT with SLUB_TINY enabled, the allocation is served from normal kmalloc. This is because kmalloc_type() prioritizes __GFP_RECLAIMABLE over __GFP_ACCOUNT, and SLUB_TINY aliases KMALLOC_RECLAIM with KMALLOC_NORMAL. As a result, the recursion guard is bypassed and the problematic slabs can be created. Fix this by removing the mem_alloc_profiling_enabled() check entirely. The remaining is_kmalloc_normal() check is still sufficient to detect whether the cache is of KMALLOC_NORMAL type and avoid bumping the size if it's not. Without SLUB_TINY, no functional change intended. With SLUB_TINY, allocations with __GFP_ACCOUNT|__GFP_RECLAIMABLE now allocate a larger array if the sizes equal. Reported-by: Zw Tang Fixes: 280ea9c3154b ("mm/slab: avoid allocating slabobj_ext array from its own slab") Closes: https://lore.kernel.org/linux-mm/CAPHJ_VKuMKSke8b11AZQw1PTSFN4n2C0gFxC6xGOG0ZLHgPmnA@mail.gmail.com [1] Cc: stable@vger.kernel.org Signed-off-by: Harry Yoo Link: https://patch.msgid.link/20260309072219.22653-1-harry.yoo@oracle.com Tested-by: Zw Tang Signed-off-by: Vlastimil Babka (SUSE)

mm/slab: allow sheaf refill if blocking is not allowed

2026-03-04T10:03:54+00:00

Ming Lei reported [1] a regression in the ublk null target benchmark due to sheaves. The profile shows that the alloc_from_pcs() fastpath fails and allocations fall back to ___slab_alloc(). It also shows the allocations happen through mempool_alloc(). The strategy of mempool_alloc() is to call the underlying allocator (here slab) without __GFP_DIRECT_RECLAIM first. This does not play well with __pcs_replace_empty_main() checking for gfpflags_allow_blocking() to decide if it should refill an empty sheaf or fallback to the slowpath, so we end up falling back. We could change the mempool strategy but there might be other paths doing the same ting. So instead allow sheaf refill when blocking is not allowed, changing the condition to gfpflags_allow_spinning(). The original condition was unnecessarily restrictive. Note this doesn't fully resolve the regression [1] as another component of that are memoryless nodes, which is to be addressed separately. Reported-by: Ming Lei Fixes: e47c897a2949 ("slab: add sheaves to most caches") Link: https://lore.kernel.org/all/aZ0SbIqaIkwoW2mB@fedora/ [1] Link: https://patch.msgid.link/20260302095536.34062-2-vbabka@kernel.org Signed-off-by: Vlastimil Babka (SUSE)

slab: distinguish lock and trylock for sheaf_flush_main()

2026-03-02T09:04:22+00:00

sheaf_flush_main() can be called from __pcs_replace_full_main() where it's fine if the trylock fails, and pcs_flush_all() where it's not expected to and for some flush callers (when destroying the cache or memory hotremove) it would be actually a problem if it failed and left the main sheaf not flushed. The flush callers can however safely use local_lock() instead of trylock. The trylock failure should not happen in practice on !PREEMPT_RT, but can happen on PREEMPT_RT. The impact is limited in practice because when a trylock fails in the kmem_cache_destroy() path, it means someone is using the cache while destroying it, which is a bug on its own. The memory hotremove path is unlikely to be employed in a production RT config, but it's possible. To fix this, split the function into sheaf_flush_main() (using local_lock()) and sheaf_try_flush_main() (using local_trylock()) where both call __sheaf_flush_main_batch() to flush a single batch of objects. This will also allow lockdep to verify our context assumptions. The problem was raised in an off-list question by Marcelo. Fixes: 2d517aa09bbc ("slab: add opt-in caching layer of percpu sheaves") Cc: stable@vger.kernel.org Reported-by: Marcelo Tosatti Signed-off-by: Vlastimil Babka Reviewed-by: Harry Yoo Reviewed-by: Hao Li Link: https://patch.msgid.link/20260211-b4-sheaf-flush-v1-1-4e7f492f0055@suse.cz Signed-off-by: Vlastimil Babka (SUSE)

mm/slab: initialize slab->stride early to avoid memory ordering issues

2026-02-27T15:22:57+00:00

When alloc_slab_obj_exts() is called later (instead of during slab allocation and initialization), slab->stride and slab->obj_exts are updated after the slab is already accessible by multiple CPUs. The current implementation does not enforce memory ordering between slab->stride and slab->obj_exts. For correctness, slab->stride must be visible before slab->obj_exts. Otherwise, concurrent readers may observe slab->obj_exts as non-zero while stride is still stale. With stale slab->stride, slab_obj_ext() could return the wrong obj_ext. This could cause two problems: - obj_cgroup_put() is called on the wrong objcg, leading to a use-after-free due to incorrect reference counting [1] by decrementing the reference count more than it was incremented. - refill_obj_stock() is called on the wrong objcg, leading to a page_counter overflow [2] by uncharging more memory than charged. Fix this by unconditionally initializing slab->stride in alloc_slab_obj_exts_early(), before the need_slab_obj_exts() check. In the case of SLAB_OBJ_EXT_IN_OBJ, it is overridden in the function. This ensures updates to slab->stride become visible before the slab can be accessed by other CPUs via the per-node partial slab list (protected by spinlock with acquire/release semantics). Thanks to Shakeel Butt for pointing out this issue [3]. [vbabka@kernel.org: the bug reports [1] and [2] are not yet fully fixed, with investigation ongoing, but it is nevertheless a step in the right direction to only set stride once after allocating the slab and not change it later ] Fixes: 7a8e71bc619d ("mm/slab: use stride to access slabobj_ext") Reported-by: Venkat Rao Bagalkote Link: https://lore.kernel.org/lkml/ca241daa-e7e7-4604-a48d-de91ec9184a5@linux.ibm.com [1] Link: https://lore.kernel.org/all/ddff7c7d-c0c3-4780-808f-9a83268bbf0c@linux.ibm.com [2] Link: https://lore.kernel.org/linux-mm/aZu9G9mVIVzSm6Ft@hyeyoo [3] Signed-off-by: Harry Yoo Signed-off-by: Vlastimil Babka (SUSE)

mm/slab: mark alloc tags empty for sheaves allocated with __GFP_NO_OBJ_EXT

2026-02-26T16:30:32+00:00

alloc_empty_sheaf() allocates sheaves from SLAB_KMALLOC caches using __GFP_NO_OBJ_EXT to avoid recursion, however it does not mark their allocation tags empty before freeing, which results in a warning when CONFIG_MEM_ALLOC_PROFILING_DEBUG is set. Fix this by marking allocation tags for such sheaves as empty. The problem was technically introduced in commit 4c0a17e28340 but only becomes possible to hit with commit 913ffd3a1bf5. Fixes: 4c0a17e28340 ("slab: prevent recursive kmalloc() in alloc_empty_sheaf()") Fixes: 913ffd3a1bf5 ("slab: handle kmalloc sheaves bootstrap") Reported-by: David Wang <00107082@163.com> Closes: https://lore.kernel.org/all/20260223155128.3849-1-00107082@163.com/ Analyzed-by: Harry Yoo Signed-off-by: Suren Baghdasaryan Reviewed-by: Harry Yoo Tested-by: Harry Yoo Tested-by: David Wang <00107082@163.com> Link: https://patch.msgid.link/20260225163407.2218712-1-surenb@google.com Signed-off-by: Vlastimil Babka (SUSE)

mm/slab: pass __GFP_NOWARN to refill_sheaf() if fallback is available

2026-02-26T16:30:06+00:00

When refill_sheaf() is called, failing to refill the sheaf doesn't necessarily mean the allocation will fail because a fallback path might be available and serve the allocation request. Suppress spurious warnings by passing __GFP_NOWARN along with __GFP_NOMEMALLOC whenever a fallback path is available. When the caller is alloc_full_sheaf() or __pcs_replace_empty_main(), the kernel always falls back to the slowpath (__slab_alloc_node()). For __prefill_sheaf_pfmemalloc(), the fallback path is available only when gfp_pfmemalloc_allowed() returns true. Reported-and-tested-by: Chris Bainbridge Closes: https://lore.kernel.org/linux-mm/aZt2-oS9lkmwT7Ch@debian.local Fixes: 1ce20c28eafd ("slab: handle pfmemalloc slabs properly with sheaves") Link: https://lore.kernel.org/linux-mm/aZwSreGj9-HHdD-j@hyeyoo Signed-off-by: Harry Yoo Link: https://patch.msgid.link/20260223133322.16705-1-harry.yoo@oracle.com Tested-by: Mikhail Gavrilov Signed-off-by: Vlastimil Babka (SUSE)

Convert 'alloc_obj' family to use the new default GFP_KERNEL argument

2026-02-22T01:09:51+00:00

This was done entirely with mindless brute force, using git grep -l '\