summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2026-02-10mm/slab: allow freeing kmalloc_nolock()'d objects using kfree[_rcu]()Harry Yoo3-15/+32
Slab objects that are allocated with kmalloc_nolock() must be freed using kfree_nolock() because only a subset of alloc hooks are called, since kmalloc_nolock() can't spin on a lock during allocation. This imposes a limitation: such objects cannot be freed with kfree_rcu(), forcing users to work around this limitation by calling call_rcu() with a callback that frees the object using kfree_nolock(). Remove this limitation by teaching kmemleak to gracefully ignore cases when kmemleak_free() or kmemleak_ignore() is called without a prior kmemleak_alloc(). Unlike kmemleak, kfence already handles this case, because, due to its design, only a subset of allocations are served from kfence. With this change, kfree() and kfree_rcu() can be used to free objects that are allocated using kmalloc_nolock(). Suggested-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260210044642.139482-2-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-10mm/slab: use prandom if !allow_spinHarry Yoo1-4/+24
When CONFIG_SLAB_FREELIST_RANDOM is enabled and get_random_u32() is called in an NMI context, lockdep complains because it acquires a local_lock: ================================ WARNING: inconsistent lock state 6.19.0-rc5-slab-for-next+ #325 Tainted: G N -------------------------------- inconsistent {INITIAL USE} -> {IN-NMI} usage. kunit_try_catch/8312 [HC2[2]:SC0[0]:HE0:SE1] takes: ffff88a02ec49cc0 (batched_entropy_u32.lock){-.-.}-{3:3}, at: get_random_u32+0x7f/0x2e0 {INITIAL USE} state was registered at: lock_acquire+0xd9/0x2f0 get_random_u32+0x93/0x2e0 __get_random_u32_below+0x17/0x70 cache_random_seq_create+0x121/0x1c0 init_cache_random_seq+0x5d/0x110 do_kmem_cache_create+0x1e0/0xa30 __kmem_cache_create_args+0x4ec/0x830 create_kmalloc_caches+0xe6/0x130 kmem_cache_init+0x1b1/0x660 mm_core_init+0x1d8/0x4b0 start_kernel+0x620/0xcd0 x86_64_start_reservations+0x18/0x30 x86_64_start_kernel+0xf3/0x140 common_startup_64+0x13e/0x148 irq event stamp: 76 hardirqs last enabled at (75): [<ffffffff8298b77a>] exc_nmi+0x11a/0x240 hardirqs last disabled at (76): [<ffffffff8298b991>] sysvec_irq_work+0x11/0x110 softirqs last enabled at (0): [<ffffffff813b2dda>] copy_process+0xc7a/0x2350 softirqs last disabled at (0): [<0000000000000000>] 0x0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(batched_entropy_u32.lock); <Interrupt> lock(batched_entropy_u32.lock); *** DEADLOCK *** Fix this by using pseudo-random number generator if !allow_spin. This means kmalloc_nolock() users won't get truly random numbers, but there is not much we can do about it. Note that an NMI handler might interrupt prandom_u32_state() and change the random state, but that's safe. Link: https://lore.kernel.org/all/0c33bdee-6de8-4d9f-92ca-4f72c1b6fb9f@suse.cz Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Cc: stable@vger.kernel.org Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260210081900.329447-3-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-10mm/slab: do not access current->mems_allowed_seq if !allow_spinHarry Yoo1-2/+11
Lockdep complains when get_from_any_partial() is called in an NMI context, because current->mems_allowed_seq is seqcount_spinlock_t and not NMI-safe: ================================ WARNING: inconsistent lock state 6.19.0-rc5-kfree-rcu+ #315 Tainted: G N -------------------------------- inconsistent {INITIAL USE} -> {IN-NMI} usage. kunit_try_catch/9989 [HC1[1]:SC0[0]:HE0:SE1] takes: ffff889085799820 (&____s->seqcount#3){.-.-}-{0:0}, at: ___slab_alloc+0x58f/0xc00 {INITIAL USE} state was registered at: lock_acquire+0x185/0x320 kernel_init_freeable+0x391/0x1150 kernel_init+0x1f/0x220 ret_from_fork+0x736/0x8f0 ret_from_fork_asm+0x1a/0x30 irq event stamp: 56 hardirqs last enabled at (55): [<ffffffff850a68d7>] _raw_spin_unlock_irq+0x27/0x70 hardirqs last disabled at (56): [<ffffffff850858ca>] __schedule+0x2a8a/0x6630 softirqs last enabled at (0): [<ffffffff81536711>] copy_process+0x1dc1/0x6a10 softirqs last disabled at (0): [<0000000000000000>] 0x0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&____s->seqcount#3); <Interrupt> lock(&____s->seqcount#3); *** DEADLOCK *** According to Documentation/locking/seqlock.rst, seqcount_t is not NMI-safe and seqcount_latch_t should be used when read path can interrupt the write-side critical section. In this case, do not access current->mems_allowed_seq and avoid retry. Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Cc: stable@vger.kernel.org Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260210081900.329447-2-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-10Merge branch 'slab/for-7.0/sheaves' into slab/for-nextVlastimil Babka7-1698/+909
Merge series "slab: replace cpu (partial) slabs with sheaves". The percpu sheaves caching layer was introduced as opt-in but the goal was to eventually move all caches to them. This is the next step, enabling sheaves for all caches (except the two bootstrap ones) and then removing the per cpu (partial) slabs and lots of associated code. Besides the lower locking overhead and much more likely fastpath when freeing, this removes the rather complicated code related to the cpu slab lockless fastpaths (using this_cpu_try_cmpxchg128/64) and all its complications for PREEMPT_RT or kmalloc_nolock(). The lockless slab freelist+counters update operation using try_cmpxchg128/64 remains and is crucial for freeing remote NUMA objects and to allow flushing objects from sheaves to slabs mostly without the node list_lock. Link: https://lore.kernel.org/all/20260123-sheaves-for-all-v4-0-041323d506f7@suse.cz/
2026-02-06slub: let need_slab_obj_exts() return false if SLAB_NO_OBJ_EXT is setHao Li1-0/+3
SLAB_NO_OBJ_EXT is set for boot caches, but need_slab_obj_exts() doesn't check this flag. We should return false unconditionally when SLAB_NO_OBJ_EXT is set. Signed-off-by: Hao Li <hao.li@linux.dev> Acked-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260205120709.425719-1-hao.li@linux.dev Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04mm/slab: only allow SLAB_OBJ_EXT_IN_OBJ for unmergeable cachesHarry Yoo3-3/+4
While SLAB_OBJ_EXT_IN_OBJ allows to reduce memory overhead to account slab objects, it prevents slab merging because merging can change the metadata layout. As pointed out Vlastimil Babka, disabling merging solely for this memory optimization may not be a net win, because disabling slab merging tends to increase overall memory usage. Restrict SLAB_OBJ_EXT_IN_OBJ to caches that are already unmergeable for other reasons (e.g., those with constructors or SLAB_TYPESAFE_BY_RCU). Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260127103151.21883-3-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04mm/slab: place slabobj_ext metadata in unused space within s->sizeHarry Yoo3-11/+101
When a cache has high s->align value and s->object_size is not aligned to it, each object ends up with some unused space because of alignment. If this wasted space is big enough, we can use it to store the slabobj_ext metadata instead of wasting it. On my system, this happens with caches like kmem_cache, mm_struct, pid, task_struct, sighand_cache, xfs_inode, and others. To place the slabobj_ext metadata within each object, the existing slab_obj_ext() logic can still be used by setting: - slab->obj_exts = slab_address(slab) + (slabobj_ext offset) - stride = s->size slab_obj_ext() doesn't need know where the metadata is stored, so this method works without adding extra overhead to slab_obj_ext(). A good example benefiting from this optimization is xfs_inode (object_size: 992, align: 64). To measure memory savings, 2 millions of files were created on XFS. [ MEMCG=y, MEM_ALLOC_PROFILING=n ] Before patch (creating ~2.64M directories on xfs): Slab: 5175976 kB SReclaimable: 3837524 kB SUnreclaim: 1338452 kB After patch (creating ~2.64M directories on xfs): Slab: 5152912 kB SReclaimable: 3838568 kB SUnreclaim: 1314344 kB (-23.54 MiB) Enjoy the memory savings! Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-10-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04mm/slab: move [__]ksize and slab_ksize() to mm/slub.cHarry Yoo4-89/+86
To access SLUB's internal implementation details beyond cache flags in ksize(), move __ksize(), ksize(), and slab_ksize() to mm/slub.c. [vbabka@suse.cz: also make __ksize() static and move its kerneldoc to ksize() ] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-9-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04mm/slab: save memory by allocating slabobj_ext array from leftoverHarry Yoo1-5/+150
The leftover space in a slab is always smaller than s->size, and kmem caches for large objects that are not power-of-two sizes tend to have a greater amount of leftover space per slab. In some cases, the leftover space is larger than the size of the slabobj_ext array for the slab. An excellent example of such a cache is ext4_inode_cache. On my system, the object size is 1136, with a preferred order of 3, 28 objects per slab, and 960 bytes of leftover space per slab. Since the size of the slabobj_ext array is only 224 bytes (w/o mem profiling) or 448 bytes (w/ mem profiling) per slab, the entire array fits within the leftover space. Allocate the slabobj_exts array from this unused space instead of using kcalloc() when it is large enough. The array is allocated from unused space only when creating new slabs, and it doesn't try to utilize unused space if alloc_slab_obj_exts() is called after slab creation because implementing lazy allocation involves more expensive synchronization. The implementation and evaluation of lazy allocation from unused space is left as future-work. As pointed by Vlastimil Babka [1], it could be beneficial when a slab cache without SLAB_ACCOUNT can be created, and some of the allocations from the cache use __GFP_ACCOUNT. For example, xarray does that. To avoid unnecessary overhead when MEMCG (with SLAB_ACCOUNT) and MEM_ALLOC_PROFILING are not used for the cache, allocate the slabobj_ext array only when either of them is enabled on slab allocation. [ MEMCG=y, MEM_ALLOC_PROFILING=n ] Before patch (creating ~2.64M directories on ext4): Slab: 4747880 kB SReclaimable: 4169652 kB SUnreclaim: 578228 kB After patch (creating ~2.64M directories on ext4): Slab: 4724020 kB SReclaimable: 4169188 kB SUnreclaim: 554832 kB (-22.84 MiB) Enjoy the memory savings! Link: https://lore.kernel.org/linux-mm/48029aab-20ea-4d90-bfd1-255592b2018e@suse.cz [1] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-8-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04mm/memcontrol,alloc_tag: handle slabobj_ext access under KASAN poisonHarry Yoo3-40/+95
In the near future, slabobj_ext may reside outside the allocated slab object range within a slab, which could be reported as an out-of-bounds access by KASAN. As suggested by Andrey Konovalov [1], explicitly disable KASAN and KMSAN checks when accessing slabobj_ext within slab allocator, memory profiling, and memory cgroup code. While an alternative approach could be to unpoison slabobj_ext, out-of-bounds accesses outside the slab allocator are generally more common. Move metadata_access_enable()/disable() helpers to mm/slab.h so that it can be used outside mm/slub.c. However, as suggested by Suren Baghdasaryan [2], instead of calling them directly from mm code (which is more prone to errors), change users to access slabobj_ext via get/put APIs: - Users should call get_slab_obj_exts() to access slabobj_metadata and call put_slab_obj_exts() when it's done. - From now on, accessing it outside the section covered by get_slab_obj_exts() ~ put_slab_obj_exts() is illegal. This ensures that accesses to slabobj_ext metadata won't be reported as access violations. Call kasan_reset_tag() in slab_obj_ext() before returning the address to prevent SW or HW tag-based KASAN from reporting false positives. Suggested-by: Andrey Konovalov <andreyknvl@gmail.com> Suggested-by: Suren Baghdasaryan <surenb@google.com> Link: https://lore.kernel.org/linux-mm/CA+fCnZezoWn40BaS3cgmCeLwjT+5AndzcQLc=wH3BjMCu6_YCw@mail.gmail.com [1] Link: https://lore.kernel.org/linux-mm/CAJuCfpG=Lb4WhYuPkSpdNO4Ehtjm1YcEEK0OM=3g9i=LxmpHSQ@mail.gmail.com [2] Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-7-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04mm/slab: use stride to access slabobj_extHarry Yoo2-4/+35
Use a configurable stride value when accessing slab object extension metadata instead of assuming a fixed sizeof(struct slabobj_ext). Store stride value in free bits of slab->counters field. This allows for flexibility in cases where the extension is embedded within slab objects. Since these free bits exist only on 64-bit, any future optimizations that need to change stride value cannot be enabled on 32-bit architectures. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-6-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04mm/slab: abstract slabobj_ext access via new slab_obj_ext() helperHarry Yoo3-32/+79
Currently, the slab allocator assumes that slab->obj_exts is a pointer to an array of struct slabobj_ext objects. However, to support storage methods where struct slabobj_ext is embedded within objects, the slab allocator should not make this assumption. Instead of directly dereferencing the slabobj_exts array, abstract access to struct slabobj_ext via helper functions. Introduce a new API slabobj_ext metadata access: slab_obj_ext(slab, obj_exts, index) - returns the pointer to struct slabobj_ext element at the given index. Directly dereferencing the return value of slab_obj_exts() is no longer allowed. Instead, slab_obj_ext() must always be used to access individual struct slabobj_ext objects. Convert all users to use these APIs. No functional changes intended. Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-5-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04ext4: specify the free pointer offset for ext4_inode_cacheHarry Yoo1-6/+13
Convert ext4_inode_cache to use the kmem_cache_args interface and specify a free pointer offset. Since ext4_inode_cache uses a constructor, the free pointer would be placed after the object to prevent overwriting fields used by the constructor. However, some fields such as ->i_flags are not used by the constructor and can safely be repurposed for the free pointer. Specify the free pointer offset at i_flags to reduce the object size. Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-4-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04mm/slab: allow specifying free pointer offset when using constructorHarry Yoo3-17/+21
When a slab cache has a constructor, the free pointer is placed after the object because certain fields must not be overwritten even after the object is freed. However, some fields that the constructor does not initialize can safely be overwritten after free. Allow specifying the free pointer offset within the object, reducing the overall object size when some fields can be reused for the free pointer. Adjust the document accordingly. Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113061845.159790-3-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04mm/slab: use unsigned long for orig_size to ensure proper metadata alignHarry Yoo1-7/+7
When both KASAN and SLAB_STORE_USER are enabled, accesses to struct kasan_alloc_meta fields can be misaligned on 64-bit architectures. This occurs because orig_size is currently defined as unsigned int, which only guarantees 4-byte alignment. When struct kasan_alloc_meta is placed after orig_size, it may end up at a 4-byte boundary rather than the required 8-byte boundary on 64-bit systems. Note that 64-bit architectures without HAVE_EFFICIENT_UNALIGNED_ACCESS are assumed to require 64-bit accesses to be 64-bit aligned. See HAVE_64BIT_ALIGNED_ACCESS and commit adab66b71abf ("Revert: "ring-buffer: Remove HAVE_64BIT_ALIGNED_ACCESS"") for more details. Change orig_size from unsigned int to unsigned long to ensure proper alignment for any subsequent metadata. This should not waste additional memory because kmalloc objects are already aligned to at least ARCH_KMALLOC_MINALIGN. Closes: https://lore.kernel.org/all/aPrLF0OUK651M4dk@hyeyoo Suggested-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: stable@vger.kernel.org Fixes: 6edf2576a6cc ("mm/slub: enable debugging memory wasting of kmalloc") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Closes: https://lore.kernel.org/all/aPrLF0OUK651M4dk@hyeyoo/ Link: https://patch.msgid.link/20260113061845.159790-2-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04slub: clarify object field layout commentsHao Li1-33/+49
The comments above check_pad_bytes() document the field layout of a single object. Rewrite them to improve clarity and precision. Also update an outdated comment in calculate_sizes(). Suggested-by: Harry Yoo <harry.yoo@oracle.com> Acked-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Hao Li <hao.li@linux.dev> Link: https://patch.msgid.link/20251229122415.192377-1-hao.li@linux.dev Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-02-04mm/slab: avoid allocating slabobj_ext array from its own slabHarry Yoo1-7/+53
When allocating slabobj_ext array in alloc_slab_obj_exts(), the array can be allocated from the same slab we're allocating the array for. This led to obj_exts_in_slab() incorrectly returning true [1], although the array is not allocated from wasted space of the slab. Vlastimil Babka observed that this problem should be fixed even when ignoring its incompatibility with obj_exts_in_slab(), because it creates slabs that are never freed as there is always at least one allocated object. To avoid this, use the next kmalloc size or large kmalloc when the array can be allocated from the same cache we're allocating the array for. In case of random kmalloc caches, there are multiple kmalloc caches for the same size and the cache is selected based on the caller address. Because it is fragile to ensure the same caller address is passed to kmalloc_slab(), kmalloc_noprof(), and kmalloc_node_noprof(), bump the size to (s->object_size + 1) when the sizes are equal, instead of directly comparing the kmem_cache pointers. Note that this doesn't happen when memory allocation profiling is disabled, as when the allocation of the array is triggered by memory cgroup (KMALLOC_CGROUP), the array is allocated from KMALLOC_NORMAL. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202601231457.f7b31e09-lkp@intel.com [1] Cc: stable@vger.kernel.org Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths") Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260126125714.88008-1-harry.yoo@oracle.com Reviewed-by: Hao Li <hao.li@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slub: avoid list_lock contention from __refill_objects_any()Vlastimil Babka1-6/+13
Kernel test robot has reported a regression in the patch "slab: refill sheaves from all nodes". When taken in isolation like this, there is indeed a tradeoff - we prefer to use remote objects prior to allocating new local slabs. It is replicating a behavior that existed before sheaves for replenishing cpu (partial) slabs - now called get_from_any_partial() to allocate a single object. So the possibility of allocating remote objects is intended even if remote accesses are then slower. But the profiles in the report also suggested a contention on the list_lock spinlock. And that's something we can try to avoid without much tradeoff - if someone else has the spin_lock, it's more likely they are allocating from the node than freeing to it, so we can skip it even if it means allocating a new local slab - contributing to that lock's contention isn't worth it. It should not result in partial slabs accumulating on the remote node. Thus add an allow_spin parameter to __refill_objects_node() and get_partial_node_bulk() to make the attempts from __refill_objects_any() use only a trylock. Reported-by: kernel test robot <oliver.sang@intel.com> Link: https://lore.kernel.org/oe-lkp/202601132136.77efd6d7-lkp@intel.com Link: https://patch.msgid.link/20260129-b4-refill_any_trylock-v1-1-de7420b25840@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29mm/slub: cleanup and repurpose some stat itemsVlastimil Babka1-57/+26
A number of stat items related to cpu slabs became unused, remove them. Two of those were ALLOC_FASTPATH and FREE_FASTPATH. But instead of removing those, use them instead of ALLOC_PCS and FREE_PCS, since sheaves are the new (and only) fastpaths, Remove the recently added _PCS variants instead. Change where FREE_SLOWPATH is counted so that it only counts freeing of objects by slab users that (for whatever reason) do not go to a percpu sheaf, and not all (including internal) callers of __slab_free(). Thus sheaf flushing (already counted by SHEAF_FLUSH) does not affect FREE_SLOWPATH anymore. This matches how ALLOC_SLOWPATH doesn't count sheaf refills (counted by SHEAF_REFILL). Reviewed-by: Hao Li <hao.li@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29mm/slub: remove DEACTIVATE_TO_* stat itemsVlastimil Babka1-16/+15
The cpu slabs and their deactivations were removed, so remove the unused stat items. Weirdly enough the values were also used to control __add_partial() adding to head or tail of the list, so replace that with a new enum add_mode, which is cleaner. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: remove frozen slab checks from __slab_free()Vlastimil Babka1-18/+4
Currently slabs are only frozen after consistency checks failed. This can happen only in caches with debugging enabled, and those use free_to_partial_list() for freeing. The non-debug operation of __slab_free() can thus stop considering the frozen field, and we can remove the FREE_FROZEN stat. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: update overview commentsVlastimil Babka1-74/+67
The changes related to sheaves made the description of locking and other details outdated. Update it to reflect current state. Also add a new copyright line due to major changes. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: refill sheaves from all nodesVlastimil Babka1-31/+106
__refill_objects() currently only attempts to get partial slabs from the local node and then allocates new slab(s). Expand it to trying also other nodes while observing the remote node defrag ratio, similarly to get_any_partial(). This will prevent allocating new slabs on a node while other nodes have many free slabs. It does mean sheaves will contain non-local objects in that case. Allocations that care about specific node will still be served appropriately, but might get a slowpath allocation. Like get_any_partial() we do observe cpuset_zone_allowed(), although we might be refilling a sheaf that will be then used from a different allocation context. We can also use the resulting refill_objects() in __kmem_cache_alloc_bulk() for non-debug caches. This means kmem_cache_alloc_bulk() will get better performance when sheaves are exhausted. kmem_cache_alloc_bulk() cannot indicate a preferred node so it's compatible with sheaves refill in preferring the local node. Its users also have gfp flags that allow spinning, so document that as a requirement. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Hao Li <hao.li@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: remove unused PREEMPT_RT specific macrosVlastimil Babka1-23/+1
The macros slub_get_cpu_ptr()/slub_put_cpu_ptr() are now unused, remove them. USE_LOCKLESS_FAST_PATH() has lost its true meaning with the code being removed. The only remaining usage is in fact testing whether we can assert irqs disabled, because spin_lock_irqsave() only does that on !RT. Test for CONFIG_PREEMPT_RT instead. Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: remove struct kmem_cache_cpuVlastimil Babka2-284/+27
The cpu slab is not used anymore for allocation or freeing, the remaining code is for flushing, but it's effectively dead. Remove the whole struct kmem_cache_cpu, the flushing code and other orphaned functions. The remaining used field of kmem_cache_cpu is the stat array with CONFIG_SLUB_STATS. Put it instead in a new struct kmem_cache_stats. In struct kmem_cache, the field is cpu_stats and placed near the end of the struct. Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: simplify kmalloc_nolock()Vlastimil Babka2-116/+29
The kmalloc_nolock() implementation has several complications and restrictions due to SLUB's cpu slab locking, lockless fastpath and PREEMPT_RT differences. With cpu slab usage removed, we can simplify things: - relax the PREEMPT_RT context checks as they were before commit 99a3e3a1cfc9 ("slab: fix kmalloc_nolock() context check for PREEMPT_RT") and also reference the explanation comment in the page allocator - the local_lock_cpu_slab() macros became unused, remove them - we no longer need to set up lockdep classes on PREEMPT_RT - we no longer need to annotate ___slab_alloc as NOKPROBE_SYMBOL since there's no lockless cpu freelist manipulation anymore - __slab_alloc_node() can be called from kmalloc_nolock_noprof() unconditionally. It can also no longer return EBUSY. But trylock failures can still happen so retry with the larger bucket if the allocation fails for any reason. Note that we still need __CMPXCHG_DOUBLE, because while it was removed we don't use cmpxchg16b on cpu freelist anymore, we still use it on slab freelist, and the alternative is slab_lock() which can be interrupted by a nmi. Clarify the comment to mention it specifically. Acked-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: remove defer_deactivate_slab()Vlastimil Babka4-44/+28
There are no more cpu slabs so we don't need their deferred deactivation. The function is now only used from places where we allocate a new slab but then can't spin on node list_lock to put it on the partial list. Instead of the deferred action we can free it directly via __free_slab(), we just need to tell it to use _nolock() freeing of the underlying pages and take care of the accounting. Since free_frozen_pages_nolock() variant does not yet exist for code outside of the page allocator, create it as a trivial wrapper for __free_frozen_pages(..., FPI_TRYLOCK). Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: remove the do_slab_free() fastpathVlastimil Babka1-136/+13
We have removed cpu slab usage from allocation paths. Now remove do_slab_free() which was freeing objects to the cpu slab when the object belonged to it. Instead call __slab_free() directly, which was previously the fallback. This simplifies kfree_nolock() - when freeing to percpu sheaf fails, we can call defer_free() directly. Also remove functions that became unused. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: remove SLUB_CPU_PARTIALVlastimil Babka3-342/+19
We have removed the partial slab usage from allocation paths. Now remove the whole config option and associated code. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: remove cpu (partial) slabs usage from allocation pathsVlastimil Babka1-541/+87
We now rely on sheaves as the percpu caching layer and can refill them directly from partial or newly allocated slabs. Start removing the cpu (partial) slabs code, first from allocation paths. This means that any allocation not satisfied from percpu sheaves will end up in ___slab_alloc(), where we remove the usage of cpu (partial) slabs, so it will only perform get_partial() or new_slab(). In the latter case we reuse alloc_from_new_slab() (when we don't use the debug/tiny alloc_single_from_new_slab() variant). In get_partial_node() we used to return a slab for freezing as the cpu slab and to refill the partial slab. Now we only want to return a single object and leave the slab on the list (unless it became full). We can't simply reuse alloc_single_from_partial() as that assumes freeing uses free_to_partial_list(). Instead we need to use __slab_update_freelist() to work properly against a racing __slab_free(). To reflect the new purpose of get_partial() functions, rename them to get_from_partial(), get_from_partial_node(), and get_from_any_partial(). The rest of the changes is removing functions that no longer have any callers. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: add optimized sheaf refill from partial listVlastimil Babka1-21/+272
At this point we have sheaves enabled for all caches, but their refill is done via __kmem_cache_alloc_bulk() which relies on cpu (partial) slabs - now a redundant caching layer that we are about to remove. The refill will thus be done from slabs on the node partial list. Introduce new functions that can do that in an optimized way as it's easier than modifying the __kmem_cache_alloc_bulk() call chain. Introduce struct partial_bulk_context, a variant of struct partial_context that can return a list of slabs from the partial list with the sum of free objects in them within the requested min and max. Introduce get_partial_node_bulk() that removes the slabs from freelist and returns them in the list. There is a racy read of slab->counters so make sure the non-atomic write in __update_freelist_slow() is not tearing. Introduce get_freelist_nofreeze() which grabs the freelist without freezing the slab. Introduce alloc_from_new_slab() which can allocate multiple objects from a newly allocated slab where we don't need to synchronize with freeing. In some aspects it's similar to alloc_single_from_new_slab() but assumes the cache is a non-debug one so it can avoid some actions. It supports the allow_spin parameter, which we always set true here, but the followup change will reuse the function in a context where it may be false. Introduce __refill_objects() that uses the functions above to fill an array of objects. It has to handle the possibility that the slabs will contain more objects that were requested, due to concurrent freeing of objects to those slabs. When no more slabs on partial lists are available, it will allocate new slabs. It is intended to be only used in context where spinning is allowed, so add a WARN_ON_ONCE check there. Finally, switch refill_sheaf() to use __refill_objects(). Sheaves are only refilled from contexts that allow spinning, or even blocking. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: handle kmalloc sheaves bootstrapVlastimil Babka1-4/+84
Enable sheaves for kmalloc caches. For other types than KMALLOC_NORMAL, we can simply allow them in calculate_sizes() as they are created later than KMALLOC_NORMAL caches and can allocate sheaves and barns from those. For KMALLOC_NORMAL caches we perform additional step after first creating them without sheaves. Then bootstrap_cache_sheaves() simply allocates and initializes barns and sheaves and finally sets s->sheaf_capacity to make them actually used. Afterwards the only caches left without sheaves (unless SLUB_TINY or debugging is enabled) are kmem_cache and kmem_cache_node. These are only used when creating or destroying other kmem_caches. Thus they are not performance critical and we can simply leave it that way. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: make percpu sheaves compatible with kmalloc_nolock()/kfree_nolock()Vlastimil Babka1-22/+60
Before we enable percpu sheaves for kmalloc caches, we need to make sure kmalloc_nolock() and kfree_nolock() will continue working properly and not spin when not allowed to. Percpu sheaves themselves use local_trylock() so they are already compatible. We just need to be careful with the barn->lock spin_lock. Pass a new allow_spin parameter where necessary to use spin_trylock_irqsave(). In kmalloc_nolock_noprof() we can now attempt alloc_from_pcs() safely, for now it will always fail until we enable sheaves for kmalloc caches next. Similarly in kfree_nolock() we can attempt free_to_pcs(). Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: introduce percpu sheaves bootstrapVlastimil Babka3-47/+97
Until now, kmem_cache->cpu_sheaves was !NULL only for caches with sheaves enabled. Since we want to enable them for almost all caches, it's suboptimal to test the pointer in the fast paths, so instead allocate it for all caches in do_kmem_cache_create(). Instead of testing the cpu_sheaves pointer to recognize caches (yet) without sheaves, test kmem_cache->sheaf_capacity for being 0, where needed, using a new cache_has_sheaves() helper. However, for the fast paths sake we also assume that the main sheaf always exists (pcs->main is !NULL), and during bootstrap we cannot allocate sheaves yet. Solve this by introducing a single static bootstrap_sheaf that's assigned as pcs->main during bootstrap. It has a size of 0, so during allocations, the fast path will find it's empty. Since the size of 0 matches sheaf_capacity of 0, the freeing fast paths will find it's "full". In the slow path handlers, we use cache_has_sheaves() to recognize that the cache doesn't (yet) have real sheaves, and fall back. Thus sharing the single bootstrap sheaf like this for multiple caches and cpus is safe. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-29slab: add sheaves to most cachesVlastimil Babka2-10/+52
In the first step to replace cpu (partial) slabs with sheaves, enable sheaves for almost all caches. Treat args->sheaf_capacity as a minimum, and calculate sheaf capacity with a formula that roughly follows the formula for number of objects in cpu partial slabs in set_cpu_partial(). This should achieve roughly similar contention on the barn spin lock as there's currently for node list_lock without sheaves, to make benchmarking results comparable. It can be further tuned later. Don't enable sheaves for bootstrap caches as that wouldn't work. In order to recognize them by SLAB_NO_OBJ_EXT, make sure the flag exists even for !CONFIG_SLAB_OBJ_EXT. This limitation will be lifted for kmalloc caches after the necessary bootstrapping changes. Also do not enable sheaves for SLAB_NOLEAKTRACE caches to avoid recursion with kmemleak tracking (thanks to Breno Leitao). Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Tested-by: Breno Leitao <leitao@debian.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Tested-by: Zhao Liu <zhao1.liu@intel.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-27slub: keep empty main sheaf as spare in __pcs_replace_empty_main()Hao Li1-1/+4
When __pcs_replace_empty_main() fails to obtain a full sheaf directly from the barn, it may either: - Refill an empty sheaf obtained via barn_get_empty_sheaf(), or - Allocate a brand new full sheaf via alloc_full_sheaf(). After reacquiring the per-CPU lock, if pcs->main is still empty and pcs->spare is NULL, the current code donates the empty main sheaf to the barn via barn_put_empty_sheaf() and installs the full sheaf as pcs->main, leaving pcs->spare unpopulated. Instead, keep the existing empty main sheaf locally as the spare: pcs->spare = pcs->main; pcs->main = full; This populates pcs->spare earlier, which can reduce future barn traffic. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Hao Li <haolee.swjtu@gmail.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Zhao Liu <zhao1.liu@intel.com>
2026-01-27mm/slab: factor out slab_args_unmergeable()Harry Yoo1-10/+18
slab_mergeable() determines whether a slab cache can be merged, but it should not be used when the cache is not fully created yet. Extract the pre-cache-creation mergeability checks into slab_args_unmergeable(), which evaluates kmem_cache_args, slab flags, and slab_nomerge to determine if a cache will be mergeable before it is created. Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260127103151.21883-2-harry.yoo@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-27mm/slab: make caches with sheaves mergeableVlastimil Babka1-6/+7
Before enabling sheaves for all caches (with automatically determined capacity), their enablement should no longer prevent merging of caches. Limit this merge prevention only to caches that were created with a specific sheaf capacity, by adding the SLAB_NO_MERGE flag to them. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-27mm/slab: move and refactor __kmem_cache_alias()Vlastimil Babka3-41/+41
Move __kmem_cache_alias() to slab_common.c since it's called by __kmem_cache_create_args() and calls find_mergeable() that both are in this file. We can remove two slab.h declarations and make them static. Instead declare sysfs_slab_alias() from slub.c so that __kmem_cache_alias() can keep calling it. Add args parameter to __kmem_cache_alias() and find_mergeable() instead of align and ctor. With that we can also move the checks for usersize and sheaf_capacity there from __kmem_cache_create_args() and make the result more symmetric with slab_unmergeable(). No functional changes intended. Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-27slab: add SLAB_CONSISTENCY_CHECKS to SLAB_NEVER_MERGEVlastimil Babka1-3/+2
All the debug flags prevent merging, except SLAB_CONSISTENCY_CHECKS. This is suboptimal because this flag (like any debug flags) prevents the usage of any fastpaths, and thus affect performance of any aliased cache. Also the objects from an aliased cache than the one specified for debugging could also interfere with the debugging efforts. Fix this by adding the whole SLAB_DEBUG_FLAGS collection to SLAB_NEVER_MERGE instead of individual debug flags, so it now also includes SLAB_CONSISTENCY_CHECKS. Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-27mm/slab: fix false lockdep warning in __kfree_rcu_sheaf()Harry Yoo1-0/+20
kvfree_call_rcu() can be called while holding a raw_spinlock_t. Since __kfree_rcu_sheaf() may acquire a spinlock_t (which becomes a sleeping lock on PREEMPT_RT) and violate lock nesting rules, kvfree_call_rcu() bypasses the sheaves layer entirely on PREEMPT_RT. However, lockdep still complains about acquiring spinlock_t while holding raw_spinlock_t, even on !PREEMPT_RT where spinlock_t is a spinning lock. This causes a false lockdep warning [1]: ============================= [ BUG: Invalid wait context ] 6.19.0-rc6-next-20260120 #21508 Not tainted ----------------------------- migration/1/23 is trying to lock: ffff8afd01054e98 (&barn->lock){..-.}-{3:3}, at: barn_get_empty_sheaf+0x1d/0xb0 other info that might help us debug this: context-{5:5} 3 locks held by migration/1/23: #0: ffff8afd01fd89a8 (&p->pi_lock){-.-.}-{2:2}, at: __balance_push_cpu_stop+0x3f/0x200 #1: ffffffff9f15c5c8 (rcu_read_lock){....}-{1:3}, at: cpuset_cpus_allowed_fallback+0x27/0x250 #2: ffff8afd1f470be0 ((local_lock_t *)&pcs->lock){+.+.}-{3:3}, at: __kfree_rcu_sheaf+0x52/0x3d0 stack backtrace: CPU: 1 UID: 0 PID: 23 Comm: migration/1 Not tainted 6.19.0-rc6-next-20260120 #21508 PREEMPTLAZY Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Stopper: __balance_push_cpu_stop+0x0/0x200 <- balance_push+0x118/0x170 Call Trace: <TASK> __dump_stack+0x22/0x30 dump_stack_lvl+0x60/0x80 dump_stack+0x19/0x24 __lock_acquire+0xd3a/0x28e0 ? __lock_acquire+0x5a9/0x28e0 ? __lock_acquire+0x5a9/0x28e0 ? barn_get_empty_sheaf+0x1d/0xb0 lock_acquire+0xc3/0x270 ? barn_get_empty_sheaf+0x1d/0xb0 ? __kfree_rcu_sheaf+0x52/0x3d0 _raw_spin_lock_irqsave+0x47/0x70 ? barn_get_empty_sheaf+0x1d/0xb0 barn_get_empty_sheaf+0x1d/0xb0 ? __kfree_rcu_sheaf+0x52/0x3d0 __kfree_rcu_sheaf+0x19f/0x3d0 kvfree_call_rcu+0xaf/0x390 set_cpus_allowed_force+0xc8/0xf0 [...] </TASK> This wasn't triggered until sheaves were enabled for all slab caches, since kfree_rcu() wasn't being called with a raw spinlock held for caches with sheaves (vma, maple node). As suggested by Vlastimil Babka, fix this by using a lockdep map with LD_WAIT_CONFIG wait type to tell lockdep that acquiring spinlock_t is valid in this case, as those spinlocks won't be used on PREEMPT_RT. Note that kfree_rcu_sheaf_map should be acquired using _try() variant, otherwise the acquisition of the lockdep map itself will trigger an invalid wait context warning. Reported-by: Paul E. McKenney <paulmck@kernel.org> Closes: https://lore.kernel.org/linux-mm/c858b9af-2510-448b-9ab3-058f7b80dd42@paulmck-laptop [1] Fixes: ec66e0d59952 ("slab: add sheaf support for batching kfree_rcu() operations") Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-27mm/slab: add rcu_barrier() to kvfree_rcu_barrier_on_cache()Vlastimil Babka1-1/+4
After we submit the rcu_free sheaves to call_rcu() we need to make sure the rcu callbacks complete. kvfree_rcu_barrier() does that via flush_all_rcu_sheaves() but kvfree_rcu_barrier_on_cache() doesn't. Fix that. This currently causes no issues because the caches with sheaves we have are never destroyed. The problem flagged by kernel test robot was reported for a patch that enables sheaves for (almost) all caches, and occurred only with CONFIG_KASAN. Harry Yoo found the root cause [1]: It turns out the object freed by sheaf_flush_unused() was in KASAN percpu quarantine list (confirmed by dumping the list) by the time __kmem_cache_shutdown() returns an error. Quarantined objects are supposed to be flushed by kasan_cache_shutdown(), but things go wrong if the rcu callback (rcu_free_sheaf_nobarn()) is processed after kasan_cache_shutdown() finishes. That's why rcu_barrier() in __kmem_cache_shutdown() didn't help, because it's called after kasan_cache_shutdown(). Calling rcu_barrier() in kvfree_rcu_barrier_on_cache() guarantees that it'll be added to the quarantine list before kasan_cache_shutdown() is called. So it's a valid fix! [1] https://lore.kernel.org/all/aWd6f3jERlrB5yeF@hyeyoo/ Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202601121442.c530bed3-lkp@intel.com Fixes: 0f35040de593 ("mm/slab: introduce kvfree_rcu_barrier_on_cache() for cache destruction") Cc: stable@vger.kernel.org Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Tested-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-27slab: replace cache_from_obj() with inline checksVlastimil Babka1-23/+33
Eric Dumazet has noticed cache_from_obj() is not inlined with clang and suggested splitting it into two functions, where the smaller inlined one assumes the fastpath is !CONFIG_SLAB_FREELIST_HARDENED. However most distros enable it these days and so this would likely add a function call to the object free fastpaths. Instead take a step back and consider that cache_from_obj() is a relict from when memcgs created their separate kmem_cache copies, as the outdated comment in build_detached_freelist() reminds us. Meanwhile hardening/debugging had reused cache_from_obj() to validate that the freed object really belongs to a slab from the cache we think we are freeing from. In build_detached_freelist() simply remove this, because it did not handle the NULL result from cache_from_obj() failure properly, nor validate objects (for the NULL slab->slab_cache pointer) when called via kfree_bulk(). If anyone is motivated to implement it properly, it should be possible in a similar way to kmem_cache_free(). In kmem_cache_free(), do the hardening/debugging checks directly so they are inlined by definition and virt_to_slab(obj) is performed just once. In case they failed, call a newly introduced warn_free_bad_obj() that performs the warnings outside of the fastpath, and leak the object. As an intentional change, leak the object when slab->slab_cache differs from the cache given to kmem_cache_free(). Previously we would only leak when the object is not in a valid slab page or the slab->slab_cache pointer is NULL, and otherwise trust the slab->slab_cache over the kmem_cache_free() argument. But if those differ, it means something went wrong enough that it's best not to continue freeing. As a result the fastpath should be inlined in all configs and the warnings are moved away. Reported-by: Eric Dumazet <edumazet@google.com> Closes: https://lore.kernel.org/all/20260115130642.3419324-1-edumazet@google.com/ Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Hao Li <hao.li@linux.dev> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-21slab: fix kmalloc_nolock() context check for PREEMPT_RTSwaraj Gaikwad1-2/+6
On PREEMPT_RT kernels, local_lock becomes a sleeping lock. The current check in kmalloc_nolock() only verifies we're not in NMI or hard IRQ context, but misses the case where preemption is disabled. When a BPF program runs from a tracepoint with preemption disabled (preempt_count > 0), kmalloc_nolock() proceeds to call local_lock_irqsave() which attempts to acquire a sleeping lock, triggering: BUG: sleeping function called from invalid context in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 6128 preempt_count: 2, expected: 0 Fix this by checking !preemptible() on PREEMPT_RT, which directly expresses the constraint that we cannot take a sleeping lock when preemption is disabled. This encompasses the previous checks for NMI and hard IRQ contexts while also catching cases where preemption is disabled. Fixes: af92793e52c3 ("slab: Introduce kmalloc_nolock() and kfree_nolock().") Reported-by: syzbot+b1546ad4a95331b2101e@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b1546ad4a95331b2101e Signed-off-by: Swaraj Gaikwad <swarajgaikwad1925@gmail.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Harry Yoo <harry.yoo@oracle.com> Link: https://patch.msgid.link/20260113150639.48407-1-swarajgaikwad1925@gmail.co Cc: <stable@vger.kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2026-01-12Linux 6.19-rc5v6.19-rc5Linus Torvalds1-1/+1
2026-01-12Merge tag 'libcrypto-for-linus' of ↵Linus Torvalds4-4/+5
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull crypto library fixes from Eric Biggers: - A couple more fixes for the lib/crypto KUnit tests - Fix missing MMU protection for the AES S-box * tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: lib/crypto: aes: Fix missing MMU protection for AES S-box MAINTAINERS: add test vector generation scripts to "CRYPTO LIBRARY" lib/crypto: tests: Fix syntax error for old python versions lib/crypto: tests: polyval_kunit: Increase iterations for preparekey in IRQs
2026-01-11Merge tag 'char-misc-6.19-rc5' of ↵Linus Torvalds5-11/+19
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are some small char/misc driver fixes for some reported issues. Included in here is: - much reported rust_binder fix - counter driver fixes - new device ids for the mei driver All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-6.19-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: rust_binder: remove spin_lock() in rust_shrink_free_page() mei: me: add nova lake point S DID counter: 104-quad-8: Fix incorrect return value in IRQ handler counter: interrupt-cnt: Drop IRQF_NO_THREAD flag
2026-01-11Merge tag 'x86-urgent-2026-01-11' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Ingo Molnar: "Disable GCOV instrumentation in the SEV noinstr.c collection of SEV noinstr methods, to further robustify the code" * tag 'x86-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sev: Disable GCOV on noinstr object
2026-01-11Merge tag 'sched-urgent-2026-01-11' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Fix a crash in sched_mm_cid_after_execve()" * tag 'sched-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/mm_cid: Prevent NULL mm dereference in sched_mm_cid_after_execve()
2026-01-11Merge tag 'perf-urgent-2026-01-11' of ↵Linus Torvalds1-0/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf event fix from Ingo Molnar: "Fix perf swevent hrtimer deinit regression" * tag 'perf-urgent-2026-01-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Ensure swevent hrtimer is properly destroyed