path: root/mm/slub.c
2022-09-23Merge branch 'slab/for-6.1/slub_validation_locking' into slab/for-nextVlastimil Babka1-198/+276
My series [1] to fix validation races for caches with debugging enabled. By decoupling the debug cache operations further from the non-debug fastpaths, additional locking simplifications were possible and done afterwards. Additional cleanup of PREEMPT_RT-specific code on top, by Thomas Gleixner. [1] https://lore.kernel.org/all/20220823170400.26546-1-vbabka@suse.cz/
2022-09-23Merge branch 'slab/for-6.1/common_kmalloc' into slab/for-nextVlastimil Babka1-225/+13
The "common kmalloc v4" series [1] by Hyeonggon Yoo. - Improves the mm/slab_common.c wrappers to allow deleting duplicated code between SLAB and SLUB. - Large kmalloc() allocations in SLAB are passed to page allocator like in SLUB, reducing number of kmalloc caches. - Removes the {kmem_cache_alloc,kmalloc}_node variants of tracepoints, node id parameter added to non-_node variants. - 8 files changed, 341 insertions(+), 651 deletions(-) [1] https://lore.kernel.org/all/20220817101826.236819-1-42.hyeyoo@gmail.com/ -- Merge resolves trivial conflict in mm/slub.c with commit 5373b8a09d6e ("kasan: call kasan_malloc() from __kmalloc_*track_caller()")
2022-09-23Merge branch 'slab/for-6.1/trivial' into slab/for-nextVlastimil Babka1-7/+2
Trivial fixes and cleanups: - unneeded variable removals, by ye xingchen
2022-09-22mm: slub: fix flush_cpu_slab()/__free_slab() invocations in task context.Maurizio Lombardi1-1/+8
Commit 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context") moved all flush_cpu_slab() invocations to the global workqueue to avoid a problem related to deactivate_slab()/__free_slab() being called from an IRQ context on PREEMPT_RT kernels. When the flush_all_cpus_locked() function is called from a task context, it may happen that a workqueue with the WQ_MEM_RECLAIM bit set ends up flushing the global workqueue, which causes a dependency warning: workqueue: WQ_MEM_RECLAIM nvme-delete-wq:nvme_delete_ctrl_work [nvme_core] is flushing !WQ_MEM_RECLAIM events:flush_cpu_slab WARNING: CPU: 37 PID: 410 at kernel/workqueue.c:2637 check_flush_dependency+0x10a/0x120 Workqueue: nvme-delete-wq nvme_delete_ctrl_work [nvme_core] RIP: 0010:check_flush_dependency+0x10a/0x120 Call Trace: __flush_work.isra.0+0xbf/0x220 ? __queue_work+0x1dc/0x420 flush_all_cpus_locked+0xfb/0x120 __kmem_cache_shutdown+0x2b/0x320 kmem_cache_destroy+0x49/0x100 bioset_exit+0x143/0x190 blk_release_queue+0xb9/0x100 kobject_cleanup+0x37/0x130 nvme_fc_ctrl_free+0xc6/0x150 [nvme_fc] nvme_free_ctrl+0x1ac/0x2b0 [nvme_core] Fix this bug by creating a workqueue for the flush operation with the WQ_MEM_RECLAIM bit set. Fixes: 5a836bf6b09f ("mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context") Cc: <stable@vger.kernel.org> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
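A minimal sketch of the shape of such a fix, not the actual patch: a dedicated workqueue with WQ_MEM_RECLAIM carries the slab flush work, so WQ_MEM_RECLAIM workqueues that flush it no longer trip check_flush_dependency(). The workqueue name, the init helper, and the queueing wrapper below are assumptions for illustration.

```c
#include <linux/workqueue.h>
#include <linux/errno.h>
#include <linux/init.h>

static struct workqueue_struct *flushwq;

/* Would be called from SLUB's late init; name and placement are assumed. */
static int __init slub_flushwq_init(void)
{
	flushwq = alloc_workqueue("slub_flushwq", WQ_MEM_RECLAIM, 0);
	return flushwq ? 0 : -ENOMEM;
}

static void queue_flush_work(int cpu, struct work_struct *work)
{
	/* was: schedule_work_on(cpu, work), i.e. the !WQ_MEM_RECLAIM system_wq */
	queue_work_on(cpu, flushwq, work);
}
```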
2022-09-17slub: Make PREEMPT_RT support less convolutedThomas Gleixner1-32/+24
The slub code already has a few helpers depending on PREEMPT_RT. Add a few more and get rid of the CONFIG_PREEMPT_RT conditionals all over the place. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: linux-mm@kvack.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-17mm/slub: simplify __cmpxchg_double_slab() and slab_[un]lock()Vlastimil Babka1-27/+12
The PREEMPT_RT specific disabling of irqs in __cmpxchg_double_slab() (through slab_[un]lock()) is unnecessary as bit_spin_lock() disables preemption and that's sufficient on PREEMPT_RT where no allocation/free operation is performed in hardirq context and so can't interrupt the current operation. That means we no longer need the slab_[un]lock() wrappers, so delete them and rename the current __slab_[un]lock() to slab_[un]lock(). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
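A sketch of the resulting helpers as they would sit inside mm/slub.c (struct slab and slab_page() are SLUB-internal); bit_spin_lock() disables preemption itself, which is why no extra irq handling is needed even on PREEMPT_RT:

```c
#include <linux/bit_spinlock.h>
#include <linux/page-flags.h>

/* Sketch only: slab_page() is SLUB's internal struct slab -> struct page accessor. */
static __always_inline void slab_lock(struct slab *slab)
{
	bit_spin_lock(PG_locked, &slab_page(slab)->flags);
}

static __always_inline void slab_unlock(struct slab *slab)
{
	bit_spin_unlock(PG_locked, &slab_page(slab)->flags);
}
```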
2022-09-17mm/slub: convert object_map_lock to non-raw spinlockVlastimil Babka1-30/+6
The only remaining user of object_map_lock is list_slab_objects(). Obtaining the lock there used to happen under slab_lock() which implied disabling irqs on PREEMPT_RT, thus it's a raw_spinlock. With the slab_lock() removed, we can convert it to a normal spinlock. Also remove the get_map()/put_map() wrappers as list_slab_objects() became their only remaining user. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2022-09-17mm/slub: remove slab_lock() usage for debug operationsVlastimil Babka1-11/+8
All alloc and free operations on debug caches are now serialized by n->list_lock, so we can remove slab_lock() usage in validate_slab() and list_slab_objects() as those also happen under n->list_lock. Note the usage in list_slab_objects() could happen even on non-debug caches, but only during cache shutdown time, so there should not be any parallel freeing activity anymore. Except for buggy slab users, but in that case the slab_lock() would not help against the common cmpxchg based fast paths (in non-debug caches) anyway. Also adjust documentation comments accordingly. Suggested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com>
2022-09-17mm/slub: restrict sysfs validation to debug caches and make it safeVlastimil Babka1-52/+180
Rongwei Wang reports [1] that cache validation triggered by writing to /sys/kernel/slab/<cache>/validate is racy against normal cache operations (e.g. freeing) in a way that can cause false positive inconsistency reports for caches with debugging enabled. The problem is that debugging actions that mark an object free or active and actual freelist operations are not atomic, and the validation can see an inconsistent state. For caches that do or don't have debugging enabled, additional races involving n->nr_slabs are possible that result in false reports of wrong slab counts. This patch attempts to solve these issues while not adding overhead to normal (especially fastpath) operations for caches that do not have debugging enabled. Such overhead would not be justified to make possible userspace-triggered validation safe. Instead, disable the validation for caches that don't have debugging enabled and make their sysfs validate handler return -EINVAL. For caches that do have debugging enabled, we can instead extend the existing approach of not using percpu freelists to force all alloc/free operations to the slow paths where debugging flags are checked and acted upon. There we can adjust the debug-specific paths to increase n->list_lock coverage against concurrent validation as necessary. The processing on free in free_debug_processing() already happens under n->list_lock so we can extend it to actually do the freeing as well and thus make it atomic against concurrent validation. As observed by Hyeonggon Yoo, we do not really need to take slab_lock() anymore here because all paths we could race with are protected by n->list_lock under the new scheme, so drop its usage here. The processing on alloc in alloc_debug_processing() currently doesn't take any locks, but we have to first allocate the object from a slab on the partial list (as debugging caches have no percpu slabs) and thus take the n->list_lock anyway. Add a function alloc_single_from_partial() that grabs just the allocated object instead of the whole freelist, and does the debug processing. The n->list_lock coverage again makes it atomic against validation and it is also ultimately more efficient than the current grabbing of the freelist immediately followed by slab deactivation. To prevent races on n->nr_slabs updates, make sure that for caches with debugging enabled, inc_slabs_node() or dec_slabs_node() is called under n->list_lock. When allocating a new slab for a debug cache, handle the allocation by a new function alloc_single_from_new_slab() instead of the current forced deactivation path. Neither of these changes affect the fast paths at all. The changes in slow paths are negligible for non-debug caches. [1] https://lore.kernel.org/all/20220529081535.69275-1-rongwei.wang@linux.alibaba.com/ Reported-by: Rongwei Wang <rongwei.wang@linux.alibaba.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
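A condensed sketch of the sysfs side of this change, assuming SLUB's internal kmem_cache_debug() and validate_slab_cache() helpers: the validate attribute simply refuses to run on caches without debugging enabled.

```c
/* Sketch of the sysfs "validate" store handler inside mm/slub.c. */
static ssize_t validate_store(struct kmem_cache *s,
			      const char *buf, size_t length)
{
	int ret = -EINVAL;

	/* only caches with debug flags enabled can be validated now */
	if (buf[0] == '1' && kmem_cache_debug(s)) {
		ret = validate_slab_cache(s);
		if (ret >= 0)
			ret = length;
	}
	return ret;
}
```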
2022-09-17kasan: call kasan_malloc() from __kmalloc_*track_caller()Peter Collingbourne1-0/+4
We were failing to call kasan_malloc() from __kmalloc_*track_caller() which was causing us to sometimes fail to produce KASAN error reports for allocations made using e.g. devm_kcalloc(), as the KASAN poison was not being initialized. Fix it. Signed-off-by: Peter Collingbourne <pcc@google.com> Cc: <stable@vger.kernel.org> # 5.15 Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-12kfence: add sysfs interface to disable kfence for selected slabs.Imran Khan1-0/+26
By default kfence allocation can happen for any slab object, whose size is up to PAGE_SIZE, as long as that allocation is the first allocation after expiration of the kfence sample interval. But in certain debugging scenarios we may be interested in debugging corruptions involving some specific slub objects like dentry or ext4_* etc. In such cases limiting kfence for allocations involving only specific slub objects will increase the probability of catching the issue, since the kfence pool will not be consumed by other slab objects. This patch introduces a sysfs interface '/sys/kernel/slab/<name>/skip_kfence' to disable kfence for specific slabs. Having the interface work in this way does not impact the current/default behavior of kfence and allows us to use kfence for specific slabs (when needed) as well. The decision to skip/use kfence is taken depending on whether kmem_cache.flags has the (newly introduced) SLAB_SKIP_KFENCE flag set or not. Link: https://lkml.kernel.org/r/20220814195353.2540848-1-imran.f.khan@oracle.com Signed-off-by: Imran Khan <imran.f.khan@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
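A sketch of how such a toggle plugs into SLUB's existing SLAB_ATTR() sysfs machinery; the exact flag handling in the real patch may differ, and the kfence side is only noted in a comment.

```c
/* Sketch of the skip_kfence attribute inside mm/slub.c (SLAB_ATTR() is slub-internal). */
static ssize_t skip_kfence_show(struct kmem_cache *s, char *buf)
{
	return sysfs_emit(buf, "%d\n", !!(s->flags & SLAB_SKIP_KFENCE));
}

static ssize_t skip_kfence_store(struct kmem_cache *s,
				 const char *buf, size_t length)
{
	if (buf[0] == '0')
		s->flags &= ~SLAB_SKIP_KFENCE;	/* kfence_alloc() consults this flag */
	else if (buf[0] == '1')
		s->flags |= SLAB_SKIP_KFENCE;
	else
		return -EINVAL;

	return length;
}
SLAB_ATTR(skip_kfence);
```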
2022-09-09mm/slub: fix to return errno if kmalloc() failsChao Yu1-1/+4
In create_unique_id(), kmalloc(..., GFP_KERNEL) can fail due to out-of-memory; if it fails, return the errno correctly rather than triggering a panic via BUG_ON(): kernel BUG at mm/slub.c:5893! Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Call trace: sysfs_slab_add+0x258/0x260 mm/slub.c:5973 __kmem_cache_create+0x60/0x118 mm/slub.c:4899 create_cache mm/slab_common.c:229 [inline] kmem_cache_create_usercopy+0x19c/0x31c mm/slab_common.c:335 kmem_cache_create+0x1c/0x28 mm/slab_common.c:390 f2fs_kmem_cache_create fs/f2fs/f2fs.h:2766 [inline] f2fs_init_xattr_caches+0x78/0xb4 fs/f2fs/xattr.c:808 f2fs_fill_super+0x1050/0x1e0c fs/f2fs/super.c:4149 mount_bdev+0x1b8/0x210 fs/super.c:1400 f2fs_mount+0x44/0x58 fs/f2fs/super.c:4512 legacy_get_tree+0x30/0x74 fs/fs_context.c:610 vfs_get_tree+0x40/0x140 fs/super.c:1530 do_new_mount+0x1dc/0x4e4 fs/namespace.c:3040 path_mount+0x358/0x914 fs/namespace.c:3370 do_mount fs/namespace.c:3383 [inline] __do_sys_mount fs/namespace.c:3591 [inline] __se_sys_mount fs/namespace.c:3568 [inline] __arm64_sys_mount+0x2f8/0x408 fs/namespace.c:3568 Cc: <stable@kernel.org> Fixes: 81819f0fc8285 ("SLUB core") Reported-by: syzbot+81684812ea68216e08c5@syzkaller.appspotmail.com Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Chao Yu <chao.yu@oppo.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
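A sketch of the changed error handling, assuming the existing ID_STR_LENGTH constant in mm/slub.c; the id-encoding body is elided and the caller is expected to check IS_ERR() on the result.

```c
#include <linux/err.h>
#include <linux/slab.h>

/* Sketch only: the real function also encodes cache flags and size into the id. */
static char *create_unique_id(struct kmem_cache *s)
{
	char *name = kmalloc(ID_STR_LENGTH, GFP_KERNEL);

	if (!name)
		return ERR_PTR(-ENOMEM);	/* was: BUG_ON(!name); */

	/* ... build the unique id string into name ... */
	return name;
}
```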
2022-09-01mm/slab_common: drop kmem_alloc & avoid dereferencing fields when not usingHyeonggon Yoo1-5/+3
Drop the kmem_alloc event class, and define kmalloc and kmem_cache_alloc using the TRACE_EVENT() macro. And then this patch does: - Do not pass a pointer to struct kmem_cache to trace_kmalloc; the gfp flag is enough to know if it's accounted or not. - Avoid dereferencing s->object_size and s->size when not using the kmem_cache_alloc event. - Avoid dereferencing s->name when not using the kmem_cache_free event. - Adjust s->size to SLOB_UNITS(s->size) * SLOB_UNIT in SLOB Cc: Vasily Averin <vasily.averin@linux.dev> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01mm/slab_common: unify NUMA and UMA version of tracepointsHyeonggon Yoo1-3/+3
Drop the kmem_alloc event class, rename kmem_alloc_node to kmem_alloc, and remove the _node postfix from the NUMA version of the tracepoints. This will break some tools that depend on {kmem_cache_alloc,kmalloc}_node, but at this point maintaining both the kmem_alloc and kmem_alloc_node event classes does not make sense at all. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01mm/sl[au]b: cleanup kmem_cache_alloc[_node]_trace()Hyeonggon Yoo1-27/+0
Despite its name, kmem_cache_alloc[_node]_trace() is a hook for inlined kmalloc. So rename it to kmalloc[_node]_trace(). Move its implementation to slab_common.c by using __kmem_cache_alloc_node(), but keep the CONFIG_TRACING=n variants to save a function call when CONFIG_TRACING=n. Use __assume_kmalloc_alignment for kmalloc[_node]_trace instead of __assume_slab_alignment, as kmalloc generally has larger alignment requirements. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-09-01mm/sl[au]b: generalize kmalloc subsystemHyeonggon Yoo1-87/+0
Now everything in kmalloc subsystem can be generalized. Let's do it! Generalize __do_kmalloc_node(), __kmalloc_node_track_caller(), kfree(), __ksize(), __kmalloc(), __kmalloc_node() and move them to slab_common.c. In the meantime, rename kmalloc_large_node_notrace() to __kmalloc_large_node() and make it static as it's now only called in slab_common.c. [ feng.tang@intel.com: adjust kfence skip list to include __kmem_cache_free so that kfence kunit tests do not fail ] Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-25mm/slub: move free_debug_processing() furtherVlastimil Babka1-57/+57
In the following patch, the function free_debug_processing() will be calling add_partial(), remove_partial() and discard_slab(), so move it below their definitions to avoid forward declarations. To make review easier, separate the move from functional changes. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com>
2022-08-24mm/sl[au]b: introduce common alloc/free functions without tracepointHyeonggon Yoo1-0/+13
To unify the kmalloc functions in a later patch, introduce common alloc/free functions that do not have tracepoints. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/slab: kmalloc: pass requests larger than order-1 page to page allocatorHyeonggon Yoo1-19/+0
There is not much benefit to serving large objects in kmalloc(). Let's pass large requests to the page allocator like SLUB does, for better maintenance of common code. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/slab_common: kmalloc_node: pass large requests to page allocatorHyeonggon Yoo1-1/+1
Now that kmalloc_large_node() is in common code, pass large requests to page allocator in kmalloc_node() using kmalloc_large_node(). One problem is that currently there is no tracepoint in kmalloc_large_node(). Instead of simply putting tracepoint in it, use kmalloc_large_node{,_notrace} depending on its caller to show useful address for both inlined kmalloc_node() and __kmalloc_node_track_caller() when large objects are allocated. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/slub: move kmalloc_large_node() to slab_common.cHyeonggon Yoo1-25/+0
In later patch SLAB will also pass requests larger than order-1 page to page allocator. Move kmalloc_large_node() to slab_common.c. Fold kmalloc_large_node_hook() into kmalloc_large_node() as there is no other caller. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-24mm/sl[au]b: factor out __do_kmalloc_node()Hyeonggon Yoo1-52/+19
__kmalloc(), __kmalloc_node() and __kmalloc_node_track_caller() mostly do the same job. Factor out the common code into __do_kmalloc_node(). Note that this patch also fixes the missing kasan_kmalloc() in SLUB's __kmalloc_node_track_caller(). Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
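A condensed sketch of how the entry points funnel through the factored-out helper, roughly as things end up after the full series (tracing calls and internal helper signatures are simplified assumptions, not the exact code): the shared path is also the natural place for the previously missing kasan_kmalloc() call.

```c
/* Sketch: large requests bypass kmalloc caches, everything else shares one path. */
static __always_inline void *__do_kmalloc_node(size_t size, gfp_t flags,
					       int node, unsigned long caller)
{
	struct kmem_cache *s;
	void *ret;

	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
		return kmalloc_large_node(size, flags, node);

	s = kmalloc_slab(size, flags);		/* pick the kmalloc cache */
	if (unlikely(ZERO_OR_NULL_PTR(s)))
		return s;

	ret = __kmem_cache_alloc_node(s, flags, node, size, caller);
	ret = kasan_kmalloc(s, ret, size, flags);	/* was missing in the _track_caller path */
	return ret;
}

void *__kmalloc(size_t size, gfp_t flags)
{
	return __do_kmalloc_node(size, flags, NUMA_NO_NODE, _RET_IP_);
}

void *__kmalloc_node(size_t size, gfp_t flags, int node)
{
	return __do_kmalloc_node(size, flags, node, _RET_IP_);
}

void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
				  int node, unsigned long caller)
{
	return __do_kmalloc_node(size, flags, node, caller);
}
```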
2022-08-24mm/slab_common: cleanup kmalloc_track_caller()Hyeonggon Yoo1-22/+0
Make kmalloc_track_caller() a wrapper of kmalloc_node_track_caller(). Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
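The resulting wrapper is essentially a one-liner; a sketch assuming the usual NUMA_NO_NODE/_RET_IP_ convention:

```c
/* Sketch: the UMA variant just forwards to the NUMA-aware one. */
#define kmalloc_track_caller(size, flags) \
	__kmalloc_node_track_caller(size, flags, NUMA_NO_NODE, _RET_IP_)
```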
2022-08-24mm/slab_common: remove CONFIG_NUMA ifdefs for common kmalloc functionsHyeonggon Yoo1-6/+0
Now that slab_alloc_node() is available for SLAB when CONFIG_NUMA=n, remove CONFIG_NUMA ifdefs for common kmalloc functions. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-08-23mm/slub: Remove the unneeded result variableye xingchen1-7/+2
Return the value from attribute->store(s, buf, len) and attribute->show(s, buf) directly instead of storing it in another redundant variable. Reported-by: Zeal Robot <zealci@zte.com.cn> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-07-20mm/sl[au]b: use own bulk free function when bulk alloc failedHyeonggon Yoo1-2/+2
There is no benefit to calling the generic bulk free function when kmem_cache_alloc_bulk() fails; use kmem_cache_free_bulk() instead of the generic function. Note that if kmem_cache_alloc_bulk() fails to allocate the first object in SLUB, size is zero. So allow passing size == 0 to kmem_cache_free_bulk(), like SLAB's. Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
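A heavily simplified sketch of the error path only (the per-object allocation loop is elided and the real function body differs); the point is that the objects already obtained go back through kmem_cache_free_bulk(), which must therefore tolerate size == 0.

```c
int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size,
			  void **p)
{
	size_t i = 0;

	/* ... allocate objects into p[0..size), advancing i on success ... */

	if (i < size) {
		/*
		 * Free what was already allocated with kmem_cache_free_bulk()
		 * instead of the generic __kmem_cache_free_bulk(); i may be 0
		 * if even the first allocation failed.
		 */
		kmem_cache_free_bulk(s, i, p);
		return 0;
	}
	return i;
}
```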
2022-07-04mm: slab: optimize memcg_slab_free_hook()Muchun Song1-44/+22
Most callers of memcg_slab_free_hook() already know the slab, which could be passed to memcg_slab_free_hook() directly to reduce the overhead of another call to virt_to_slab(). For bulk freeing of objects, the call of slab_objcgs() in the loop in memcg_slab_free_hook() is redundant as well. Rework memcg_slab_free_hook() and build_detached_freelist() to reduce this unnecessary overhead and let memcg_slab_free_hook() handle bulk freeing in slab_free(). Move the calling site of memcg_slab_free_hook() from do_slab_free() to slab_free() for SLUB to make the code clearer, since the logic is weird (e.g. the caller needs to judge whether it needs to call memcg_slab_free_hook()). It is easy to make mistakes like the missing calls of memcg_slab_free_hook() fixed by: commit d1b2cf6cb84a ("mm: memcg/slab: uncharge during kmem_cache_free_bulk()") commit ae085d7f9365 ("mm: kfence: fix missing objcg housekeeping for SLAB") This optimization is mainly for bulk object freeing. For 16-object freeing, kmem_cache_free_bulk() takes ~430 ns before and ~400 ns after, i.e. the overhead is reduced by about 7%. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Link: https://lore.kernel.org/r/20220429123044.37885-1-songmuchun@bytedance.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-07-04mm/tracing: add 'accounted' entry into output of allocation tracepointsVasily Averin1-10/+10
Slab caches marked with SLAB_ACCOUNT force accounting for every allocation from this cache even if the __GFP_ACCOUNT flag is not passed. Unfortunately, at the moment this flag is not visible in ftrace output, and this makes it difficult to analyze the accounted allocations. This patch adds a boolean "accounted" entry to the trace output, and sets it to 'true' for calls that used the __GFP_ACCOUNT flag and for allocations from caches marked with SLAB_ACCOUNT. It is set to 'false' if accounting is disabled in the configs. Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Link: https://lore.kernel.org/r/c418ed25-65fe-f623-fbf8-1676528859ed@openvz.org Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-07-04mm/slub: Simplify __kmem_cache_alias()Xiongwei Song1-5/+3
There is no need to do anything if sysfs_slab_alias() returns a nonzero value after getting a mergeable cache. Signed-off-by: Xiongwei Song <xiongwei.song@windriver.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Link: https://lore.kernel.org/all/e5ebc952-af17-321f-5343-bc914d47c931@suse.cz/ Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
2022-06-13mm/slub: add missing TID updates on slab deactivationJann Horn1-0/+2
The fastpath in slab_alloc_node() assumes that c->slab is stable as long as the TID stays the same. However, two places in __slab_alloc() currently don't update the TID when deactivating the CPU slab. If multiple operations race the right way, this could lead to an object getting lost; or, in an even more unlikely situation, it could even lead to an object being freed onto the wrong slab's freelist, messing up the `inuse` counter and eventually causing a page to be freed to the page allocator while it still contains slab objects. (I haven't actually tested these cases though, this is just based on looking at the code. Writing testcases for this stuff seems like it'd be a pain...)

The race leading to state inconsistency is (all operations on the same CPU and kmem_cache):
- task A: begin do_slab_free(): read TID, read pcpu freelist (==NULL), check `slab == c->slab` (true)
- [PREEMPT A->B]
- task B: begin slab_alloc_node(): fastpath fails (`c->freelist` is NULL), enter __slab_alloc(), slub_get_cpu_ptr() (disables preemption), enter ___slab_alloc(), take local_lock_irqsave(), read c->freelist as NULL, get_freelist() returns NULL, write `c->slab = NULL`, drop local_unlock_irqrestore(), goto new_slab, slub_percpu_partial() is NULL, get_partial() returns NULL, slub_put_cpu_ptr() (enables preemption)
- [PREEMPT B->A]
- task A: finish do_slab_free(): this_cpu_cmpxchg_double() succeeds
- [CORRUPT STATE: c->slab==NULL, c->freelist!=NULL]

From there, the object on c->freelist will get lost if task B is allowed to continue from here: it will proceed to the retry_load_slab label, set c->slab, then jump to load_freelist, which clobbers c->freelist. But if we instead continue as follows, we get worse corruption:
- task A: run __slab_free() on object from other struct slab: CPU_PARTIAL_FREE case (slab was on no list, is now on pcpu partial)
- task A: run slab_alloc_node() with NUMA node constraint: fastpath fails (c->slab is NULL), call __slab_alloc(), slub_get_cpu_ptr() (disables preemption), enter ___slab_alloc(), c->slab is NULL: goto new_slab, slub_percpu_partial() is non-NULL, set c->slab to slub_percpu_partial(c)
- [CORRUPT STATE: c->slab points to slab-1, c->freelist has objects from slab-2]
- goto redo, node_match() fails, goto deactivate_slab, existing c->freelist is passed into deactivate_slab(), inuse count of slab-1 is decremented to account for object from slab-2

At this point, the inuse count of slab-1 is 1 lower than it should be. This means that if we free all allocated objects in slab-1 except for one, SLUB will think that slab-1 is completely unused, and may free its page, leading to use-after-free. Fixes: c17dda40a6a4e ("slub: Separate out kmem_cache_cpu processing from deactivate_slab") Fixes: 03e404af26dc2 ("slub: fast release on full slab") Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220608182205.2945720-1-jannh@google.com
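A sketch of the fix, using a hypothetical helper that condenses what the two affected deactivation sites in ___slab_alloc() do (next_tid() and the per-cpu lock are SLUB-internal): clearing c->slab must be paired with bumping the TID so the lockless cmpxchg in do_slab_free() cannot succeed against the stale state.

```c
/* Hypothetical helper for illustration only, not the actual patch. */
static void deactivate_cpu_slab(struct kmem_cache *s)
{
	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
	unsigned long flags;

	local_lock_irqsave(&s->cpu_slab->lock, flags);
	c->slab = NULL;
	c->freelist = NULL;
	c->tid = next_tid(c->tid);	/* the previously missing update */
	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
}
```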
2022-06-13mm/slub: Move the stackdepot related allocation out of IRQ-off section.Sebastian Andrzej Siewior1-7/+34
The set_track() invocation in free_debug_processing() is invoked with slab_lock() acquired. The lock disables interrupts on PREEMPT_RT, and this forbids allocating memory, which is done in stack_depot_save(). Split set_track() into two parts: set_track_prepare(), which allocates memory, and set_track_update(), which only performs the assignment of the trace data structure. Use set_track_prepare() before disabling interrupts. [ vbabka@suse.cz: make set_track() call set_track_update() instead of open-coded assignments ] Fixes: 5cf909c553e9e ("mm/slub: use stackdepot to save stack trace in objects") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/Yp9sqoUi4fVa5ExF@linutronix.de
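A sketch of the split, close to what the description above implies; struct track, get_track(), TRACK_ADDRS_COUNT and the depot handle come from the stackdepot conversion and are slub-internal, so the details here are assumptions.

```c
#include <linux/stackdepot.h>
#include <linux/stacktrace.h>

static depot_stack_handle_t set_track_prepare(void)
{
	depot_stack_handle_t handle = 0;
#ifdef CONFIG_STACKDEPOT
	unsigned long entries[TRACK_ADDRS_COUNT];
	unsigned int nr_entries;

	/* May allocate, so it must run before interrupts are disabled. */
	nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 3);
	handle = stack_depot_save(entries, nr_entries, GFP_NOWAIT);
#endif
	return handle;
}

static void set_track_update(struct kmem_cache *s, void *object,
			     enum track_item alloc, unsigned long addr,
			     depot_stack_handle_t handle)
{
	struct track *p = get_track(s, object, alloc);

#ifdef CONFIG_STACKDEPOT
	p->handle = handle;	/* only the assignment happens under the lock */
#endif
	p->addr = addr;
	p->cpu = smp_processor_id();
	p->pid = current->pid;
	p->when = jiffies;
}
```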
2022-05-25Merge tag 'slab-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slabLinus Torvalds1-66/+108
Pull slab updates from Vlastimil Babka: - Conversion of slub_debug stack traces to stackdepot, allowing more useful debugfs-based inspection for e.g. memory leak debugging. Allocation and free debugfs info now includes full traces and is sorted by the unique trace frequency. The stackdepot conversion was already attempted last year but reverted by ae14c63a9f20. The memory overhead (while not actually enabled on boot) has been meanwhile solved by making the large stackdepot allocation dynamic. The xfstest issues haven't been reproduced on current kernel locally nor in -next, so the slab cache layout changes that originally made that bug manifest were probably not the root cause. - Refactoring of dma-kmalloc caches creation. - Trivial cleanups such as removal of unused parameters, fixes and clarifications of comments. - Hyeonggon Yoo joins as a reviewer. * tag 'slab-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: MAINTAINERS: add myself as reviewer for slab mm/slub: remove unused kmem_cache_order_objects max mm: slab: fix comment for __assume_kmalloc_alignment mm: slab: fix comment for ARCH_KMALLOC_MINALIGN mm/slub: remove unneeded return value of slab_pad_check mm/slab_common: move dma-kmalloc caches creation into new_kmalloc_cache() mm/slub: remove meaningless node check in ___slab_alloc() mm/slub: remove duplicate flag in allocate_slab() mm/slub: remove unused parameter in setup_object*() mm/slab.c: fix comments slab, documentation: add description of debugfs files for SLUB caches mm/slub: sort debugfs output by frequency of stack traces mm/slub: distinguish and print stack traces in debugfs files mm/slub: use stackdepot to save stack trace in objects mm/slub: move struct track init out of set_track() lib/stackdepot: allow requesting early initialization dynamically mm/slub, kunit: Make slub_kunit unaffected by user specified flags mm/slab: remove some unused functions
2022-05-23Merge branches 'slab/for-5.19/stackdepot' and 'slab/for-5.19/refactor' into slab/for-linusVlastimil Babka1-46/+91
2022-05-02mm/slub: remove unused kmem_cache_order_objects maxMiaohe Lin1-2/+0
The max field holds the largest slab order that was ever used for a slab cache. But it's unused now. Remove it. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220429090545.33413-1-linmiaohe@huawei.com
2022-04-20mm/slub: remove unneeded return value of slab_pad_checkMiaohe Lin1-7/+5
The return value of slab_pad_check is never used. So we can make it return void now. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220419120352.37825-1-linmiaohe@huawei.com
2022-04-16mm, kfence: support kmem_dump_obj() for KFENCE objectsMarco Elver1-1/+1
Calling kmem_obj_info() via kmem_dump_obj() on KFENCE objects has been producing garbage data due to the object not actually being maintained by SLAB or SLUB. Fix this by implementing __kfence_obj_info() that copies relevant information to struct kmem_obj_info when the object was allocated by KFENCE; this is called by a common kmem_obj_info(), which also calls the slab/slub/slob specific variant now called __kmem_obj_info(). For completeness, kmem_dump_obj() now displays if the object was allocated by KFENCE. Link: https://lore.kernel.org/all/20220323090520.GG16885@xsang-OptiPlex-9020/ Link: https://lkml.kernel.org/r/20220406131558.3558585-1-elver@google.com Fixes: b89fb5ef0ce6 ("mm, kfence: insert KFENCE hooks for SLUB") Fixes: d3fb45f370d9 ("mm, kfence: insert KFENCE hooks for SLAB") Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reported-by: kernel test robot <oliver.sang@intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> [slab] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-04-13mm/slub: remove meaningless node check in ___slab_alloc()JaeSang Yoo1-1/+0
node_match() with node=NUMA_NO_NODE always returns 1. Duplicate check by goto statement is meaningless. Remove it. Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220409144239.2649257-1-jsyoo5b@gmail.com
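For reference, a sketch of node_match() roughly as it has looked in mm/slub.c (slab_nid() returns the slab's NUMA node): with node == NUMA_NO_NODE it always returns 1, so the removed duplicate check could never trigger.

```c
static inline int node_match(struct slab *slab, int node)
{
#ifdef CONFIG_NUMA
	if (node != NUMA_NO_NODE && slab_nid(slab) != node)
		return 0;
#endif
	return 1;	/* NUMA_NO_NODE always matches */
}
```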
2022-04-13mm/slub: remove duplicate flag in allocate_slab()Jiyoup Kim1-1/+1
In allocate_slab(), __GFP_NOFAIL flag is removed twice when trying higher-order allocation. Remove it. Signed-off-by: Jiyoup Kim <lakroforce@gmail.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220409150538.1264-1-lakroforce@gmail.com
2022-04-13mm/slub: remove unused parameter in setup_object*()JaeSang Yoo1-11/+8
setup_object_debug() and setup_object() have an unused parameter, "struct slab *slab". Remove it. By commit 3ec0974210fe ("SLUB: Simplify debug code"), setup_object_debug() was introduced to refactor previous code blocks in setup_object(). The previous code used SlabDebug() to init_object() and init_tracking(). As SlabDebug() takes "struct page *page" as an argument, setup_object_debug() checks a flag of "struct kmem_cache *s", which doesn't require "struct page *page". Although struct page was changed into struct slab by commit bb192ed9aa719 ("mm/slub: Convert most struct page to struct slab by spatch"), it's still an unused parameter. Suggested-by: Ohhoon Kwon <ohkwon1043@gmail.com> Signed-off-by: JaeSang Yoo <jsyoo5b@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220411072534.3372768-1-jsyoo5b@gmail.com
2022-04-06mm/slub: sort debugfs output by frequency of stack tracesOliver Glitta1-0/+16
Sort the output of debugfs alloc_traces and free_traces by the frequency of allocation/freeing stack traces. Most frequently used stack traces will be printed first, e.g. for easier memory leak debugging. Signed-off-by: Oliver Glitta <glittao@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com>
2022-04-06mm/slub: distinguish and print stack traces in debugfs filesOliver Glitta1-2/+26
Aggregate objects in slub cache by unique stack trace in addition to caller address when producing contents of debugfs files alloc_traces and free_traces in debugfs. Also add the stack traces to the debugfs output. This makes it much more useful to e.g. debug memory leaks. Signed-off-by: Oliver Glitta <glittao@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
2022-04-06mm/slub: use stackdepot to save stack trace in objectsOliver Glitta1-31/+40
Many stack traces are similar so there are many similar arrays. Stackdepot saves each unique stack only once. Replace field addrs in struct track with depot_stack_handle_t handle. Use stackdepot to save stack trace. The benefits are smaller memory overhead and possibility to aggregate per-cache statistics in the following patch using the stackdepot handle instead of matching stacks manually. [ vbabka@suse.cz: rebase to 5.17-rc1 and adjust accordingly ] This was initially merged as commit 788691464c29 and reverted by commit ae14c63a9f20 due to several issues, that should now be fixed. The problem of unconditional memory overhead by stackdepot has been addressed by commit 2dba5eb1c73b ("lib/stackdepot: allow optional init and stack_table allocation by kvmalloc()"), so the dependency on stackdepot will result in extra memory usage only when a slab cache tracking is actually enabled, and not for all CONFIG_SLUB_DEBUG builds. The build failures on some architectures were also addressed, and the reported issue with xfs/433 test did not reproduce on 5.17-rc1 with this patch. Signed-off-by: Oliver Glitta <glittao@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
2022-04-06mm/slub: move struct track init out of set_track()Vlastimil Babka1-17/+15
set_track() either zeroes out the struct track or fills it, depending on the addr parameter. This is unnecessary as there's only one place that calls it for the initialization - init_tracking(). We can simply do the zeroing there, with a single memset() that covers both TRACK_ALLOC and TRACK_FREE as they are adjacent. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-and-tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Acked-by: David Rientjes <rientjes@google.com>
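A sketch of the resulting init_tracking(), assuming the existing get_track() helper and SLAB_STORE_USER flag: one memset() covers both slots because TRACK_ALLOC and TRACK_FREE are laid out back to back.

```c
static void init_tracking(struct kmem_cache *s, void *object)
{
	struct track *p;

	if (!(s->flags & SLAB_STORE_USER))
		return;

	p = get_track(s, object, TRACK_ALLOC);
	/* TRACK_ALLOC and TRACK_FREE are adjacent, so one memset clears both */
	memset(p, 0, 2 * sizeof(struct track));
}
```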
2022-04-06mm/slub, kunit: Make slub_kunit unaffected by user specified flagsHyeonggon Yoo1-0/+3
slub_kunit does not expect other debugging flags to be set when running tests. When SLAB_RED_ZONE flag is set globally, test fails because the flag affects number of errors reported. To make slub_kunit unaffected by user specified debugging flags, introduce SLAB_NO_USER_FLAGS to ignore them. With this flag, only flags specified in the code are used and others are ignored. Suggested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/Yk0sY9yoJhFEXWOg@hyeyoo
2022-03-23Merge tag 'slab-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slabLinus Torvalds1-78/+52
Pull slab updates from Vlastimil Babka: - A few non-trivial SLUB code cleanups, most notably a refactoring of deactivate_slab(). - A bunch of trivial changes, such as removal of unused parameters, making stuff static, and employing helper functions. * tag 'slab-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm: slub: Delete useless parameter of alloc_slab_page() mm: slab: Delete unused SLAB_DEACTIVATED flag mm/slub: remove forced_order parameter in calculate_sizes mm/slub: refactor deactivate_slab() mm/slub: limit number of node partial slabs only in cache creation mm/slub: use helper macro __ATTR_XX_MODE for SLAB_ATTR(_RO) mm/slab_common: use helper function is_power_of_2() mm/slob: make kmem_cache_boot static
2022-03-23mm: introduce kmem_cache_alloc_lruMuchun Song1-14/+28
We currently allocate scope for every memcg to be able to be tracked on every superblock instantiated in the system, regardless of whether that superblock is even accessible to that memcg. These huge memcg counts come from container hosts where memcgs are confined to just a small subset of the total number of superblocks that are instantiated at any given point in time. For these systems with huge container counts, list_lru does not need the capability of tracking every memcg on every superblock. What it comes down to is adding the memcg to the list_lru at the first insert. So introduce kmem_cache_alloc_lru to allocate objects and its list_lru. In a later patch, we will convert all inode and dentry allocation from kmem_cache_alloc to kmem_cache_alloc_lru. Link: https://lkml.kernel.org/r/20220228122126.37293-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
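Illustrative call shape only; the dentry/inode conversions land in a later patch of the series, and the wrapper below is hypothetical. The idea is that callers which will insert the object on a list_lru allocate through the _lru variant, so the memcg-aware lru infrastructure is set up lazily at the first allocation.

```c
#include <linux/fs.h>
#include <linux/slab.h>

/* Hypothetical wrapper showing the intended call pattern, not kernel code. */
static void *alloc_tracked_on_lru(struct kmem_cache *cache,
				  struct super_block *sb, gfp_t gfp)
{
	/* the object is expected to be inserted on sb->s_dentry_lru later */
	return kmem_cache_alloc_lru(cache, &sb->s_dentry_lru, gfp);
}
```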
2022-03-21Merge branch 'slab/for-5.18/cleanups' into slab/for-linusVlastimil Babka1-63/+42
Non-trivial SLUB code cleanups, notably refactoring of deactivate_slab().
2022-03-10mm: slub: Delete useless parameter of alloc_slab_page()Xiongwei Song1-4/+4
The parameter @s is useless for alloc_slab_page(). It was added in 2014 by commit 5dfb41750992 ("sl[au]b: charge slabs to kmemcg explicitly"). The need for it was removed in 2020 by commit 1f3147b49d75 ("mm: slub: call account_slab_page() after slab page initialization"). Let's delete it. [willy@infradead.org: Added detailed history of @s] Signed-off-by: Xiongwei Song <sxwjean@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220310140701.87908-3-sxwjean@me.com
2022-03-09mm/slub: remove forced_order parameter in calculate_sizesMiaohe Lin1-7/+4
Since commit 32a6f409b693 ("mm, slub: remove runtime allocation order changes"), forced_order is always -1. Remove this unneeded parameter to simplify the code. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220309092036.50844-1-linmiaohe@huawei.com
2022-03-09mm/slub: refactor deactivate_slab()Hyeonggon Yoo1-52/+39
Simplify deactivate_slab() by unlocking n->list_lock and retrying cmpxchg_double() when cmpxchg_double() fails, and perform add_{partial,full} only when it succeeds. Releasing and taking n->list_lock again here is not harmful as SLUB avoids deactivating slabs as much as possible. [ vbabka@suse.cz: perform add_{partial,full} when cmpxchg_double() succeeds. count deactivating full slabs even if debugging flag is not set. ] Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Link: https://lore.kernel.org/r/20220307074057.902222-3-42.hyeyoo@gmail.com
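A condensed sketch of the refactored flow (stat counters and the freelist/counter computation are elided; names are the slub-internal ones, and the helper name itself is hypothetical): list manipulation happens only after cmpxchg_double_slab() succeeds, and on failure the list_lock is simply dropped before retrying.

```c
static void deactivate_slab_sketch(struct kmem_cache *s, struct slab *slab,
				   void *freelist)
{
	struct kmem_cache_node *n = get_node(s, slab_nid(slab));
	enum { M_NONE, M_PARTIAL, M_FULL, M_FREE } mode = M_NONE;
	struct slab old, new;
	unsigned long flags = 0;

redo:
	/* ... fold freelist into the slab, compute old/new counters, pick mode ... */

	if (mode == M_PARTIAL || mode == M_FULL)
		spin_lock_irqsave(&n->list_lock, flags);

	if (!cmpxchg_double_slab(s, slab, old.freelist, old.counters,
				 new.freelist, new.counters, "unfreezing slab")) {
		if (mode == M_PARTIAL || mode == M_FULL)
			spin_unlock_irqrestore(&n->list_lock, flags);
		goto redo;		/* retry with the lock released */
	}

	/* list changes only after the cmpxchg has succeeded */
	if (mode == M_PARTIAL) {
		add_partial(n, slab, DEACTIVATE_TO_TAIL);
		spin_unlock_irqrestore(&n->list_lock, flags);
	} else if (mode == M_FULL) {
		add_full(s, n, slab);
		spin_unlock_irqrestore(&n->list_lock, flags);
	} else if (mode == M_FREE) {
		discard_slab(s, slab);
	}
}
```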