<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/mm/mm_init.c, branch v6.19.11</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v6.19.11</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v6.19.11'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-01-27T03:03:48+00:00</updated>
<entry>
<title>mm/mm_init: don't cond_resched() in deferred_init_memmap_chunk() if called from deferred_grow_zone()</title>
<updated>2026-01-27T03:03:48+00:00</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2026-01-22T18:43:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=cbbbf7795fc35a740e1d09f3a55aba543b72a8d3'/>
<id>urn:sha1:cbbbf7795fc35a740e1d09f3a55aba543b72a8d3</id>
<content type='text'>
Commit 3acb913c9d5b ("mm/mm_init: use deferred_init_memmap_chunk() in
deferred_grow_zone()") made deferred_grow_zone() call
deferred_init_memmap_chunk() within a pgdat_resize_lock() critical section
with irqs disabled.  That commit checked irqs_disabled() in
deferred_init_memmap_chunk() to avoid calling cond_resched().  On a
PREEMPT_RT kernel build, however, spin_lock_irqsave() does not disable
interrupts; instead, rcu_read_lock() is called.  This leads to the
following bug report.

  BUG: sleeping function called from invalid context at mm/mm_init.c:2091
  in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
  preempt_count: 0, expected: 0
  RCU nest depth: 1, expected: 0
  3 locks held by swapper/0/1:
   #0: ffff80008471b7a0 (sched_domains_mutex){+.+.}-{4:4}, at: sched_domains_mutex_lock+0x28/0x40
   #1: ffff003bdfffef48 (&amp;pgdat-&gt;node_size_lock){+.+.}-{3:3}, at: deferred_grow_zone+0x140/0x278
   #2: ffff800084acf600 (rcu_read_lock){....}-{1:3}, at: rt_spin_lock+0x1b4/0x408
  CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G        W           6.19.0-rc6-test #1 PREEMPT_{RT,(full)}
  Tainted: [W]=WARN
  Call trace:
   show_stack+0x20/0x38 (C)
   dump_stack_lvl+0xdc/0xf8
   dump_stack+0x1c/0x28
   __might_resched+0x384/0x530
   deferred_init_memmap_chunk+0x560/0x688
   deferred_grow_zone+0x190/0x278
   _deferred_grow_zone+0x18/0x30
   get_page_from_freelist+0x780/0xf78
   __alloc_frozen_pages_noprof+0x1dc/0x348
   alloc_slab_page+0x30/0x110
   allocate_slab+0x98/0x2a0
   new_slab+0x4c/0x80
   ___slab_alloc+0x5a4/0x770
   __slab_alloc.constprop.0+0x88/0x1e0
   __kmalloc_node_noprof+0x2c0/0x598
   __sdt_alloc+0x3b8/0x728
   build_sched_domains+0xe0/0x1260
   sched_init_domains+0x14c/0x1c8
   sched_init_smp+0x9c/0x1d0
   kernel_init_freeable+0x218/0x358
   kernel_init+0x28/0x208
   ret_from_fork+0x10/0x20

Fix it by adding a new argument to deferred_init_memmap_chunk() that
explicitly tells it whether cond_resched() is allowed, instead of relying
on current-state information that may vary with the exact kernel
configuration options that are enabled.
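
A minimal sketch of the shape of this fix (the parameter name and call
sites below are illustrative assumptions, not the exact upstream diff):

  /*
   * Sketch: callers state explicitly whether rescheduling is safe,
   * instead of the helper guessing from irqs_disabled().
   */
  static unsigned long __init
  deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
                             struct zone *zone, bool may_resched)
  {
          unsigned long nr_pages = 0;

          /* ... initialize and free struct pages in this chunk ... */
          if (may_resched)
                  cond_resched();
          return nr_pages;
  }

  /* deferred_grow_zone() holds pgdat_resize_lock(): no resched. */
  deferred_init_memmap_chunk(spfn, epfn, zone, false);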

Link: https://lkml.kernel.org/r/20260122184343.546627-1-longman@redhat.com
Fixes: 3acb913c9d5b ("mm/mm_init: use deferred_init_memmap_chunk() in deferred_grow_zone()")
Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Suggested-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Reviewed-by: Sebastian Andrzej Siewior &lt;bigeasy@linutronix.de&gt;
Reviewed-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Cc: David Hildenbrand &lt;david@kernel.org&gt;
Cc: "Paul E . McKenney" &lt;paulmck@kernel.org&gt;
Cc: Steven Rostedt &lt;rostedt@goodmis.org&gt;
Cc: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'memblock-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock</title>
<updated>2025-12-07T16:56:10+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-12-07T16:56:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=67a454e6b1c604555c04501c77b7fedc5d98a779'/>
<id>urn:sha1:67a454e6b1c604555c04501c77b7fedc5d98a779</id>
<content type='text'>
Pull memblock update from Mike Rapoport:
 "Introduce a 'check_pages' boot parameter to decouple simple checks for
  page state on allocation and free from CONFIG_DEBUG_VM.

  This allows enabling page checking without building the kernel with
  CONFIG_DEBUG_VM or forcing init_on_{alloc, free} or other heavier
  mechanisms"

* tag 'memblock-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  mm/mm_init: Introduce a boot parameter for check_pages
</content>
</entry>
<entry>
<title>mm/mm_init: Introduce a boot parameter for check_pages</title>
<updated>2025-12-04T17:40:25+00:00</updated>
<author>
<name>Joshua Hahn</name>
<email>joshua.hahnjy@gmail.com</email>
</author>
<published>2025-12-01T18:07:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=83c8f7b5e194eaf3fb268c513e23e23e892de8ed'/>
<id>urn:sha1:83c8f7b5e194eaf3fb268c513e23e23e892de8ed</id>
<content type='text'>
Use-after-free and double-free bugs can be very difficult to track down.
The kernel is good at tracking these and preventing bad pages from being
used/created through simple checks gated behind "check_pages_enabled".

Currently, the only ways to enable this flag are building with
CONFIG_DEBUG_VM or relying on the side effects of other checks such as
init_on_{alloc, free}, page_poisoning, or debug_pagealloc.  These
solutions are powerful, but are often too coarse in balancing the
performance vs. safety trade-off a user may want, particularly in
latency-sensitive production environments.

Introduce a new boot parameter "check_pages", which enables page checking
with no other side effects.  It takes kstrtobool-able inputs as an
argument (e.g. 0/1, true/false, on/off, ...).  This patch is
backwards-compatible; setting CONFIG_DEBUG_VM still enables page checking.
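
A minimal sketch of how such a parameter is typically wired up (the
handler below, and the point at which the static key is flipped, are
illustrative assumptions rather than the exact upstream code):

  /* Sketch: parse "check_pages=" from the kernel command line. */
  static int __init check_pages_param(char *str)
  {
          bool enable;

          if (kstrtobool(str, &amp;enable))
                  return -EINVAL;
          if (enable)
                  static_branch_enable(&amp;check_pages_enabled);
          return 0;
  }
  early_param("check_pages", check_pages_param);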

Acked-by: SeongJae Park &lt;sj@kernel.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Reviewed-by: Anshuman Khandual &lt;anshuman.khandual@arm.com&gt;
Signed-off-by: Joshua Hahn &lt;joshua.hahnjy@gmail.com&gt;
Link: https://patch.msgid.link/20251201180739.2330474-1-joshua.hahnjy@gmail.com
Signed-off-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
</content>
</entry>
<entry>
<title>drivers/base/node: fold register_node() into register_one_node()</title>
<updated>2025-11-17T01:28:02+00:00</updated>
<author>
<name>Donet Tom</name>
<email>donettom@linux.ibm.com</email>
</author>
<published>2025-10-14T15:39:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=eb8762dc220c0b0573100a941bfc68df34ece74f'/>
<id>urn:sha1:eb8762dc220c0b0573100a941bfc68df34ece74f</id>
<content type='text'>
Patch series "drivers/base/node: fold node register and unregister
functions", v2.

The first patch merges register_one_node() and register_node(), leaving a
single register_node() function.

The second patch merges unregister_one_node() and unregister_node(),
leaving a single unregister_node() function.

There are no functional changes in these patches.


This patch (of 2):

register_node() is only called from register_one_node().  This patch folds
register_node() into its only caller and renames register_one_node() to
register_node().

This reduces unnecessary indirection and simplifies the code structure. 
No functional changes are introduced.
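
A generic sketch of the fold (signatures are illustrative; bodies are
abbreviated and not the actual driver code):

  /* Before: register_one_node() was a thin wrapper. */
  int register_one_node(int nid)
  {
          return register_node(node_devices[nid], nid);
  }

  /* After: a single register_node() takes the nid directly. */
  int register_node(int nid)
  {
          /* former register_node() body, inlined here */
          return 0;
  }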

[akpm@linux-foundation.org: fix kerneldoc, per David]
Link: https://lkml.kernel.org/r/cover.1760097207.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/910853c9dd61f7a2190a56cba101e73e9c6859be.1760097207.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom &lt;donettom@linux.ibm.com&gt;
Acked-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Acked-by: SeongJae Park &lt;sj@kernel.org&gt;
Acked-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Aboorva Devarajan &lt;aboorvad@linux.ibm.com&gt;
Cc: Christophe Leroy &lt;christophe.leroy@csgroup.eu&gt;
Cc: Danilo Krummrich &lt;dakr@kernel.org&gt;
Cc: Dave Jiang &lt;dave.jiang@intel.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Madhavan Srinivasan &lt;maddy@linux.ibm.com&gt;
Cc: Oscar Salvador &lt;osalvador@suse.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: "Ritesh Harjani (IBM)" &lt;ritesh.list@gmail.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/mm_init: fix hash table order logging in alloc_large_system_hash()</title>
<updated>2025-11-10T05:19:44+00:00</updated>
<author>
<name>Isaac J. Manjarres</name>
<email>isaacmanjarres@google.com</email>
</author>
<published>2025-10-28T19:10:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=0d6c356dd6547adac2b06b461528e3573f52d953'/>
<id>urn:sha1:0d6c356dd6547adac2b06b461528e3573f52d953</id>
<content type='text'>
When emitting the order of the allocation for a hash table,
alloc_large_system_hash() unconditionally subtracts PAGE_SHIFT from log
base 2 of the allocation size.  This is not correct if the allocation size
is smaller than a page, and yields a negative value for the order as seen
below:

  TCP established hash table entries: 32 (order: -4, 256 bytes, linear)
  TCP bind hash table entries: 32 (order: -2, 1024 bytes, linear)

Use get_order() to compute the order when emitting the hash table
information to correctly handle cases where the allocation size is smaller
than a page:

  TCP established hash table entries: 32 (order: 0, 256 bytes, linear)
  TCP bind hash table entries: 32 (order: 0, 1024 bytes, linear)
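
The arithmetic behind the fix, in sketch form (illustrative fragment,
not the verbatim alloc_large_system_hash() code):

  /* Before: goes negative once size drops below PAGE_SIZE. */
  order = ilog2(size) - PAGE_SHIFT;   /* 256 bytes: 8 - 12 = -4 */

  /* After: get_order() never returns less than 0. */
  order = get_order(size);            /* 256 bytes: order 0 */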

Link: https://lkml.kernel.org/r/20251028191020.413002-1-isaacmanjarres@google.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Isaac J. Manjarres &lt;isaacmanjarres@google.com&gt;
Reviewed-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'memblock-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock</title>
<updated>2025-10-04T18:03:10+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2025-10-04T18:03:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b41048485ee395edbbb69fc83491d314268f7bdb'/>
<id>urn:sha1:b41048485ee395edbbb69fc83491d314268f7bdb</id>
<content type='text'>
Pull mm-init update from Mike Rapoport:
 "Simplify deferred initialization of struct pages

  Refactor and simplify deferred initialization of the memory map.

  Besides the negative diffstat, it gives a 3ms (55ms vs 58ms) reduction
  in the initialization of deferred pages on a single-node system with
  64GiB of RAM"

* tag 'memblock-v6.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  memblock: drop for_each_free_mem_pfn_range_in_zone_from()
  mm/mm_init: drop deferred_init_maxorder()
  mm/mm_init: deferred_init_memmap: use a job per zone
  mm/mm_init: use deferred_init_memmap_chunk() in deferred_grow_zone()
</content>
</entry>
<entry>
<title>mm/mm_init: make memmap_init_compound() look more like prep_compound_page()</title>
<updated>2025-09-21T21:22:03+00:00</updated>
<author>
<name>David Hildenbrand</name>
<email>david@redhat.com</email>
</author>
<published>2025-09-01T15:03:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=50765b46ab44544f661fb70923871f8d06381015'/>
<id>urn:sha1:50765b46ab44544f661fb70923871f8d06381015</id>
<content type='text'>
Grepping for "prep_compound_page" leaves one clueless about how devdax
gets its compound pages initialized.

Let's add a comment that might help finding this open-coded
prep_compound_page() initialization more easily.

Further, let's be less smart about the ordering of initialization and just
perform the prep_compound_head() call after all tail pages have been
initialized, just like prep_compound_page() does.
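
A sketch of the resulting order (the function below is a simplified,
hypothetical condensation of the devdax memmap-init path):

  static void __init sketch_init_compound(struct page *head,
                                          unsigned long nr_pages,
                                          unsigned int order)
  {
          unsigned long i;

          /* Initialize all tail pages first ... */
          for (i = 1; i &lt; nr_pages; i++)
                  prep_compound_tail(head, i);
          /* ... then the head, just like prep_compound_page(). */
          prep_compound_head(head, order);
  }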

No need for a comment to describe the initialization order: again, just
like prep_compound_page().

Link: https://lkml.kernel.org/r/20250901150359.867252-10-david@redhat.com
Signed-off-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
Reviewed-by: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Reviewed-by: Lorenzo Stoakes &lt;lorenzo.stoakes@oracle.com&gt;
Acked-by: Liam R. Howlett &lt;Liam.Howlett@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/mm_init: drop deferred_init_maxorder()</title>
<updated>2025-09-14T05:48:59+00:00</updated>
<author>
<name>Mike Rapoport (Microsoft)</name>
<email>rppt@kernel.org</email>
</author>
<published>2025-08-18T06:46:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=219f624d0690459440c5c4d4ebfc54f0d440d615'/>
<id>urn:sha1:219f624d0690459440c5c4d4ebfc54f0d440d615</id>
<content type='text'>
deferred_init_memmap_chunk() calls deferred_init_maxorder() to initialize
struct pages in MAX_ORDER_NR_PAGES chunks because, according to commit
0e56acae4b4d ("mm: initialize MAX_ORDER_NR_PAGES at a time instead of
doing larger sections"), this provides better cache locality than
initializing the memory map in larger sections.

Looping through free memory ranges is quite cumbersome in the current
implementation, as it is divided between deferred_init_memmap_chunk() and
deferred_init_maxorder(). Besides, the latter has two loops: one that
initializes struct pages and another that frees them.

There is no need for two loops because it is safe to free pages in groups
smaller than MAX_ORDER_NR_PAGES. Even if the lookup for a buddy page
accesses a struct page ahead of the pages being initialized, that page is
guaranteed to be initialized either by memmap_init_reserved_pages() or by
init_unavailable_range().

Simplify the code by moving initialization and freeing of the pages into
deferred_init_memmap_chunk() and dropping deferred_init_maxorder().
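
A rough sketch of the simplified shape (the range-iteration helper is
illustrative; deferred_init_pages() and deferred_free_pages() are the
existing mm/mm_init.c helpers):

  /* One pass per free range: init the struct pages, then free them. */
  for_each_free_range_in_chunk(zone, start_pfn, end_pfn, &amp;spfn, &amp;epfn) {
          nr_pages += deferred_init_pages(zone, spfn, epfn);
          deferred_free_pages(spfn, epfn - spfn);
  }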

Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Signed-off-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm/mm_init: deferred_init_memmap: use a job per zone</title>
<updated>2025-09-14T05:48:55+00:00</updated>
<author>
<name>Mike Rapoport (Microsoft)</name>
<email>rppt@kernel.org</email>
</author>
<published>2025-08-18T06:46:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=f1f86187fd72332ef214716a3c5b71616c0d340e'/>
<id>urn:sha1:f1f86187fd72332ef214716a3c5b71616c0d340e</id>
<content type='text'>
deferred_init_memmap() loops over free memory ranges and creates a
padata_mt_job for every free range that intersects with the zone being
initialized.

padata_do_multithreaded() then splits every such range into several chunks
and runs a thread that initializes struct pages in each chunk using
deferred_init_memmap_chunk(). The number of threads is limited by the
number of CPUs on the node (or 1 for memoryless nodes).

The loop over free memory ranges is then repeated in
deferred_init_memmap_chunk(): first to find the first range that should
be initialized, and then to traverse the ranges until the end of the
chunk is reached.

Remove the loop over free memory regions in deferred_init_memmap() and pass
the entire zone to padata_do_multithreaded() so that it is divided into
several chunks by the parallelization code.
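
A sketch of the one-job-per-zone setup (field values are illustrative;
struct padata_mt_job is the real padata interface):

  struct padata_mt_job job = {
          .thread_fn   = deferred_init_memmap_job,
          .fn_arg      = zone,
          .start       = spfn,             /* first pfn of the zone */
          .size        = epfn - spfn,      /* whole zone, in pages  */
          .align       = PAGES_PER_SECTION,
          .min_chunk   = PAGES_PER_SECTION,
          .max_threads = max_threads,
  };

  padata_do_multithreaded(&amp;job);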

Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Signed-off-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
</content>
</entry>
<entry>
<title>mm/mm_init: use deferred_init_memmap_chunk() in deferred_grow_zone()</title>
<updated>2025-09-14T05:48:48+00:00</updated>
<author>
<name>Mike Rapoport (Microsoft)</name>
<email>rppt@kernel.org</email>
</author>
<published>2025-08-18T06:46:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3acb913c9d5bd8dcf61677ccba96d349e7681cf8'/>
<id>urn:sha1:3acb913c9d5bd8dcf61677ccba96d349e7681cf8</id>
<content type='text'>
deferred_grow_zone() initializes one or more sections in the memory map
if the buddy allocator runs out of initialized struct pages when
CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.

It loops through memblock regions and initializes and frees pages in
MAX_ORDER_NR_PAGES chunks.

Essentially the same loop is implemented in deferred_init_memmap_chunk();
the only actual difference is that deferred_init_memmap_chunk() does not
count initialized pages.

Make deferred_init_memmap_chunk() count the initialized pages and return
their number, wrap it with deferred_init_memmap_job() for multithreaded
initialization with padata_do_multithreaded(), and replace the open-coded
initialization of struct pages in deferred_grow_zone() with a call to
deferred_init_memmap_chunk().
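
A sketch of the wrapper (the signature follows padata's thread_fn
convention; the body is abbreviated):

  /* padata thread function: adapts the counting chunk helper. */
  static void __init deferred_init_memmap_job(unsigned long start_pfn,
                                              unsigned long end_pfn,
                                              void *arg)
  {
          struct zone *zone = arg;

          deferred_init_memmap_chunk(start_pfn, end_pfn, zone);
  }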

Reviewed-by: David Hildenbrand &lt;david@redhat.com&gt;
Reviewed-by: Wei Yang &lt;richard.weiyang@gmail.com&gt;
Signed-off-by: Mike Rapoport (Microsoft) &lt;rppt@kernel.org&gt;
</content>
</entry>
</feed>
