From 4f745def815d86eece3614aa0cd10a25cf94fed4 Mon Sep 17 00:00:00 2001 From: Donet Tom Date: Wed, 28 May 2025 12:18:00 -0500 Subject: drivers/base/node: optimize memory block registration to reduce boot time MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Patch series "drivers/base/node.c: optimization and cleanups", v7. This patch (of 7) During node device initialization, `memory blocks` are registered under each NUMA node. The `memory blocks` to be registered are identified using the node's start and end PFNs, which are obtained from the node's pg_data However, not all PFNs within this range necessarily belong to the same node—some may belong to other nodes. Additionally, due to the discontiguous nature of physical memory, certain sections within a `memory block` may be absent. As a result, `memory blocks` that fall between a node's start and end PFNs may span across multiple nodes, and some sections within those blocks may be missing. `Memory blocks` have a fixed size, which is architecture dependent. Due to these considerations, the memory block registration is currently performed as follows: for_each_online_node(nid): start_pfn = pgdat->node_start_pfn; end_pfn = pgdat->node_start_pfn + node_spanned_pages; for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn)) mem_blk = memory_block_id(pfn_to_section_nr(pfn)); pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr) pfn_mb_end = pfn_start + memory_block_pfns - 1 for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++): if (get_nid_for_pfn(pfn) != nid): continue; else do_register_memory_block_under_node(nid, mem_blk, MEMINIT_EARLY); Here, we derive the start and end PFNs from the node's pg_data, then determine the memory blocks that may belong to the node. For each `memory block` in this range, we inspect all PFNs it contains and check their associated NUMA node ID. If a PFN within the block matches the current node, the memory block is registered under that node. If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs a binary search in the `memblock regions` to determine the NUMA node ID for a given PFN. If it is not enabled, the node ID is retrieved directly from the struct page. On large systems, this process can become time-consuming, especially since we iterate over each `memory block` and all PFNs within it until a match is found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the additional overhead of the binary search increases the execution time significantly, potentially leading to soft lockups during boot. In this patch, we iterate over `memblock region` to identify the `memory blocks` that belong to the current NUMA node. `memblock regions` are contiguous memory ranges, each associated with a single NUMA node, and they do not span across multiple nodes. for_each_memory_region(r): // r => region if (!node_online(r->nid)): continue; else for_each_memory_block_between(r->base, r->base + r->size - 1): do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY); We iterate over all memblock regions, and if the node associated with the region is online, we calculate the start and end memory blocks based on the region's start and end PFNs. We then register all the memory blocks within that range under the region node. Test Results on My system with 32TB RAM ======================================= 1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled. Without this patch ------------------ Startup finished in 1min 16.528s (kernel) With this patch --------------- Startup finished in 17.236s (kernel) - 78% Improvement 2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled. Without this patch ------------------ Startup finished in 28.320s (kernel) With this patch --------------- Startup finished in 15.621s (kernel) - 46% Improvement [donettom@linux.ibm.com: restore removed extra line] Link: https://lkml.kernel.org/r/20250609140354.467908-1-donettom@linux.ibm.com Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com Signed-off-by: Donet Tom Acked-by: David Hildenbrand Acked-by: Oscar Salvador Acked-by: Mike Rapoport (Microsoft) Acked-by: Zi Yan Signed-off-by: Andrew Morton --- include/linux/node.h | 3 +++ 1 file changed, 3 insertions(+) (limited to 'include/linux/node.h') diff --git a/include/linux/node.h b/include/linux/node.h index 2b7517892230..485370f3bc17 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -120,6 +120,9 @@ static inline void register_memory_blocks_under_node(int nid, unsigned long star enum meminit_context context) { } +static inline void register_memory_blocks_under_nodes(void) +{ +} #endif extern void unregister_node(struct node *node); -- cgit v1.2.3 From ac24f6cd87d88150fc6c1fef904794571f62dc5e Mon Sep 17 00:00:00 2001 From: Donet Tom Date: Wed, 28 May 2025 12:18:02 -0500 Subject: drivers/base/node: remove register_memory_blocks_under_node() function call from register_one_node MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit register_one_node() is now only called via cpu_up() → __try_online_node() during CPU hotplug operations to online a node. At this stage, the node has not yet had any memory added. As a result, there are no memory blocks to walk or register, so calling register_memory_blocks_under_node() is unnecessary. Therefore, the call to register_memory_blocks_under_node() has been removed from register_one_node(). Link: https://lkml.kernel.org/r/ecf07075b1a41015fcf58823997d5c2ed7b8c18f.1748452242.git.donettom@linux.ibm.com Signed-off-by: Donet Tom Acked-by: Oscar Salvador Acked-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Cc: Zi Yan Signed-off-by: Andrew Morton --- include/linux/node.h | 16 +--------------- 1 file changed, 1 insertion(+), 15 deletions(-) (limited to 'include/linux/node.h') diff --git a/include/linux/node.h b/include/linux/node.h index 485370f3bc17..b15de78e0408 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -134,21 +134,7 @@ extern int __register_one_node(int nid); /* Registers an online node */ static inline int register_one_node(int nid) { - int error = 0; - - if (node_online(nid)) { - struct pglist_data *pgdat = NODE_DATA(nid); - unsigned long start_pfn = pgdat->node_start_pfn; - unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages; - - error = __register_one_node(nid); - if (error) - return error; - register_memory_blocks_under_node(nid, start_pfn, end_pfn, - MEMINIT_EARLY); - } - - return error; + return __register_one_node(nid); } extern void unregister_one_node(int nid); -- cgit v1.2.3 From 10f09d82f8b7c25bbd0b6d5142ff6df6e634132d Mon Sep 17 00:00:00 2001 From: Donet Tom Date: Wed, 28 May 2025 12:18:03 -0500 Subject: drivers/base/node: rename register_memory_blocks_under_node() and remove context argument The function register_memory_blocks_under_node() is now only called from the memory hotplug path, as register_memory_blocks_under_node_early() handles registration during early boot. Therefore, the context argument used to differentiate between early boot and hotplug is no longer needed and was removed. Since the function is only called from the hotplug path, we renamed register_memory_blocks_under_node() to register_memory_blocks_under_node_hotplug() Link: https://lkml.kernel.org/r/907c22292b0ee4975107876efc875c75c11badd9.1748452242.git.donettom@linux.ibm.com Signed-off-by: Donet Tom Acked-by: Oscar Salvador Acked-by: Mike Rapoport (Microsoft) Acked-by: David Hildenbrand Acked-by: Zi Yan Signed-off-by: Andrew Morton --- drivers/base/node.c | 5 ++--- include/linux/node.h | 11 +++++------ mm/memory_hotplug.c | 5 ++--- 3 files changed, 9 insertions(+), 12 deletions(-) (limited to 'include/linux/node.h') diff --git a/drivers/base/node.c b/drivers/base/node.c index 24c32cdbe97d..302cb99853dd 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -839,9 +839,8 @@ static void register_memory_blocks_under_nodes(void) } } -void register_memory_blocks_under_node(int nid, unsigned long start_pfn, - unsigned long end_pfn, - enum meminit_context context) +void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn, + unsigned long end_pfn) { walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn), (void *)&nid, register_mem_block_under_node_hotplug); diff --git a/include/linux/node.h b/include/linux/node.h index b15de78e0408..75b036a100d2 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -111,13 +111,12 @@ struct memory_block; extern struct node *node_devices[]; #if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA) -void register_memory_blocks_under_node(int nid, unsigned long start_pfn, - unsigned long end_pfn, - enum meminit_context context); +void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn, + unsigned long end_pfn); #else -static inline void register_memory_blocks_under_node(int nid, unsigned long start_pfn, - unsigned long end_pfn, - enum meminit_context context) +static inline void register_memory_blocks_under_node_hotplug(int nid, + unsigned long start_pfn, + unsigned long end_pfn) { } static inline void register_memory_blocks_under_nodes(void) diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b1caedbade5b..c6ce1aba64e6 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1575,9 +1575,8 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) BUG_ON(ret); } - register_memory_blocks_under_node(nid, PFN_DOWN(start), - PFN_UP(start + size - 1), - MEMINIT_HOTPLUG); + register_memory_blocks_under_node_hotplug(nid, PFN_DOWN(start), + PFN_UP(start + size - 1)); /* create new memmap entry */ if (!strcmp(res->name, "System RAM")) -- cgit v1.2.3 From a5352f8a40a8d1b385abeca0b2cff5d2468e31a1 Mon Sep 17 00:00:00 2001 From: Donet Tom Date: Wed, 28 May 2025 12:18:04 -0500 Subject: drivers/base/node: rename __register_one_node() to register_one_node() The register_one_node() function was a simple wrapper around __register_one_node(). To simplify the code, register_one_node() has been removed, and __register_one_node() has been renamed to register_one_node(). Link: https://lkml.kernel.org/r/8262cd0f44eeb048a1fcd3ac8382760d7f7dea60.1748452242.git.donettom@linux.ibm.com Signed-off-by: Donet Tom Acked-by: David Hildenbrand Cc: Mike Rapoport (Microsoft) Cc: Oscar Salvador Cc: Zi Yan Signed-off-by: Andrew Morton --- arch/powerpc/platforms/pseries/pci_dlpar.c | 2 +- drivers/base/node.c | 4 ++-- include/linux/node.h | 13 +------------ mm/memory_hotplug.c | 2 +- 4 files changed, 5 insertions(+), 16 deletions(-) (limited to 'include/linux/node.h') diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c index 52e2623a741d..aeb8633a3d00 100644 --- a/arch/powerpc/platforms/pseries/pci_dlpar.c +++ b/arch/powerpc/platforms/pseries/pci_dlpar.c @@ -29,7 +29,7 @@ struct pci_controller *init_phb_dynamic(struct device_node *dn) nid = of_node_to_nid(dn); if (likely((nid) >= 0)) { if (!node_online(nid)) { - if (__register_one_node(nid)) { + if (register_one_node(nid)) { pr_err("PCI: Failed to register node %d\n", nid); } else { update_numa_distance(dn); diff --git a/drivers/base/node.c b/drivers/base/node.c index 302cb99853dd..6cbeca45c451 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -848,7 +848,7 @@ void register_memory_blocks_under_node_hotplug(int nid, unsigned long start_pfn, } #endif /* CONFIG_MEMORY_HOTPLUG */ -int __register_one_node(int nid) +int register_one_node(int nid) { int error; int cpu; @@ -955,7 +955,7 @@ void __init node_dev_init(void) * to already created cpu devices. */ for_each_online_node(i) { - ret = __register_one_node(i); + ret = register_one_node(i); if (ret) panic("%s() failed to add node: %d\n", __func__, ret); } diff --git a/include/linux/node.h b/include/linux/node.h index 75b036a100d2..88bceebcbfa5 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -128,14 +128,7 @@ extern void unregister_node(struct node *node); #ifdef CONFIG_NUMA extern void node_dev_init(void); /* Core of the node registration - only memory hotplug should use this */ -extern int __register_one_node(int nid); - -/* Registers an online node */ -static inline int register_one_node(int nid) -{ - return __register_one_node(nid); -} - +extern int register_one_node(int nid); extern void unregister_one_node(int nid); extern int register_cpu_under_node(unsigned int cpu, unsigned int nid); extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid); @@ -148,10 +141,6 @@ extern int register_memory_node_under_compute_node(unsigned int mem_nid, static inline void node_dev_init(void) { } -static inline int __register_one_node(int nid) -{ - return 0; -} static inline int register_one_node(int nid) { return 0; diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index c6ce1aba64e6..bec20a91e757 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1571,7 +1571,7 @@ int add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) * We online node here. We can't roll back from here. */ node_set_online(nid); - ret = __register_one_node(nid); + ret = register_one_node(nid); BUG_ON(ret); } -- cgit v1.2.3 From 67929de108479dbb78496b61af5c24072fc16d8d Mon Sep 17 00:00:00 2001 From: Oscar Salvador Date: Mon, 16 Jun 2025 15:51:46 +0200 Subject: mm,memory_hotplug: implement numa node notifier There are at least six consumers of hotplug_memory_notifier that what they really are interested in is whether any numa node changed its state, e.g: going from having memory to not having memory and vice versa. Implement a specific notifier for numa nodes when their state gets changed, which will later be used by those consumers that are only interested in numa node state changes. Add documentation as well. [dan.carpenter@linaro.org: set failure reason in offline_pages()] Link: https://lkml.kernel.org/r/be4fd31b-7d09-46b0-8329-6d0464ffa7a5@sabinyo.mountain Link: https://lkml.kernel.org/r/20250616135158.450136-4-osalvador@suse.de Signed-off-by: Oscar Salvador Signed-off-by: Dan Carpenter Reviewed-by: Jonathan Cameron Reviewed-by: Harry Yoo Reviewed-by: Vlastimil Babka Acked-by: David Hildenbrand Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Rakie Kim Signed-off-by: Andrew Morton --- Documentation/core-api/memory-hotplug.rst | 83 +++++++++++++++++ drivers/base/node.c | 21 +++++ include/linux/node.h | 40 ++++++++ mm/memory_hotplug.c | 146 ++++++++++++++---------------- 4 files changed, 210 insertions(+), 80 deletions(-) (limited to 'include/linux/node.h') diff --git a/Documentation/core-api/memory-hotplug.rst b/Documentation/core-api/memory-hotplug.rst index d1b8eb9add8a..fb84e78968b2 100644 --- a/Documentation/core-api/memory-hotplug.rst +++ b/Documentation/core-api/memory-hotplug.rst @@ -9,6 +9,9 @@ Memory hotplug event notifier Hotplugging events are sent to a notification queue. +Memory notifier +---------------- + There are six types of notification defined in ``include/linux/memory.h``: MEM_GOING_ONLINE @@ -68,6 +71,14 @@ The third argument (arg) passes a pointer of struct memory_notify:: If status_changed_nid* >= 0, callback should create/discard structures for the node if necessary. +It is possible to get notified for MEM_CANCEL_ONLINE without having been notified +for MEM_GOING_ONLINE, and the same applies to MEM_CANCEL_OFFLINE and +MEM_GOING_OFFLINE. +This can happen when a consumer fails, meaning we break the callchain and we +stop calling the remaining consumers of the notifier. +It is then important that users of memory_notify make no assumptions and get +prepared to handle such cases. + The callback routine shall return one of the values NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP defined in ``include/linux/notifier.h`` @@ -80,6 +91,78 @@ further processing of the notification queue. NOTIFY_STOP stops further processing of the notification queue. +Numa node notifier +------------------ + +There are six types of notification defined in ``include/linux/node.h``: + +NODE_ADDING_FIRST_MEMORY + Generated before memory becomes available to this node for the first time. + +NODE_CANCEL_ADDING_FIRST_MEMORY + Generated if NODE_ADDING_FIRST_MEMORY fails. + +NODE_ADDED_FIRST_MEMORY + Generated when memory has become available fo this node for the first time. + +NODE_REMOVING_LAST_MEMORY + Generated when the last memory available to this node is about to be offlined. + +NODE_CANCEL_REMOVING_LAST_MEMORY + Generated when NODE_CANCEL_REMOVING_LAST_MEMORY fails. + +NODE_REMOVED_LAST_MEMORY + Generated when the last memory available to this node has been offlined. + +A callback routine can be registered by calling:: + + hotplug_node_notifier(callback_func, priority) + +Callback functions with higher values of priority are called before callback +functions with lower values. + +A callback function must have the following prototype:: + + int callback_func( + + struct notifier_block *self, unsigned long action, void *arg); + +The first argument of the callback function (self) is a pointer to the block +of the notifier chain that points to the callback function itself. +The second argument (action) is one of the event types described above. +The third argument (arg) passes a pointer of struct node_notify:: + + struct node_notify { + int nid; + } + +- nid is the node we are adding or removing memory to. + +It is possible to get notified for NODE_CANCEL_ADDING_FIRST_MEMORY without +having been notified for NODE_ADDING_FIRST_MEMORY, and the same applies to +NODE_CANCEL_REMOVING_LAST_MEMORY and NODE_REMOVING_LAST_MEMORY. +This can happen when a consumer fails, meaning we break the callchain and we +stop calling the remaining consumers of the notifier. +It is then important that users of node_notify make no assumptions and get +prepared to handle such cases. + +The callback routine shall return one of the values +NOTIFY_DONE, NOTIFY_OK, NOTIFY_BAD, NOTIFY_STOP +defined in ``include/linux/notifier.h`` + +NOTIFY_DONE and NOTIFY_OK have no effect on the further processing. + +NOTIFY_BAD is used as response to the NODE_ADDING_FIRST_MEMORY, +NODE_REMOVING_LAST_MEMORY, NODE_ADDED_FIRST_MEMORY or +NODE_REMOVED_LAST_MEMORY action to cancel hotplugging. +It stops further processing of the notification queue. + +NOTIFY_STOP stops further processing of the notification queue. + +Please note that we should not fail for NODE_ADDED_FIRST_MEMORY / +NODE_REMOVED_FIRST_MEMORY, as memory_hotplug code cannot rollback at that +point anymore. + Locking Internals ================= diff --git a/drivers/base/node.c b/drivers/base/node.c index 6cbeca45c451..6d66382dae65 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -112,6 +112,27 @@ static const struct attribute_group *node_access_node_groups[] = { NULL, }; +#ifdef CONFIG_MEMORY_HOTPLUG +static BLOCKING_NOTIFIER_HEAD(node_chain); + +int register_node_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(&node_chain, nb); +} +EXPORT_SYMBOL(register_node_notifier); + +void unregister_node_notifier(struct notifier_block *nb) +{ + blocking_notifier_chain_unregister(&node_chain, nb); +} +EXPORT_SYMBOL(unregister_node_notifier); + +int node_notify(unsigned long val, void *v) +{ + return blocking_notifier_call_chain(&node_chain, val, v); +} +#endif + static void node_remove_accesses(struct node *node) { struct node_access_nodes *c, *cnext; diff --git a/include/linux/node.h b/include/linux/node.h index 88bceebcbfa5..2c7529335b21 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -125,6 +125,46 @@ static inline void register_memory_blocks_under_nodes(void) #endif extern void unregister_node(struct node *node); + +struct node_notify { + int nid; +}; + +#define NODE_ADDING_FIRST_MEMORY (1<<0) +#define NODE_ADDED_FIRST_MEMORY (1<<1) +#define NODE_CANCEL_ADDING_FIRST_MEMORY (1<<2) +#define NODE_REMOVING_LAST_MEMORY (1<<3) +#define NODE_REMOVED_LAST_MEMORY (1<<4) +#define NODE_CANCEL_REMOVING_LAST_MEMORY (1<<5) + +#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA) +extern int register_node_notifier(struct notifier_block *nb); +extern void unregister_node_notifier(struct notifier_block *nb); +extern int node_notify(unsigned long val, void *v); + +#define hotplug_node_notifier(fn, pri) ({ \ + static __meminitdata struct notifier_block fn##_node_nb =\ + { .notifier_call = fn, .priority = pri };\ + register_node_notifier(&fn##_node_nb); \ +}) +#else +static inline int register_node_notifier(struct notifier_block *nb) +{ + return 0; +} +static inline void unregister_node_notifier(struct notifier_block *nb) +{ +} +static inline int node_notify(unsigned long val, void *v) +{ + return 0; +} +static inline int hotplug_node_notifier(notifier_fn_t fn, int pri) +{ + return 0; +} +#endif + #ifdef CONFIG_NUMA extern void node_dev_init(void); /* Core of the node registration - only memory hotplug should use this */ diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index d11278c3840d..0d54a01985ba 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -35,6 +35,7 @@ #include #include #include +#include #include @@ -699,24 +700,6 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages) online_mem_sections(start_pfn, end_pfn); } -/* check which state of node_states will be changed when online memory */ -static void node_states_check_changes_online(unsigned long nr_pages, - struct zone *zone, struct memory_notify *arg) -{ - int nid = zone_to_nid(zone); - - arg->status_change_nid = NUMA_NO_NODE; - - if (!node_state(nid, N_MEMORY)) - arg->status_change_nid = nid; -} - -static void node_states_set_node(int node, struct memory_notify *arg) -{ - if (arg->status_change_nid >= 0) - node_set_state(node, N_MEMORY); -} - static void __meminit resize_zone_range(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages) { @@ -1167,11 +1150,18 @@ void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages) int online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone, struct memory_group *group) { - unsigned long flags; - int need_zonelists_rebuild = 0; + struct memory_notify mem_arg = { + .start_pfn = pfn, + .nr_pages = nr_pages, + .status_change_nid = NUMA_NO_NODE, + }; + struct node_notify node_arg = { + .nid = NUMA_NO_NODE, + }; const int nid = zone_to_nid(zone); + int need_zonelists_rebuild = 0; + unsigned long flags; int ret; - struct memory_notify arg; /* * {on,off}lining is constrained to full memory sections (or more @@ -1188,11 +1178,17 @@ int online_pages(unsigned long pfn, unsigned long nr_pages, /* associate pfn range with the zone */ move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE); - arg.start_pfn = pfn; - arg.nr_pages = nr_pages; - node_states_check_changes_online(nr_pages, zone, &arg); + if (!node_state(nid, N_MEMORY)) { + /* Adding memory to the node for the first time */ + node_arg.nid = nid; + mem_arg.status_change_nid = nid; + ret = node_notify(NODE_ADDING_FIRST_MEMORY, &node_arg); + ret = notifier_to_errno(ret); + if (ret) + goto failed_addition; + } - ret = memory_notify(MEM_GOING_ONLINE, &arg); + ret = memory_notify(MEM_GOING_ONLINE, &mem_arg); ret = notifier_to_errno(ret); if (ret) goto failed_addition; @@ -1218,7 +1214,8 @@ int online_pages(unsigned long pfn, unsigned long nr_pages, online_pages_range(pfn, nr_pages); adjust_present_page_count(pfn_to_page(pfn), group, nr_pages); - node_states_set_node(nid, &arg); + if (node_arg.nid >= 0) + node_set_state(nid, N_MEMORY); if (need_zonelists_rebuild) build_all_zonelists(NULL); @@ -1239,16 +1236,22 @@ int online_pages(unsigned long pfn, unsigned long nr_pages, kswapd_run(nid); kcompactd_run(nid); + if (node_arg.nid >= 0) + /* First memory added successfully. Notify consumers. */ + node_notify(NODE_ADDED_FIRST_MEMORY, &node_arg); + writeback_set_ratelimit(); - memory_notify(MEM_ONLINE, &arg); + memory_notify(MEM_ONLINE, &mem_arg); return 0; failed_addition: pr_debug("online_pages [mem %#010llx-%#010llx] failed\n", (unsigned long long) pfn << PAGE_SHIFT, (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1); - memory_notify(MEM_CANCEL_ONLINE, &arg); + memory_notify(MEM_CANCEL_ONLINE, &mem_arg); + if (node_arg.nid != NUMA_NO_NODE) + node_notify(NODE_CANCEL_ADDING_FIRST_MEMORY, &node_arg); remove_pfn_range_from_zone(zone, pfn, nr_pages); return ret; } @@ -1879,48 +1882,6 @@ static int __init cmdline_parse_movable_node(char *p) } early_param("movable_node", cmdline_parse_movable_node); -/* check which state of node_states will be changed when offline memory */ -static void node_states_check_changes_offline(unsigned long nr_pages, - struct zone *zone, struct memory_notify *arg) -{ - struct pglist_data *pgdat = zone->zone_pgdat; - unsigned long present_pages = 0; - enum zone_type zt; - - arg->status_change_nid = NUMA_NO_NODE; - - /* - * Check whether node_states[N_NORMAL_MEMORY] will be changed. - * If the memory to be offline is within the range - * [0..ZONE_NORMAL], and it is the last present memory there, - * the zones in that range will become empty after the offlining, - * thus we can determine that we need to clear the node from - * node_states[N_NORMAL_MEMORY]. - */ - for (zt = 0; zt <= ZONE_NORMAL; zt++) - present_pages += pgdat->node_zones[zt].present_pages; - - /* - * We have accounted the pages from [0..ZONE_NORMAL); ZONE_HIGHMEM - * does not apply as we don't support 32bit. - * Here we count the possible pages from ZONE_MOVABLE. - * If after having accounted all the pages, we see that the nr_pages - * to be offlined is over or equal to the accounted pages, - * we know that the node will become empty, and so, we can clear - * it for N_MEMORY as well. - */ - present_pages += pgdat->node_zones[ZONE_MOVABLE].present_pages; - - if (nr_pages >= present_pages) - arg->status_change_nid = zone_to_nid(zone); -} - -static void node_states_clear_node(int node, struct memory_notify *arg) -{ - if (arg->status_change_nid >= 0) - node_clear_state(node, N_MEMORY); -} - static int count_system_ram_pages_cb(unsigned long start_pfn, unsigned long nr_pages, void *data) { @@ -1936,11 +1897,19 @@ static int count_system_ram_pages_cb(unsigned long start_pfn, int offline_pages(unsigned long start_pfn, unsigned long nr_pages, struct zone *zone, struct memory_group *group) { - const unsigned long end_pfn = start_pfn + nr_pages; unsigned long pfn, managed_pages, system_ram_pages = 0; + const unsigned long end_pfn = start_pfn + nr_pages; + struct pglist_data *pgdat = zone->zone_pgdat; const int node = zone_to_nid(zone); + struct memory_notify mem_arg = { + .start_pfn = start_pfn, + .nr_pages = nr_pages, + .status_change_nid = NUMA_NO_NODE, + }; + struct node_notify node_arg = { + .nid = NUMA_NO_NODE, + }; unsigned long flags; - struct memory_notify arg; char *reason; int ret; @@ -1999,11 +1968,23 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, goto failed_removal_pcplists_disabled; } - arg.start_pfn = start_pfn; - arg.nr_pages = nr_pages; - node_states_check_changes_offline(nr_pages, zone, &arg); + /* + * Check whether the node will have no present pages after we offline + * 'nr_pages' more. If so, we know that the node will become empty, and + * so we will clear N_MEMORY for it. + */ + if (nr_pages >= pgdat->node_present_pages) { + node_arg.nid = node; + mem_arg.status_change_nid = node; + ret = node_notify(NODE_REMOVING_LAST_MEMORY, &node_arg); + ret = notifier_to_errno(ret); + if (ret) { + reason = "node notifier failure"; + goto failed_removal_isolated; + } + } - ret = memory_notify(MEM_GOING_OFFLINE, &arg); + ret = memory_notify(MEM_GOING_OFFLINE, &mem_arg); ret = notifier_to_errno(ret); if (ret) { reason = "notifier failure"; @@ -2083,27 +2064,32 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, * Make sure to mark the node as memory-less before rebuilding the zone * list. Otherwise this node would still appear in the fallback lists. */ - node_states_clear_node(node, &arg); + if (node_arg.nid >= 0) + node_clear_state(node, N_MEMORY); if (!populated_zone(zone)) { zone_pcp_reset(zone); build_all_zonelists(NULL); } - if (arg.status_change_nid >= 0) { + if (node_arg.nid >= 0) { kcompactd_stop(node); kswapd_stop(node); + /* Node went memoryless. Notify consumers */ + node_notify(NODE_REMOVED_LAST_MEMORY, &node_arg); } writeback_set_ratelimit(); - memory_notify(MEM_OFFLINE, &arg); + memory_notify(MEM_OFFLINE, &mem_arg); remove_pfn_range_from_zone(zone, start_pfn, nr_pages); return 0; failed_removal_isolated: /* pushback to free area */ undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE); - memory_notify(MEM_CANCEL_OFFLINE, &arg); + memory_notify(MEM_CANCEL_OFFLINE, &mem_arg); + if (node_arg.nid != NUMA_NO_NODE) + node_notify(NODE_CANCEL_REMOVING_LAST_MEMORY, &node_arg); failed_removal_pcplists_disabled: lru_cache_enable(); zone_pcp_enable(zone); -- cgit v1.2.3