
Commit 19d5f2c

osalvadorvilardaga authored and paniakin-aws committed
mm,memory_hotplug: allocate memmap from the added memory range
Physical memory hotadd has to allocate a memmap (struct page array) for the newly added memory section. Currently, alloc_pages_node() is used for those allocations.

This has some disadvantages:

a) existing memory is consumed for that purpose (e.g. ~2MB per 128MB memory section on x86_64). This can even lead to extreme cases where the system goes OOM because the physically hotplugged memory depletes the available memory before it is onlined.
b) if the whole node is movable then we have off-node struct pages, which have performance drawbacks.
c) it might be that there are no PMD_ALIGNED chunks, so the memmap array gets populated with base pages.

This can be improved when CONFIG_SPARSEMEM_VMEMMAP is enabled.

Vmemmap page tables can map arbitrary memory. That means that we can reserve a part of the physically hotadded memory to back vmemmap page tables. This implementation uses the beginning of the hotplugged memory for that purpose.

There are some non-obvious things to consider though.

Vmemmap pages are allocated/freed during the memory hotplug events (add_memory_resource(), try_remove_memory()) when the memory is added/removed. This means that the reserved physical range is not online although it is used. The most obvious side effect is that pfn_to_online_page() returns NULL for those pfns. The current design expects that this should be OK, as the hotplugged memory is considered garbage until it is onlined. For example, hibernation would not save the content of those vmemmaps into the image, so it would not be restored on resume, but this should be OK as there is no real content to recover anyway, while the metadata is reachable from other data structures (e.g. vmemmap page tables).

The reserved space is therefore (de)initialized during the {on,off}line events (mhp_{de}init_memmap_on_memory). That is done by extracting page allocator independent initialization from the regular onlining path. The primary reason to handle the reserved space outside of {on,off}line_pages is to make each initialization specific to its purpose rather than special-casing them in a single function.

As per above, the functions that are introduced are:

- mhp_init_memmap_on_memory: initializes vmemmap pages by calling move_pfn_range_to_zone(), calls kasan_add_zero_shadow(), and onlines as many sections as the vmemmap pages fully span.
- mhp_deinit_memmap_on_memory: offlines as many sections as the vmemmap pages fully span, removes the range from the zone by remove_pfn_range_from_zone(), and calls kasan_remove_zero_shadow() for the range.

The new function memory_block_online() calls mhp_init_memmap_on_memory() before doing the actual online_pages(). Should online_pages() fail, we clean up by calling mhp_deinit_memmap_on_memory(). Adjusting of present_pages is done at the end, once we know that online_pages() succeeded.

On offline, memory_block_offline() needs to unaccount vmemmap pages from present_pages() before calling offline_pages(). This is necessary because offline_pages() tears down some structures based on whether the node or the zone become empty. If offline_pages() fails, we account the vmemmap pages back. If it succeeds, we call mhp_deinit_memmap_on_memory().

Hot-remove: we need to be careful when removing memory, as adding and removing memory need to be done with the same granularity. To check that this assumption is not violated, we check the memory range we want to remove, and if a) any memory block has vmemmap pages and b) the range spans more than a single memory block, we scream out loud and refuse to proceed.
If all is good and the range was using memmap on memory (aka vmemmap pages), we construct an altmap structure so free_hugepage_table does the right thing and calls vmem_altmap_free instead of free_pagetable.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Oscar Salvador <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
(cherry picked from commit a08a2ae)
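For context, a minimal caller-side sketch of how a hotplug path could opt into this behaviour. It assumes the pre-existing add_memory()/mhp_t interface (MEMHP_MERGE_RESOURCE is already defined in include/linux/memory_hotplug.h below, and MHP_NONE is assumed to mean "no special flags"); nid, start and size are placeholders, and this is not code from the patch itself:

#include <linux/memory_hotplug.h>

/*
 * Illustrative only: request memmap_on_memory for a hot-added range when
 * the kernel reports that the range can host its own memmap.
 */
static int example_hotadd(int nid, u64 start, u64 size)
{
	mhp_t mhp_flags = MHP_NONE;

	if (mhp_supports_memmap_on_memory(size))
		mhp_flags |= MHP_MEMMAP_ON_MEMORY;

	return add_memory(nid, start, size, mhp_flags);
}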
1 parent e232128 commit 19d5f2c

8 files changed (+250, −22 lines)

drivers/base/memory.c

Lines changed: 66 additions & 6 deletions

@@ -173,16 +173,73 @@ static int memory_block_online(struct memory_block *mem)
 {
 	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+	struct zone *zone;
+	int ret;
+
+	zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);
+
+	/*
+	 * Although vmemmap pages have a different lifecycle than the pages
+	 * they describe (they remain until the memory is unplugged), doing
+	 * their initialization and accounting at memory onlining/offlining
+	 * stage helps to keep accounting easier to follow - e.g vmemmaps
+	 * belong to the same zone as the memory they backed.
+	 */
+	if (nr_vmemmap_pages) {
+		ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone);
+		if (ret)
+			return ret;
+	}
+
+	ret = online_pages(start_pfn + nr_vmemmap_pages,
+			   nr_pages - nr_vmemmap_pages, zone);
+	if (ret) {
+		if (nr_vmemmap_pages)
+			mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
+		return ret;
+	}
+
+	/*
+	 * Account once onlining succeeded. If the zone was unpopulated, it is
+	 * now already properly populated.
+	 */
+	if (nr_vmemmap_pages)
+		adjust_present_page_count(zone, nr_vmemmap_pages);
 
-	return online_pages(start_pfn, nr_pages, mem->online_type, mem->nid);
+	return ret;
 }
 
 static int memory_block_offline(struct memory_block *mem)
 {
 	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+	struct zone *zone;
+	int ret;
+
+	zone = page_zone(pfn_to_page(start_pfn));
+
+	/*
+	 * Unaccount before offlining, such that unpopulated zone and kthreads
+	 * can properly be torn down in offline_pages().
+	 */
+	if (nr_vmemmap_pages)
+		adjust_present_page_count(zone, -nr_vmemmap_pages);
 
-	return offline_pages(start_pfn, nr_pages);
+	ret = offline_pages(start_pfn + nr_vmemmap_pages,
+			    nr_pages - nr_vmemmap_pages);
+	if (ret) {
+		/* offline_pages() failed. Account back. */
+		if (nr_vmemmap_pages)
+			adjust_present_page_count(zone, nr_vmemmap_pages);
+		return ret;
+	}
+
+	if (nr_vmemmap_pages)
+		mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
+
+	return ret;
 }
 
 /*
@@ -603,7 +660,8 @@ int register_memory(struct memory_block *memory)
 	return ret;
 }
 
-static int init_memory_block(unsigned long block_id, unsigned long state)
+static int init_memory_block(unsigned long block_id, unsigned long state,
+			     unsigned long nr_vmemmap_pages)
 {
 	struct memory_block *mem;
 	int ret = 0;
@@ -620,6 +678,7 @@ static int init_memory_block(unsigned long block_id, unsigned long state)
 	mem->start_section_nr = block_id * sections_per_block;
 	mem->state = state;
 	mem->nid = NUMA_NO_NODE;
+	mem->nr_vmemmap_pages = nr_vmemmap_pages;
 
 	ret = register_memory(mem);
 
@@ -639,7 +698,7 @@ static int add_memory_block(unsigned long base_section_nr)
 	if (section_count == 0)
 		return 0;
 	return init_memory_block(memory_block_id(base_section_nr),
-				 MEM_ONLINE);
+				 MEM_ONLINE, 0);
 }
 
 static void unregister_memory(struct memory_block *memory)
@@ -661,7 +720,8 @@ static void unregister_memory(struct memory_block *memory)
  *
  * Called under device_hotplug_lock.
  */
-int create_memory_block_devices(unsigned long start, unsigned long size)
+int create_memory_block_devices(unsigned long start, unsigned long size,
+				unsigned long vmemmap_pages)
 {
 	const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
 	unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
@@ -674,7 +734,7 @@ int create_memory_block_devices(unsigned long start, unsigned long size)
 		return -EINVAL;
 
 	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
-		ret = init_memory_block(block_id, MEM_OFFLINE);
+		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
 		if (ret)
 			break;
 	}
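The memory_block_online()/memory_block_offline() paths above rely on two helpers that live in mm/memory_hotplug.c, whose hunk is not reproduced in this excerpt. The following is only a rough sketch of their shape, based on the description in the commit message (the exact code in the patch may differ):

/* Sketch, not the patch's code: bring the vmemmap range itself into the zone. */
int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
			      struct zone *zone)
{
	unsigned long end_pfn = pfn + nr_pages;
	int ret;

	ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
	if (ret)
		return ret;

	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);

	/* Online only the sections that the vmemmap pages fully span. */
	if (nr_pages >= PAGES_PER_SECTION)
		online_mem_sections(pfn, ALIGN_DOWN(end_pfn, PAGES_PER_SECTION));

	return ret;
}

/* Sketch, not the patch's code: undo the above on offline. */
void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long end_pfn = pfn + nr_pages;

	/* Offline the sections that the vmemmap pages fully span. */
	if (nr_pages >= PAGES_PER_SECTION)
		offline_mem_sections(pfn, ALIGN_DOWN(end_pfn, PAGES_PER_SECTION));

	remove_pfn_range_from_zone(page_zone(pfn_to_page(pfn)), pfn, nr_pages);
	kasan_remove_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
}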

include/linux/memory.h

Lines changed: 7 additions & 1 deletion

@@ -29,6 +29,11 @@ struct memory_block {
 	int online_type;		/* for passing data to online routine */
 	int nid;			/* NID for this memory block */
 	struct device dev;
+	/*
+	 * Number of vmemmap pages. These pages
+	 * lay at the beginning of the memory block.
+	 */
+	unsigned long nr_vmemmap_pages;
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -80,7 +85,8 @@ static inline int memory_notify(unsigned long val, void *v)
 #else
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
-int create_memory_block_devices(unsigned long start, unsigned long size);
+int create_memory_block_devices(unsigned long start, unsigned long size,
+				unsigned long vmemmap_pages);
 void remove_memory_block_devices(unsigned long start, unsigned long size);
 extern void memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);

include/linux/memory_hotplug.h

Lines changed: 14 additions & 1 deletion

@@ -70,6 +70,14 @@ typedef int __bitwise mhp_t;
  */
 #define MEMHP_MERGE_RESOURCE	((__force mhp_t)BIT(0))
 
+/*
+ * We want memmap (struct page array) to be self contained.
+ * To do so, we will use the beginning of the hot-added range to build
+ * the page tables for the memmap array that describes the entire range.
+ * Only selected architectures support it with SPARSE_VMEMMAP.
+ */
+#define MHP_MEMMAP_ON_MEMORY	((__force mhp_t)BIT(1))
+
 /*
  * Extended parameters for memory hotplug:
  * altmap: alternative allocator for memmap array (optional)
@@ -111,9 +119,13 @@ static inline void zone_seqlock_init(struct zone *zone)
 extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
 extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
+extern void adjust_present_page_count(struct zone *zone, long nr_pages);
 /* VM interface that may be used by firmware interface */
+extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
+				     struct zone *zone);
+extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
-			int online_type, int nid);
+			struct zone *zone);
 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 					 unsigned long end_pfn);
 extern void __offline_isolated_pages(unsigned long start_pfn,
@@ -361,6 +373,7 @@ extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
 extern struct zone *zone_for_pfn_range(int online_type, int nid,
 				       unsigned long start_pfn, unsigned long nr_pages);
+extern bool mhp_supports_memmap_on_memory(unsigned long size);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
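mhp_supports_memmap_on_memory() is only declared here; its implementation lands in mm/memory_hotplug.c, which is not part of this excerpt. As an illustration of the kind of constraints such a check has to enforce, a hedged sketch assuming the hot-added range must be exactly one memory block, with a PMD-aligned memmap at its start and a pageblock-aligned remainder (not the patch's exact code):

/*
 * Illustrative only: memmap_on_memory is feasible when the range is a
 * single memory block, its struct page array fits in a PMD-aligned chunk
 * at the start of the range, and the remaining pages still form whole
 * pageblocks.
 */
bool mhp_supports_memmap_on_memory(unsigned long size)
{
	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
	unsigned long remaining_size = size - vmemmap_size;

	return IS_ENABLED(CONFIG_MHP_MEMMAP_ON_MEMORY) &&
	       size == memory_block_size_bytes() &&
	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
	       IS_ALIGNED(remaining_size, pageblock_nr_pages << PAGE_SHIFT);
}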

include/linux/memremap.h

Lines changed: 1 addition & 1 deletion

@@ -17,7 +17,7 @@ struct device;
  * @alloc: track pages consumed, private to vmemmap_populate()
  */
 struct vmem_altmap {
-	const unsigned long base_pfn;
+	unsigned long base_pfn;
 	const unsigned long end_pfn;
 	const unsigned long reserve;
 	unsigned long free;
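base_pfn loses its const qualifier because the hotplug path now declares a zero-initialized vmem_altmap on the stack and fills it in afterwards. A hedged sketch of that step; the field assignments mirror the struct above, but the surrounding add_memory_resource()/try_remove_memory() code is not shown in this excerpt and may differ:

#include <linux/memremap.h>
#include <linux/pfn.h>

/*
 * Illustrative only: prepare a vmem_altmap so the memmap for
 * [start, start + size) is carved out of the beginning of that same range.
 */
static void example_fill_altmap(struct vmem_altmap *altmap, u64 start, u64 size)
{
	altmap->base_pfn = PHYS_PFN(start);	/* first pfn of the hot-added range */
	altmap->free = PHYS_PFN(size);		/* pages available to back the memmap */
}

On hot-remove, the commit message says an altmap is rebuilt for the range as well, so that freeing the vmemmap page tables goes through vmem_altmap_free() rather than free_pagetable().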

include/linux/mmzone.h

Lines changed: 5 additions & 2 deletions

@@ -406,6 +406,11 @@ enum zone_type {
 	 *    techniques might use alloc_contig_range() to hide previously
 	 *    exposed pages from the buddy again (e.g., to implement some sort
 	 *    of memory unplug in virtio-mem).
+	 * 6. Memory-hotplug: when using memmap_on_memory and onlining the
+	 *    memory to the MOVABLE zone, the vmemmap pages are also placed in
+	 *    such zone. Such pages cannot be really moved around as they are
+	 *    self-stored in the range, but they are treated as movable when
+	 *    the range they describe is about to be offlined.
 	 *
 	 * In general, no unmovable allocations that degrade memory offlining
 	 * should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
@@ -1354,10 +1359,8 @@ static inline int online_section_nr(unsigned long nr)
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 void online_mem_sections(unsigned long start_pfn, unsigned long end_pfn);
-#ifdef CONFIG_MEMORY_HOTREMOVE
 void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn);
 #endif
-#endif
 
 static inline struct mem_section *__pfn_to_section(unsigned long pfn)
 {

mm/Kconfig

Lines changed: 5 additions & 0 deletions

@@ -183,6 +183,11 @@ config MEMORY_HOTREMOVE
 	depends on MEMORY_HOTPLUG && ARCH_ENABLE_MEMORY_HOTREMOVE
 	depends on MIGRATION
 
+config MHP_MEMMAP_ON_MEMORY
+	def_bool y
+	depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP
+	depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+
 # Heavily threaded applications may benefit from splitting the mm-wide
 # page_table_lock, so that faults on different parts of the user address
 # space can be handled with less contention: split it at this NR_CPUS.
