diff options
| author | Alexander Graf <graf@amazon.com> | 2025-06-10 11:53:27 +0300 |
|---|---|---|
| committer | Andrew Morton <akpm@linux-foundation.org> | 2025-08-02 22:01:38 +0300 |
| commit | 07d24902977e4704fab8472981e73a0ad6dfa1fd (patch) | |
| tree | d5b37b1b5b2e62f8c4a7f6ab998b1aeb8f8eecab /include/linux | |
| parent | ed4f142f72a9191b8236778093074c277435bf8a (diff) | |
| download | linux-07d24902977e4704fab8472981e73a0ad6dfa1fd.tar.xz | |
kexec: enable CMA based contiguous allocation
When booting a new kernel with kexec_file, the kernel picks a target
location that the kernel should live at, then allocates random pages,
checks whether any of those patches magically happens to coincide with a
target address range and if so, uses them for that range.
For every page allocated this way, it then creates a page list that the
relocation code - code that executes while all CPUs are off and we are
just about to jump into the new kernel - copies to their final memory
location. We can not put them there before, because chances are pretty
good that at least some page in the target range is already in use by the
currently running Linux environment. Copying is happening from a single
CPU at RAM rate, which takes around 4-50 ms per 100 MiB.
All of this is inefficient and error prone.
To successfully kexec, we need to quiesce all devices of the outgoing
kernel so they don't scribble over the new kernel's memory. We have seen
cases where that does not happen properly (*cough* GIC *cough*) and hence
the new kernel was corrupted. This started a month long journey to root
cause failing kexecs to eventually see memory corruption, because the new
kernel was corrupted severely enough that it could not emit output to tell
us about the fact that it was corrupted. By allocating memory for the
next kernel from a memory range that is guaranteed scribbling free, we can
boot the next kernel up to a point where it is at least able to detect
corruption and maybe even stop it before it becomes severe. This
increases the chance for successful kexecs.
Since kexec got introduced, Linux has gained the CMA framework which can
perform physically contiguous memory mappings, while keeping that memory
available for movable memory when it is not needed for contiguous
allocations. The default CMA allocator is for DMA allocations.
This patch adds logic to the kexec file loader to attempt to place the
target payload at a location allocated from CMA. If successful, it uses
that memory range directly instead of creating copy instructions during
the hot phase. To ensure that there is a safety net in case anything goes
wrong with the CMA allocation, it also adds a flag for user space to force
disable CMA allocations.
Using CMA allocations has two advantages:
1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the
hot phase.
2) More robust. Even if by accident some page is still in use for DMA,
the new kernel image will be safe from that access because it resides
in a memory region that is considered allocated in the old kernel and
has a chance to reinitialize that component.
Link: https://lkml.kernel.org/r/20250610085327.51817-1-graf@amazon.com
Signed-off-by: Alexander Graf <graf@amazon.com>
Acked-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Zhongkun He <hezhongkun.hzk@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Diffstat (limited to 'include/linux')
| -rw-r--r-- | include/linux/kexec.h | 10 |
1 files changed, 10 insertions, 0 deletions
diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 03f85ad03025..1b10a5d84b68 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -79,6 +79,12 @@ extern note_buf_t __percpu *crash_notes; typedef unsigned long kimage_entry_t; +/* + * This is a copy of the UAPI struct kexec_segment and must be identical + * to it because it gets copied straight from user space into kernel + * memory. Do not modify this structure unless you change the way segments + * get ingested from user space. + */ struct kexec_segment { /* * This pointer can point to user memory if kexec_load() system @@ -172,6 +178,7 @@ int kexec_image_post_load_cleanup_default(struct kimage *image); * @buf_align: Minimum alignment needed. * @buf_min: The buffer can't be placed below this address. * @buf_max: The buffer can't be placed above this address. + * @cma: CMA page if the buffer is backed by CMA. * @top_down: Allocate from top of memory. * @random: Place the buffer at a random position. */ @@ -184,6 +191,7 @@ struct kexec_buf { unsigned long buf_align; unsigned long buf_min; unsigned long buf_max; + struct page *cma; bool top_down; #ifdef CONFIG_CRASH_DUMP bool random; @@ -340,6 +348,7 @@ struct kimage { unsigned long nr_segments; struct kexec_segment segment[KEXEC_SEGMENT_MAX]; + struct page *segment_cma[KEXEC_SEGMENT_MAX]; struct list_head control_pages; struct list_head dest_pages; @@ -361,6 +370,7 @@ struct kimage { */ unsigned int hotplug_support:1; #endif + unsigned int no_cma:1; #ifdef ARCH_HAS_KIMAGE_ARCH struct kimage_arch arch; |
