From dcdfdd40fa82b6704d2841938e5c8ec3051eb0d6 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov"
Date: Tue, 6 Jun 2023 17:26:29 +0300
Subject: mm: Add support for unaccepted memory

UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.

There are several ways the kernel can deal with unaccepted memory:

 1. Accept all the memory during boot. It is easy to implement and it
    doesn't have runtime cost once the system is booted. The downside is
    very long boot time.

    Accept can be parallelized to multiple CPUs to keep it manageable
    (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
    memory bandwidth and does not scale beyond a certain point.

 2. Accept a block of memory on the first use. It requires more
    infrastructure and changes in page allocator to make it work, but
    it provides good boot time.

    On-demand memory accept means latency spikes every time the kernel
    steps onto a new memory block. The spikes will go away once the
    workload data set size gets stabilized or all memory gets accepted.

 3. Accept all memory in background. Introduce a thread (or multiple)
    that gets memory accepted proactively. It will minimize the time the
    system experiences latency spikes on memory allocation while keeping
    boot time low.

    This approach cannot function on its own. It is an extension of #2:
    background memory acceptance requires a functional scheduler, but
    the page allocator may need to tap into unaccepted memory before
    that.

    The downside of the approach is that these threads also steal CPU
    cycles and memory bandwidth from the user's workload and may hurt
    user experience.

Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager in kernel command line. #3 can be
implemented later based on user demand.
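
The difference between approaches #1 and #2 can be illustrated with a
small user-space model. This is only a sketch under stated assumptions:
`struct mem_model`, `accept_block()`, `accept_all_at_boot()` and
`use_block()` are hypothetical names made up for illustration, not
kernel interfaces, and the real acceptance step is a platform-specific
protocol rather than a flag update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_BLOCKS 8	/* hypothetical number of memory blocks */

/* Toy model of guest memory split into blocks (illustration only). */
struct mem_model {
	bool accepted[NR_BLOCKS];	/* true once a block has been accepted */
	size_t accept_calls;		/* how often the costly accept step ran */
};

/* Stand-in for the platform-specific acceptance protocol. */
void accept_block(struct mem_model *m, size_t block)
{
	assert(block < NR_BLOCKS);
	m->accepted[block] = true;
	m->accept_calls++;
}

/* Approach #1: pay the whole acceptance cost up front, during boot. */
void accept_all_at_boot(struct mem_model *m)
{
	for (size_t i = 0; i < NR_BLOCKS; i++) {
		if (!m->accepted[i])
			accept_block(m, i);
	}
}

/*
 * Approach #2: accept a block lazily on first use.
 * Returns true if this call had to pay the acceptance cost, which is
 * where the latency spike described above would occur.
 */
bool use_block(struct mem_model *m, size_t block)
{
	assert(block < NR_BLOCKS);
	if (m->accepted[block])
		return false;
	accept_block(m, block);
	return true;
}
```

In this model only the first `use_block()` call on a given block pays
the cost, while after `accept_all_at_boot()` no later use ever does.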

Support of unaccepted memory requires a few changes in core-mm code:

  - memblock accepts memory on allocation. It serves early boot memory
    allocations and doesn't limit them to pre-accepted pool of memory.

  - page allocator accepts memory on the first allocation of the page.
    When kernel runs out of accepted memory, it accepts memory until the
    high watermark is reached. It helps to minimize fragmentation.

EFI code will provide two helpers if the platform supports unaccepted
memory:

 - accept_memory() makes a range of physical addresses accepted.

 - range_contains_unaccepted_memory() checks whether anything within the
   range of physical addresses requires acceptance.

Signed-off-by: Kirill A. Shutemov
Signed-off-by: Borislav Petkov (AMD)
Reviewed-by: Vlastimil Babka
Acked-by: Mike Rapoport # memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com
---
 mm/vmstat.c | 3 +++
 1 file changed, 3 insertions(+)

(limited to 'mm/vmstat.c')

diff --git a/mm/vmstat.c b/mm/vmstat.c
index c28046371b45..282349cabf01 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1180,6 +1180,9 @@ const char * const vmstat_text[] = {
 	"nr_zspages",
 #endif
 	"nr_free_cma",
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	"nr_unaccepted",
+#endif
 
 	/* enum numa_stat_item counters */
 #ifdef CONFIG_NUMA
-- 
cgit v1.2.3

From be5e015d107d5336f298b74ea5a4f0b1773bc6f9 Mon Sep 17 00:00:00 2001
From: Marcelo Tosatti
Date: Wed, 7 Jun 2023 17:28:07 -0300
Subject: vmstat: skip periodic vmstat update for isolated CPUs

Problem: The interruption caused by vmstat_update is undesirable for
certain applications.

This affects workloads running on isolated CPUs in nohz full mode,
which is meant to shield them from any kernel interruption. For
example, a VM running a time sensitive application with a 50us maximum
acceptable interruption (use case: soft PLC).

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the
oslat -> kworker -> oslat switches. In the case of a virtualized CPU,
where the vmstat_update interruption happens in the host (of a qemu-kvm
vcpu), the latency penalty observed in the guest is higher than 50us,
violating the acceptable latency threshold.

The isolated vCPU can perform operations that modify per-CPU page
counters, for example to complete I/O operations:

CPU 11/KVM-9540 [001] dNh1. 2314.248584: mod_zone_page_state <-__folio_end_writeback
CPU 11/KVM-9540 [001] dNh1. 2314.248585:
 => 0xffffffffc042b083
 => mod_zone_page_state
 => __folio_end_writeback
 => folio_end_writeback
 => iomap_finish_ioend
 => blk_mq_end_request_batch
 => nvme_irq
 => __handle_irq_event_percpu
 => handle_irq_event
 => handle_edge_irq
 => __common_interrupt
 => common_interrupt
 => asm_common_interrupt
 => vmx_do_interrupt_nmi_irqoff
 => vmx_handle_exit_irqoff
 => vcpu_enter_guest
 => vcpu_run
 => kvm_arch_vcpu_ioctl_run
 => kvm_vcpu_ioctl
 => __x64_sys_ioctl
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

In-kernel users of vmstat counters either require the precise value, in
which case they use the zone_page_state_snapshot interface, or they can
live with an imprecision, as the regular flushing can happen at an
arbitrary time and the cumulative error can grow (see
calculate_normal_threshold).

From that POV the regular flushing can be postponed for CPUs that have
been isolated from the kernel interference without critical
infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
for all isolated CPUs to avoid interference with the isolated workload.

Suggested by Michal Hocko.

Link: https://lkml.kernel.org/r/ZIDoV/zxFKVmQl7W@tpad
Signed-off-by: Marcelo Tosatti
Acked-by: Michal Hocko
Cc: Frederic Weisbecker
Cc: Vlastimil Babka
Signed-off-by: Andrew Morton
---
 mm/vmstat.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

(limited to 'mm/vmstat.c')

diff --git a/mm/vmstat.c b/mm/vmstat.c
index c28046371b45..c7499e3ee9d5 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include
 
 #include "internal.h"
 
@@ -2022,6 +2023,20 @@ static void vmstat_shepherd(struct work_struct *w)
 	for_each_online_cpu(cpu) {
 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
 
+		/*
+		 * In kernel users of vmstat counters either require the precise value and
+		 * they are using zone_page_state_snapshot interface or they can live with
+		 * an imprecision as the regular flushing can happen at arbitrary time and
+		 * cumulative error can grow (see calculate_normal_threshold).
+		 *
+		 * From that POV the regular flushing can be postponed for CPUs that have
+		 * been isolated from the kernel interference without critical
+		 * infrastructure ever noticing. Skip regular flushing from vmstat_shepherd
+		 * for all isolated CPUs to avoid interference with the isolated workload.
+		 */
+		if (cpu_is_isolated(cpu))
+			continue;
+
 		if (!delayed_work_pending(dw) && need_update(cpu))
 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
 
-- 
cgit v1.2.3
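
The effect of the vmstat_shepherd change can be modelled outside the
kernel. The following is a minimal user-space sketch, not the kernel
code: `struct cpu_state` and `shepherd_pass()` are hypothetical names,
and the boolean fields merely stand in for what cpu_is_isolated(),
delayed_work_pending() and need_update() report in the real loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU state; stands in for the kernel's per-CPU data. */
struct cpu_state {
	bool online;		/* CPU is online */
	bool isolated;		/* models cpu_is_isolated() */
	bool work_pending;	/* models delayed_work_pending() */
	bool needs_update;	/* models need_update(): unflushed deltas exist */
};

/*
 * Model of one shepherd pass: queue a flush on every online,
 * non-isolated CPU that has unflushed counters and no work queued yet.
 * Returns the number of CPUs on which work was queued.
 */
size_t shepherd_pass(struct cpu_state *cpus, size_t ncpus)
{
	size_t queued = 0;

	assert(cpus != NULL || ncpus == 0);

	for (size_t cpu = 0; cpu < ncpus; cpu++) {
		struct cpu_state *c = &cpus[cpu];

		if (!c->online)
			continue;

		/* The new check: never disturb an isolated CPU. */
		if (c->isolated)
			continue;

		if (!c->work_pending && c->needs_update) {
			c->work_pending = true;	/* models queue_delayed_work_on() */
			queued++;
		}
	}

	return queued;
}
```

In this model an isolated CPU keeps its unflushed deltas indefinitely,
which matches the commit message's argument that callers needing exact
values already read them through a snapshot interface.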