From daec8d408308ee7322d86cdd2dc3332e9cdbedf9 Mon Sep 17 00:00:00 2001 From: Paolo Bonzini Date: Tue, 22 Mar 2022 12:07:10 +0100 Subject: Documentation: KVM: add separate directories for architecture-specific documentation ARM already has an arm/ subdirectory, but s390 and x86 do not even though they have a relatively large number of files specific to them. Create new directories in Documentation/virt/kvm for these two architectures as well. While at it, group the API documentation and the developer documentation in the table of contents. Signed-off-by: Paolo Bonzini Message-Id: <20220322110712.222449-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini --- Documentation/virt/kvm/amd-memory-encryption.rst | 445 -------------- Documentation/virt/kvm/cpuid.rst | 124 ---- Documentation/virt/kvm/halt-polling.rst | 140 ----- Documentation/virt/kvm/hypercalls.rst | 192 ------ Documentation/virt/kvm/index.rst | 26 +- Documentation/virt/kvm/mmu.rst | 480 --------------- Documentation/virt/kvm/msr.rst | 391 ------------- Documentation/virt/kvm/nested-vmx.rst | 244 -------- Documentation/virt/kvm/running-nested-guests.rst | 276 --------- Documentation/virt/kvm/s390-diag.rst | 119 ---- Documentation/virt/kvm/s390-pv-boot.rst | 84 --- Documentation/virt/kvm/s390-pv.rst | 116 ---- Documentation/virt/kvm/s390/index.rst | 12 + Documentation/virt/kvm/s390/s390-diag.rst | 119 ++++ Documentation/virt/kvm/s390/s390-pv-boot.rst | 84 +++ Documentation/virt/kvm/s390/s390-pv.rst | 116 ++++ Documentation/virt/kvm/timekeeping.rst | 645 --------------------- .../virt/kvm/x86/amd-memory-encryption.rst | 445 ++++++++++++++ Documentation/virt/kvm/x86/cpuid.rst | 124 ++++ Documentation/virt/kvm/x86/halt-polling.rst | 140 +++++ Documentation/virt/kvm/x86/hypercalls.rst | 192 ++++++ Documentation/virt/kvm/x86/index.rst | 18 + Documentation/virt/kvm/x86/mmu.rst | 480 +++++++++++++++ Documentation/virt/kvm/x86/msr.rst | 391 +++++++++++++ Documentation/virt/kvm/x86/nested-vmx.rst | 244 ++++++++ .../virt/kvm/x86/running-nested-guests.rst | 276 +++++++++ Documentation/virt/kvm/x86/timekeeping.rst | 645 +++++++++++++++++++++ 27 files changed, 3293 insertions(+), 3275 deletions(-) delete mode 100644 Documentation/virt/kvm/amd-memory-encryption.rst delete mode 100644 Documentation/virt/kvm/cpuid.rst delete mode 100644 Documentation/virt/kvm/halt-polling.rst delete mode 100644 Documentation/virt/kvm/hypercalls.rst delete mode 100644 Documentation/virt/kvm/mmu.rst delete mode 100644 Documentation/virt/kvm/msr.rst delete mode 100644 Documentation/virt/kvm/nested-vmx.rst delete mode 100644 Documentation/virt/kvm/running-nested-guests.rst delete mode 100644 Documentation/virt/kvm/s390-diag.rst delete mode 100644 Documentation/virt/kvm/s390-pv-boot.rst delete mode 100644 Documentation/virt/kvm/s390-pv.rst create mode 100644 Documentation/virt/kvm/s390/index.rst create mode 100644 Documentation/virt/kvm/s390/s390-diag.rst create mode 100644 Documentation/virt/kvm/s390/s390-pv-boot.rst create mode 100644 Documentation/virt/kvm/s390/s390-pv.rst delete mode 100644 Documentation/virt/kvm/timekeeping.rst create mode 100644 Documentation/virt/kvm/x86/amd-memory-encryption.rst create mode 100644 Documentation/virt/kvm/x86/cpuid.rst create mode 100644 Documentation/virt/kvm/x86/halt-polling.rst create mode 100644 Documentation/virt/kvm/x86/hypercalls.rst create mode 100644 Documentation/virt/kvm/x86/index.rst create mode 100644 Documentation/virt/kvm/x86/mmu.rst create mode 100644 Documentation/virt/kvm/x86/msr.rst create mode 100644 
Documentation/virt/kvm/x86/nested-vmx.rst create mode 100644 Documentation/virt/kvm/x86/running-nested-guests.rst create mode 100644 Documentation/virt/kvm/x86/timekeeping.rst diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/amd-memory-encryption.rst deleted file mode 100644 index 1c6847fff304..000000000000 --- a/Documentation/virt/kvm/amd-memory-encryption.rst +++ /dev/null @@ -1,445 +0,0 @@ -====================================== -Secure Encrypted Virtualization (SEV) -====================================== - -Overview -======== - -Secure Encrypted Virtualization (SEV) is a feature found on AMD processors. - -SEV is an extension to the AMD-V architecture which supports running -virtual machines (VMs) under the control of a hypervisor. When enabled, -the memory contents of a VM will be transparently encrypted with a key -unique to that VM. - -The hypervisor can determine the SEV support through the CPUID -instruction. The CPUID function 0x8000001f reports information related -to SEV:: - - 0x8000001f[eax]: - Bit[1] indicates support for SEV - ... - [ecx]: - Bits[31:0] Number of encrypted guests supported simultaneously - -If support for SEV is present, MSR 0xc001_0010 (MSR_AMD64_SYSCFG) and MSR 0xc001_0015 -(MSR_K7_HWCR) can be used to determine if it can be enabled:: - - 0xc001_0010: - Bit[23] 1 = memory encryption can be enabled - 0 = memory encryption can not be enabled - - 0xc001_0015: - Bit[0] 1 = memory encryption can be enabled - 0 = memory encryption can not be enabled - -When SEV support is available, it can be enabled in a specific VM by -setting the SEV bit before executing VMRUN.:: - - VMCB[0x90]: - Bit[1] 1 = SEV is enabled - 0 = SEV is disabled - -SEV hardware uses ASIDs to associate a memory encryption key with a VM. -Hence, the ASID for the SEV-enabled guests must be from 1 to a maximum value -defined in the CPUID 0x8000001f[ecx] field. - -SEV Key Management -================== - -The SEV guest key management is handled by a separate processor called the AMD -Secure Processor (AMD-SP). Firmware running inside the AMD-SP provides a secure -key management interface to perform common hypervisor activities such as -encrypting bootstrap code, snapshot, migrating and debugging the guest. For more -information, see the SEV Key Management spec [api-spec]_ - -The main ioctl to access SEV is KVM_MEMORY_ENCRYPT_OP. If the argument -to KVM_MEMORY_ENCRYPT_OP is NULL, the ioctl returns 0 if SEV is enabled -and ``ENOTTY` if it is disabled (on some older versions of Linux, -the ioctl runs normally even with a NULL argument, and therefore will -likely return ``EFAULT``). If non-NULL, the argument to KVM_MEMORY_ENCRYPT_OP -must be a struct kvm_sev_cmd:: - - struct kvm_sev_cmd { - __u32 id; - __u64 data; - __u32 error; - __u32 sev_fd; - }; - - -The ``id`` field contains the subcommand, and the ``data`` field points to -another struct containing arguments specific to command. The ``sev_fd`` -should point to a file descriptor that is opened on the ``/dev/sev`` -device, if needed (see individual commands). - -On output, ``error`` is zero on success, or an error code. Error codes -are defined in ````. - -KVM implements the following commands to support common lifecycle events of SEV -guests, such as launching, running, snapshotting, migrating and decommissioning. - -1. KVM_SEV_INIT ---------------- - -The KVM_SEV_INIT command is used by the hypervisor to initialize the SEV platform -context. 
In a typical workflow, this command should be the first command issued. - -The firmware can be initialized either by using its own non-volatile storage or -the OS can manage the NV storage for the firmware using the module parameter -``init_ex_path``. The file specified by ``init_ex_path`` must exist. To create -a new NV storage file allocate the file with 32KB bytes of 0xFF as required by -the SEV spec. - -Returns: 0 on success, -negative on error - -2. KVM_SEV_LAUNCH_START ------------------------ - -The KVM_SEV_LAUNCH_START command is used for creating the memory encryption -context. To create the encryption context, user must provide a guest policy, -the owner's public Diffie-Hellman (PDH) key and session information. - -Parameters: struct kvm_sev_launch_start (in/out) - -Returns: 0 on success, -negative on error - -:: - - struct kvm_sev_launch_start { - __u32 handle; /* if zero then firmware creates a new handle */ - __u32 policy; /* guest's policy */ - - __u64 dh_uaddr; /* userspace address pointing to the guest owner's PDH key */ - __u32 dh_len; - - __u64 session_addr; /* userspace address which points to the guest session information */ - __u32 session_len; - }; - -On success, the 'handle' field contains a new handle and on error, a negative value. - -KVM_SEV_LAUNCH_START requires the ``sev_fd`` field to be valid. - -For more details, see SEV spec Section 6.2. - -3. KVM_SEV_LAUNCH_UPDATE_DATA ------------------------------ - -The KVM_SEV_LAUNCH_UPDATE_DATA is used for encrypting a memory region. It also -calculates a measurement of the memory contents. The measurement is a signature -of the memory contents that can be sent to the guest owner as an attestation -that the memory was encrypted correctly by the firmware. - -Parameters (in): struct kvm_sev_launch_update_data - -Returns: 0 on success, -negative on error - -:: - - struct kvm_sev_launch_update { - __u64 uaddr; /* userspace address to be encrypted (must be 16-byte aligned) */ - __u32 len; /* length of the data to be encrypted (must be 16-byte aligned) */ - }; - -For more details, see SEV spec Section 6.3. - -4. KVM_SEV_LAUNCH_MEASURE -------------------------- - -The KVM_SEV_LAUNCH_MEASURE command is used to retrieve the measurement of the -data encrypted by the KVM_SEV_LAUNCH_UPDATE_DATA command. The guest owner may -wait to provide the guest with confidential information until it can verify the -measurement. Since the guest owner knows the initial contents of the guest at -boot, the measurement can be verified by comparing it to what the guest owner -expects. - -If len is zero on entry, the measurement blob length is written to len and -uaddr is unused. - -Parameters (in): struct kvm_sev_launch_measure - -Returns: 0 on success, -negative on error - -:: - - struct kvm_sev_launch_measure { - __u64 uaddr; /* where to copy the measurement */ - __u32 len; /* length of measurement blob */ - }; - -For more details on the measurement verification flow, see SEV spec Section 6.4. - -5. KVM_SEV_LAUNCH_FINISH ------------------------- - -After completion of the launch flow, the KVM_SEV_LAUNCH_FINISH command can be -issued to make the guest ready for the execution. - -Returns: 0 on success, -negative on error - -6. KVM_SEV_GUEST_STATUS ------------------------ - -The KVM_SEV_GUEST_STATUS command is used to retrieve status information about a -SEV-enabled guest. 
- -Parameters (out): struct kvm_sev_guest_status - -Returns: 0 on success, -negative on error - -:: - - struct kvm_sev_guest_status { - __u32 handle; /* guest handle */ - __u32 policy; /* guest policy */ - __u8 state; /* guest state (see enum below) */ - }; - -SEV guest state: - -:: - - enum { - SEV_STATE_INVALID = 0; - SEV_STATE_LAUNCHING, /* guest is currently being launched */ - SEV_STATE_SECRET, /* guest is being launched and ready to accept the ciphertext data */ - SEV_STATE_RUNNING, /* guest is fully launched and running */ - SEV_STATE_RECEIVING, /* guest is being migrated in from another SEV machine */ - SEV_STATE_SENDING /* guest is getting migrated out to another SEV machine */ - }; - -7. KVM_SEV_DBG_DECRYPT ----------------------- - -The KVM_SEV_DEBUG_DECRYPT command can be used by the hypervisor to request the -firmware to decrypt the data at the given memory region. - -Parameters (in): struct kvm_sev_dbg - -Returns: 0 on success, -negative on error - -:: - - struct kvm_sev_dbg { - __u64 src_uaddr; /* userspace address of data to decrypt */ - __u64 dst_uaddr; /* userspace address of destination */ - __u32 len; /* length of memory region to decrypt */ - }; - -The command returns an error if the guest policy does not allow debugging. - -8. KVM_SEV_DBG_ENCRYPT ----------------------- - -The KVM_SEV_DEBUG_ENCRYPT command can be used by the hypervisor to request the -firmware to encrypt the data at the given memory region. - -Parameters (in): struct kvm_sev_dbg - -Returns: 0 on success, -negative on error - -:: - - struct kvm_sev_dbg { - __u64 src_uaddr; /* userspace address of data to encrypt */ - __u64 dst_uaddr; /* userspace address of destination */ - __u32 len; /* length of memory region to encrypt */ - }; - -The command returns an error if the guest policy does not allow debugging. - -9. KVM_SEV_LAUNCH_SECRET ------------------------- - -The KVM_SEV_LAUNCH_SECRET command can be used by the hypervisor to inject secret -data after the measurement has been validated by the guest owner. - -Parameters (in): struct kvm_sev_launch_secret - -Returns: 0 on success, -negative on error - -:: - - struct kvm_sev_launch_secret { - __u64 hdr_uaddr; /* userspace address containing the packet header */ - __u32 hdr_len; - - __u64 guest_uaddr; /* the guest memory region where the secret should be injected */ - __u32 guest_len; - - __u64 trans_uaddr; /* the hypervisor memory region which contains the secret */ - __u32 trans_len; - }; - -10. KVM_SEV_GET_ATTESTATION_REPORT ----------------------------------- - -The KVM_SEV_GET_ATTESTATION_REPORT command can be used by the hypervisor to query the attestation -report containing the SHA-256 digest of the guest memory and VMSA passed through the KVM_SEV_LAUNCH -commands and signed with the PEK. The digest returned by the command should match the digest -used by the guest owner with the KVM_SEV_LAUNCH_MEASURE. - -If len is zero on entry, the measurement blob length is written to len and -uaddr is unused. - -Parameters (in): struct kvm_sev_attestation - -Returns: 0 on success, -negative on error - -:: - - struct kvm_sev_attestation_report { - __u8 mnonce[16]; /* A random mnonce that will be placed in the report */ - - __u64 uaddr; /* userspace address where the report should be copied */ - __u32 len; - }; - -11. KVM_SEV_SEND_START ----------------------- - -The KVM_SEV_SEND_START command can be used by the hypervisor to create an -outgoing guest encryption context. 
- -If session_len is zero on entry, the length of the guest session information is -written to session_len and all other fields are not used. - -Parameters (in): struct kvm_sev_send_start - -Returns: 0 on success, -negative on error - -:: - - struct kvm_sev_send_start { - __u32 policy; /* guest policy */ - - __u64 pdh_cert_uaddr; /* platform Diffie-Hellman certificate */ - __u32 pdh_cert_len; - - __u64 plat_certs_uaddr; /* platform certificate chain */ - __u32 plat_certs_len; - - __u64 amd_certs_uaddr; /* AMD certificate */ - __u32 amd_certs_len; - - __u64 session_uaddr; /* Guest session information */ - __u32 session_len; - }; - -12. KVM_SEV_SEND_UPDATE_DATA ----------------------------- - -The KVM_SEV_SEND_UPDATE_DATA command can be used by the hypervisor to encrypt the -outgoing guest memory region with the encryption context creating using -KVM_SEV_SEND_START. - -If hdr_len or trans_len are zero on entry, the length of the packet header and -transport region are written to hdr_len and trans_len respectively, and all -other fields are not used. - -Parameters (in): struct kvm_sev_send_update_data - -Returns: 0 on success, -negative on error - -:: - - struct kvm_sev_launch_send_update_data { - __u64 hdr_uaddr; /* userspace address containing the packet header */ - __u32 hdr_len; - - __u64 guest_uaddr; /* the source memory region to be encrypted */ - __u32 guest_len; - - __u64 trans_uaddr; /* the destination memory region */ - __u32 trans_len; - }; - -13. KVM_SEV_SEND_FINISH ------------------------- - -After completion of the migration flow, the KVM_SEV_SEND_FINISH command can be -issued by the hypervisor to delete the encryption context. - -Returns: 0 on success, -negative on error - -14. KVM_SEV_SEND_CANCEL ------------------------- - -After completion of SEND_START, but before SEND_FINISH, the source VMM can issue the -SEND_CANCEL command to stop a migration. This is necessary so that a cancelled -migration can restart with a new target later. - -Returns: 0 on success, -negative on error - -15. KVM_SEV_RECEIVE_START -------------------------- - -The KVM_SEV_RECEIVE_START command is used for creating the memory encryption -context for an incoming SEV guest. To create the encryption context, the user must -provide a guest policy, the platform public Diffie-Hellman (PDH) key and session -information. - -Parameters: struct kvm_sev_receive_start (in/out) - -Returns: 0 on success, -negative on error - -:: - - struct kvm_sev_receive_start { - __u32 handle; /* if zero then firmware creates a new handle */ - __u32 policy; /* guest's policy */ - - __u64 pdh_uaddr; /* userspace address pointing to the PDH key */ - __u32 pdh_len; - - __u64 session_uaddr; /* userspace address which points to the guest session information */ - __u32 session_len; - }; - -On success, the 'handle' field contains a new handle and on error, a negative value. - -For more details, see SEV spec Section 6.12. - -16. KVM_SEV_RECEIVE_UPDATE_DATA -------------------------------- - -The KVM_SEV_RECEIVE_UPDATE_DATA command can be used by the hypervisor to copy -the incoming buffers into the guest memory region with encryption context -created during the KVM_SEV_RECEIVE_START. 
- -Parameters (in): struct kvm_sev_receive_update_data - -Returns: 0 on success, -negative on error - -:: - - struct kvm_sev_launch_receive_update_data { - __u64 hdr_uaddr; /* userspace address containing the packet header */ - __u32 hdr_len; - - __u64 guest_uaddr; /* the destination guest memory region */ - __u32 guest_len; - - __u64 trans_uaddr; /* the incoming buffer memory region */ - __u32 trans_len; - }; - -17. KVM_SEV_RECEIVE_FINISH --------------------------- - -After completion of the migration flow, the KVM_SEV_RECEIVE_FINISH command can be -issued by the hypervisor to make the guest ready for execution. - -Returns: 0 on success, -negative on error - -References -========== - - -See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info. - -.. [white-paper] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf -.. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf -.. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34) -.. [kvm-forum] https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst deleted file mode 100644 index bda3e3e737d7..000000000000 --- a/Documentation/virt/kvm/cpuid.rst +++ /dev/null @@ -1,124 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -============== -KVM CPUID bits -============== - -:Author: Glauber Costa - -A guest running on a kvm host, can check some of its features using -cpuid. This is not always guaranteed to work, since userspace can -mask-out some, or even all KVM-related cpuid features before launching -a guest. - -KVM cpuid functions are: - -function: KVM_CPUID_SIGNATURE (0x40000000) - -returns:: - - eax = 0x40000001 - ebx = 0x4b4d564b - ecx = 0x564b4d56 - edx = 0x4d - -Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM". -The value in eax corresponds to the maximum cpuid function present in this leaf, -and will be updated if more functions are added in the future. -Note also that old hosts set eax value to 0x0. This should -be interpreted as if the value was 0x40000001. -This function queries the presence of KVM cpuid leafs. 
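As an illustration, a guest (or guest userspace) might probe this leaf roughly as follows. This is only a sketch: it uses GCC's ``<cpuid.h>`` ``__cpuid`` macro rather than the kernel's own cpuid wrappers, the helper name is made up for the example, and the check is only meaningful when actually running inside a guest (and, as noted above, userspace may have masked the KVM leaves out entirely)::

    #include <cpuid.h>
    #include <stdio.h>
    #include <string.h>

    #define KVM_CPUID_SIGNATURE 0x40000000

    /* Probe CPUID 0x40000000 and check for the "KVMKVMKVM" signature. */
    static int kvm_signature_present(unsigned int *max_leaf)
    {
        unsigned int eax, ebx, ecx, edx;
        char sig[13];

        __cpuid(KVM_CPUID_SIGNATURE, eax, ebx, ecx, edx);

        memcpy(sig + 0, &ebx, 4);
        memcpy(sig + 4, &ecx, 4);
        memcpy(sig + 8, &edx, 4);
        sig[12] = '\0';

        /* Old hosts report eax == 0; treat that as 0x40000001 as noted above. */
        *max_leaf = eax ? eax : KVM_CPUID_SIGNATURE + 1;

        return strcmp(sig, "KVMKVMKVM") == 0;
    }

    int main(void)
    {
        unsigned int max_leaf;

        if (kvm_signature_present(&max_leaf))
            printf("running on KVM, max KVM leaf 0x%x\n", max_leaf);
        else
            printf("KVM signature not found\n");
        return 0;
    }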
- -function: define KVM_CPUID_FEATURES (0x40000001) - -returns:: - - ebx, ecx - eax = an OR'ed group of (1 << flag) - -where ``flag`` is defined as below: - -================================== =========== ================================ -flag value meaning -================================== =========== ================================ -KVM_FEATURE_CLOCKSOURCE 0 kvmclock available at msrs - 0x11 and 0x12 - -KVM_FEATURE_NOP_IO_DELAY 1 not necessary to perform delays - on PIO operations - -KVM_FEATURE_MMU_OP 2 deprecated - -KVM_FEATURE_CLOCKSOURCE2 3 kvmclock available at msrs - 0x4b564d00 and 0x4b564d01 - -KVM_FEATURE_ASYNC_PF 4 async pf can be enabled by - writing to msr 0x4b564d02 - -KVM_FEATURE_STEAL_TIME 5 steal time can be enabled by - writing to msr 0x4b564d03 - -KVM_FEATURE_PV_EOI 6 paravirtualized end of interrupt - handler can be enabled by - writing to msr 0x4b564d04 - -KVM_FEATURE_PV_UNHALT 7 guest checks this feature bit - before enabling paravirtualized - spinlock support - -KVM_FEATURE_PV_TLB_FLUSH 9 guest checks this feature bit - before enabling paravirtualized - tlb flush - -KVM_FEATURE_ASYNC_PF_VMEXIT 10 paravirtualized async PF VM EXIT - can be enabled by setting bit 2 - when writing to msr 0x4b564d02 - -KVM_FEATURE_PV_SEND_IPI 11 guest checks this feature bit - before enabling paravirtualized - send IPIs - -KVM_FEATURE_POLL_CONTROL 12 host-side polling on HLT can - be disabled by writing - to msr 0x4b564d05. - -KVM_FEATURE_PV_SCHED_YIELD 13 guest checks this feature bit - before using paravirtualized - sched yield. - -KVM_FEATURE_ASYNC_PF_INT 14 guest checks this feature bit - before using the second async - pf control msr 0x4b564d06 and - async pf acknowledgment msr - 0x4b564d07. - -KVM_FEATURE_MSI_EXT_DEST_ID 15 guest checks this feature bit - before using extended destination - ID bits in MSI address bits 11-5. - -KVM_FEATURE_HC_MAP_GPA_RANGE 16 guest checks this feature bit before - using the map gpa range hypercall - to notify the page state change - -KVM_FEATURE_MIGRATION_CONTROL 17 guest checks this feature bit before - using MSR_KVM_MIGRATION_CONTROL - -KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side - per-cpu warps are expected in - kvmclock -================================== =========== ================================ - -:: - - edx = an OR'ed group of (1 << flag) - -Where ``flag`` here is defined as below: - -================== ============ ================================= -flag value meaning -================== ============ ================================= -KVM_HINTS_REALTIME 0 guest checks this feature bit to - determine that vCPUs are never - preempted for an unlimited time - allowing optimizations -================== ============ ================================= diff --git a/Documentation/virt/kvm/halt-polling.rst b/Documentation/virt/kvm/halt-polling.rst deleted file mode 100644 index 4922e4a15f18..000000000000 --- a/Documentation/virt/kvm/halt-polling.rst +++ /dev/null @@ -1,140 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -=========================== -The KVM halt polling system -=========================== - -The KVM halt polling system provides a feature within KVM whereby the latency -of a guest can, under some circumstances, be reduced by polling in the host -for some time period after the guest has elected to no longer run by cedeing. 
-That is, when a guest vcpu has ceded, or in the case of powerpc when all of the -vcpus of a single vcore have ceded, the host kernel polls for wakeup conditions -before giving up the cpu to the scheduler in order to let something else run. - -Polling provides a latency advantage in cases where the guest can be run again -very quickly by at least saving us a trip through the scheduler, normally on -the order of a few micro-seconds, although performance benefits are workload -dependant. In the event that no wakeup source arrives during the polling -interval or some other task on the runqueue is runnable the scheduler is -invoked. Thus halt polling is especially useful on workloads with very short -wakeup periods where the time spent halt polling is minimised and the time -savings of not invoking the scheduler are distinguishable. - -The generic halt polling code is implemented in: - - virt/kvm/kvm_main.c: kvm_vcpu_block() - -The powerpc kvm-hv specific case is implemented in: - - arch/powerpc/kvm/book3s_hv.c: kvmppc_vcore_blocked() - -Halt Polling Interval -===================== - -The maximum time for which to poll before invoking the scheduler, referred to -as the halt polling interval, is increased and decreased based on the perceived -effectiveness of the polling in an attempt to limit pointless polling. -This value is stored in either the vcpu struct: - - kvm_vcpu->halt_poll_ns - -or in the case of powerpc kvm-hv, in the vcore struct: - - kvmppc_vcore->halt_poll_ns - -Thus this is a per vcpu (or vcore) value. - -During polling if a wakeup source is received within the halt polling interval, -the interval is left unchanged. In the event that a wakeup source isn't -received during the polling interval (and thus schedule is invoked) there are -two options, either the polling interval and total block time[0] were less than -the global max polling interval (see module params below), or the total block -time was greater than the global max polling interval. - -In the event that both the polling interval and total block time were less than -the global max polling interval then the polling interval can be increased in -the hope that next time during the longer polling interval the wake up source -will be received while the host is polling and the latency benefits will be -received. The polling interval is grown in the function grow_halt_poll_ns() and -is multiplied by the module parameters halt_poll_ns_grow and -halt_poll_ns_grow_start. - -In the event that the total block time was greater than the global max polling -interval then the host will never poll for long enough (limited by the global -max) to wakeup during the polling interval so it may as well be shrunk in order -to avoid pointless polling. The polling interval is shrunk in the function -shrink_halt_poll_ns() and is divided by the module parameter -halt_poll_ns_shrink, or set to 0 iff halt_poll_ns_shrink == 0. - -It is worth noting that this adjustment process attempts to hone in on some -steady state polling interval but will only really do a good job for wakeups -which come at an approximately constant rate, otherwise there will be constant -adjustment of the polling interval. - -[0] total block time: - the time between when the halt polling function is - invoked and a wakeup source received (irrespective of - whether the scheduler is invoked within that function). 
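The adjustment policy described above can be summarised by the following sketch. It is illustrative only, assuming the semantics of the module parameters documented in the next section; the real logic lives in grow_halt_poll_ns()/shrink_halt_poll_ns() and differs in detail::

    /* Global tunables (module parameters, see the table below). */
    static unsigned int halt_poll_ns;               /* global max polling interval */
    static unsigned int halt_poll_ns_grow = 2;
    static unsigned int halt_poll_ns_grow_start = 10000;
    static unsigned int halt_poll_ns_shrink;        /* 0 means "reset to 0" */

    static void adjust_poll_interval(unsigned int *vcpu_poll_ns,
                                     unsigned long long block_ns,
                                     int woken_while_polling)
    {
        if (woken_while_polling)
            return;                                 /* polling paid off, leave it alone */

        if (block_ns > halt_poll_ns) {
            /* Polling could never have covered this block time: shrink. */
            if (halt_poll_ns_shrink)
                *vcpu_poll_ns /= halt_poll_ns_shrink;
            else
                *vcpu_poll_ns = 0;
        } else {
            /* Block was short enough that a longer poll might catch it: grow. */
            if (*vcpu_poll_ns == 0)
                *vcpu_poll_ns = halt_poll_ns_grow_start;
            else
                *vcpu_poll_ns *= halt_poll_ns_grow;
            if (*vcpu_poll_ns > halt_poll_ns)
                *vcpu_poll_ns = halt_poll_ns;       /* clamp to the global ceiling */
        }
    }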
- -Module Parameters -================= - -The kvm module has 3 tuneable module parameters to adjust the global max -polling interval as well as the rate at which the polling interval is grown and -shrunk. These variables are defined in include/linux/kvm_host.h and as module -parameters in virt/kvm/kvm_main.c, or arch/powerpc/kvm/book3s_hv.c in the -powerpc kvm-hv case. - -+-----------------------+---------------------------+-------------------------+ -|Module Parameter | Description | Default Value | -+-----------------------+---------------------------+-------------------------+ -|halt_poll_ns | The global max polling | KVM_HALT_POLL_NS_DEFAULT| -| | interval which defines | | -| | the ceiling value of the | | -| | polling interval for | (per arch value) | -| | each vcpu. | | -+-----------------------+---------------------------+-------------------------+ -|halt_poll_ns_grow | The value by which the | 2 | -| | halt polling interval is | | -| | multiplied in the | | -| | grow_halt_poll_ns() | | -| | function. | | -+-----------------------+---------------------------+-------------------------+ -|halt_poll_ns_grow_start| The initial value to grow | 10000 | -| | to from zero in the | | -| | grow_halt_poll_ns() | | -| | function. | | -+-----------------------+---------------------------+-------------------------+ -|halt_poll_ns_shrink | The value by which the | 0 | -| | halt polling interval is | | -| | divided in the | | -| | shrink_halt_poll_ns() | | -| | function. | | -+-----------------------+---------------------------+-------------------------+ - -These module parameters can be set from the debugfs files in: - - /sys/module/kvm/parameters/ - -Note: that these module parameters are system wide values and are not able to - be tuned on a per vm basis. - -Further Notes -============= - -- Care should be taken when setting the halt_poll_ns module parameter as a large value - has the potential to drive the cpu usage to 100% on a machine which would be almost - entirely idle otherwise. This is because even if a guest has wakeups during which very - little work is done and which are quite far apart, if the period is shorter than the - global max polling interval (halt_poll_ns) then the host will always poll for the - entire block time and thus cpu utilisation will go to 100%. - -- Halt polling essentially presents a trade off between power usage and latency and - the module parameters should be used to tune the affinity for this. Idle cpu time is - essentially converted to host kernel time with the aim of decreasing latency when - entering the guest. - -- Halt polling will only be conducted by the host when no other tasks are runnable on - that cpu, otherwise the polling will cease immediately and schedule will be invoked to - allow that other task to run. Thus this doesn't allow a guest to denial of service the - cpu. diff --git a/Documentation/virt/kvm/hypercalls.rst b/Documentation/virt/kvm/hypercalls.rst deleted file mode 100644 index e56fa8b9cfca..000000000000 --- a/Documentation/virt/kvm/hypercalls.rst +++ /dev/null @@ -1,192 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -=================== -Linux KVM Hypercall -=================== - -X86: - KVM Hypercalls have a three-byte sequence of either the vmcall or the vmmcall - instruction. The hypervisor can replace it with instructions that are - guaranteed to be supported. - - Up to four arguments may be passed in rbx, rcx, rdx, and rsi respectively. - The hypercall number should be placed in rax and the return value will be - placed in rax. 
No other registers will be clobbered unless explicitly stated - by the particular hypercall. - -S390: - R2-R7 are used for parameters 1-6. In addition, R1 is used for hypercall - number. The return value is written to R2. - - S390 uses diagnose instruction as hypercall (0x500) along with hypercall - number in R1. - - For further information on the S390 diagnose call as supported by KVM, - refer to Documentation/virt/kvm/s390-diag.rst. - -PowerPC: - It uses R3-R10 and hypercall number in R11. R4-R11 are used as output registers. - Return value is placed in R3. - - KVM hypercalls uses 4 byte opcode, that are patched with 'hypercall-instructions' - property inside the device tree's /hypervisor node. - For more information refer to Documentation/virt/kvm/ppc-pv.rst - -MIPS: - KVM hypercalls use the HYPCALL instruction with code 0 and the hypercall - number in $2 (v0). Up to four arguments may be placed in $4-$7 (a0-a3) and - the return value is placed in $2 (v0). - -KVM Hypercalls Documentation -============================ - -The template for each hypercall is: -1. Hypercall name. -2. Architecture(s) -3. Status (deprecated, obsolete, active) -4. Purpose - -1. KVM_HC_VAPIC_POLL_IRQ ------------------------- - -:Architecture: x86 -:Status: active -:Purpose: Trigger guest exit so that the host can check for pending - interrupts on reentry. - -2. KVM_HC_MMU_OP ----------------- - -:Architecture: x86 -:Status: deprecated. -:Purpose: Support MMU operations such as writing to PTE, - flushing TLB, release PT. - -3. KVM_HC_FEATURES ------------------- - -:Architecture: PPC -:Status: active -:Purpose: Expose hypercall availability to the guest. On x86 platforms, cpuid - used to enumerate which hypercalls are available. On PPC, either - device tree based lookup ( which is also what EPAPR dictates) - OR KVM specific enumeration mechanism (which is this hypercall) - can be used. - -4. KVM_HC_PPC_MAP_MAGIC_PAGE ----------------------------- - -:Architecture: PPC -:Status: active -:Purpose: To enable communication between the hypervisor and guest there is a - shared page that contains parts of supervisor visible register state. - The guest can map this shared page to access its supervisor register - through memory using this hypercall. - -5. KVM_HC_KICK_CPU ------------------- - -:Architecture: x86 -:Status: active -:Purpose: Hypercall used to wakeup a vcpu from HLT state -:Usage example: - A vcpu of a paravirtualized guest that is busywaiting in guest - kernel mode for an event to occur (ex: a spinlock to become available) can - execute HLT instruction once it has busy-waited for more than a threshold - time-interval. Execution of HLT instruction would cause the hypervisor to put - the vcpu to sleep until occurrence of an appropriate event. Another vcpu of the - same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall, - specifying APIC ID (a1) of the vcpu to be woken up. An additional argument (a0) - is used in the hypercall for future use. - - -6. KVM_HC_CLOCK_PAIRING ------------------------ -:Architecture: x86 -:Status: active -:Purpose: Hypercall used to synchronize host and guest clocks. - -Usage: - -a0: guest physical address where host copies -"struct kvm_clock_offset" structure. - -a1: clock_type, ATM only KVM_CLOCK_PAIRING_WALLCLOCK (0) -is supported (corresponding to the host's CLOCK_REALTIME clock). - - :: - - struct kvm_clock_pairing { - __s64 sec; - __s64 nsec; - __u64 tsc; - __u32 flags; - __u32 pad[9]; - }; - - Where: - * sec: seconds from clock_type clock. 
- * nsec: nanoseconds from clock_type clock. - * tsc: guest TSC value used to calculate sec/nsec pair - * flags: flags, unused (0) at the moment. - -The hypercall lets a guest compute a precise timestamp across -host and guest. The guest can use the returned TSC value to -compute the CLOCK_REALTIME for its clock, at the same instant. - -Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource, -or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK. - -6. KVM_HC_SEND_IPI ------------------- - -:Architecture: x86 -:Status: active -:Purpose: Send IPIs to multiple vCPUs. - -- a0: lower part of the bitmap of destination APIC IDs -- a1: higher part of the bitmap of destination APIC IDs -- a2: the lowest APIC ID in bitmap -- a3: APIC ICR - -The hypercall lets a guest send multicast IPIs, with at most 128 -128 destinations per hypercall in 64-bit mode and 64 vCPUs per -hypercall in 32-bit mode. The destinations are represented by a -bitmap contained in the first two arguments (a0 and a1). Bit 0 of -a0 corresponds to the APIC ID in the third argument (a2), bit 1 -corresponds to the APIC ID a2+1, and so on. - -Returns the number of CPUs to which the IPIs were delivered successfully. - -7. KVM_HC_SCHED_YIELD ---------------------- - -:Architecture: x86 -:Status: active -:Purpose: Hypercall used to yield if the IPI target vCPU is preempted - -a0: destination APIC ID - -:Usage example: When sending a call-function IPI-many to vCPUs, yield if - any of the IPI target vCPUs was preempted. - -8. KVM_HC_MAP_GPA_RANGE -------------------------- -:Architecture: x86 -:Status: active -:Purpose: Request KVM to map a GPA range with the specified attributes. - -a0: the guest physical address of the start page -a1: the number of (4kb) pages (must be contiguous in GPA space) -a2: attributes - - Where 'attributes' : - * bits 3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc... - * bit 4 - plaintext = 0, encrypted = 1 - * bits 63:5 - reserved (must be zero) - -**Implementation note**: this hypercall is implemented in userspace via -the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability -before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID. In -addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace -must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL. diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst index b6833c7bb474..e0a2c74e1043 100644 --- a/Documentation/virt/kvm/index.rst +++ b/Documentation/virt/kvm/index.rst @@ -8,25 +8,13 @@ KVM :maxdepth: 2 api - amd-memory-encryption - cpuid - halt-polling - hypercalls - locking - mmu - msr - nested-vmx - ppc-pv - s390-diag - s390-pv - s390-pv-boot - timekeeping - vcpu-requests - - review-checklist + devices/index arm/index + s390/index + ppc-pv + x86/index - devices/index - - running-nested-guests + locking + vcpu-requests + review-checklist diff --git a/Documentation/virt/kvm/mmu.rst b/Documentation/virt/kvm/mmu.rst deleted file mode 100644 index 5b1ebad24c77..000000000000 --- a/Documentation/virt/kvm/mmu.rst +++ /dev/null @@ -1,480 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -====================== -The x86 kvm shadow mmu -====================== - -The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible -for presenting a standard x86 mmu to the guest, while translating guest -physical addresses to host physical addresses. 
- -The mmu code attempts to satisfy the following requirements: - -- correctness: - the guest should not be able to determine that it is running - on an emulated mmu except for timing (we attempt to comply - with the specification, not emulate the characteristics of - a particular implementation such as tlb size) -- security: - the guest must not be able to touch host memory not assigned - to it -- performance: - minimize the performance penalty imposed by the mmu -- scaling: - need to scale to large memory and large vcpu guests -- hardware: - support the full range of x86 virtualization hardware -- integration: - Linux memory management code must be in control of guest memory - so that swapping, page migration, page merging, transparent - hugepages, and similar features work without change -- dirty tracking: - report writes to guest memory to enable live migration - and framebuffer-based displays -- footprint: - keep the amount of pinned kernel memory low (most memory - should be shrinkable) -- reliability: - avoid multipage or GFP_ATOMIC allocations - -Acronyms -======== - -==== ==================================================================== -pfn host page frame number -hpa host physical address -hva host virtual address -gfn guest frame number -gpa guest physical address -gva guest virtual address -ngpa nested guest physical address -ngva nested guest virtual address -pte page table entry (used also to refer generically to paging structure - entries) -gpte guest pte (referring to gfns) -spte shadow pte (referring to pfns) -tdp two dimensional paging (vendor neutral term for NPT and EPT) -==== ==================================================================== - -Virtual and real hardware supported -=================================== - -The mmu supports first-generation mmu hardware, which allows an atomic switch -of the current paging mode and cr3 during guest entry, as well as -two-dimensional paging (AMD's NPT and Intel's EPT). The emulated hardware -it exposes is the traditional 2/3/4 level x86 mmu, with support for global -pages, pae, pse, pse36, cr0.wp, and 1GB pages. Emulated hardware also -able to expose NPT capable hardware on NPT capable hosts. - -Translation -=========== - -The primary job of the mmu is to program the processor's mmu to translate -addresses for the guest. Different translations are required at different -times: - -- when guest paging is disabled, we translate guest physical addresses to - host physical addresses (gpa->hpa) -- when guest paging is enabled, we translate guest virtual addresses, to - guest physical addresses, to host physical addresses (gva->gpa->hpa) -- when the guest launches a guest of its own, we translate nested guest - virtual addresses, to nested guest physical addresses, to guest physical - addresses, to host physical addresses (ngva->ngpa->gpa->hpa) - -The primary challenge is to encode between 1 and 3 translations into hardware -that support only 1 (traditional) and 2 (tdp) translations. When the -number of required translations matches the hardware, the mmu operates in -direct mode; otherwise it operates in shadow mode (see below). - -Memory -====== - -Guest memory (gpa) is part of the user address space of the process that is -using kvm. Userspace defines the translation between guest addresses and user -addresses (gpa->hva); note that two gpas may alias to the same hva, but not -vice versa. - -These hvas may be backed using any method available to the host: anonymous -memory, file backed memory, and device memory. 
Memory might be paged by the -host at any time. - -Events -====== - -The mmu is driven by events, some from the guest, some from the host. - -Guest generated events: - -- writes to control registers (especially cr3) -- invlpg/invlpga instruction execution -- access to missing or protected translations - -Host generated events: - -- changes in the gpa->hpa translation (either through gpa->hva changes or - through hva->hpa changes) -- memory pressure (the shrinker) - -Shadow pages -============ - -The principal data structure is the shadow page, 'struct kvm_mmu_page'. A -shadow page contains 512 sptes, which can be either leaf or nonleaf sptes. A -shadow page may contain a mix of leaf and nonleaf sptes. - -A nonleaf spte allows the hardware mmu to reach the leaf pages and -is not related to a translation directly. It points to other shadow pages. - -A leaf spte corresponds to either one or two translations encoded into -one paging structure entry. These are always the lowest level of the -translation stack, with optional higher level translations left to NPT/EPT. -Leaf ptes point at guest pages. - -The following table shows translations encoded by leaf ptes, with higher-level -translations in parentheses: - - Non-nested guests:: - - nonpaging: gpa->hpa - paging: gva->gpa->hpa - paging, tdp: (gva->)gpa->hpa - - Nested guests:: - - non-tdp: ngva->gpa->hpa (*) - tdp: (ngva->)ngpa->gpa->hpa - - (*) the guest hypervisor will encode the ngva->gpa translation into its page - tables if npt is not present - -Shadow pages contain the following information: - role.level: - The level in the shadow paging hierarchy that this shadow page belongs to. - 1=4k sptes, 2=2M sptes, 3=1G sptes, etc. - role.direct: - If set, leaf sptes reachable from this page are for a linear range. - Examples include real mode translation, large guest pages backed by small - host pages, and gpa->hpa translations when NPT or EPT is active. - The linear range starts at (gfn << PAGE_SHIFT) and its size is determined - by role.level (2MB for first level, 1GB for second level, 0.5TB for third - level, 256TB for fourth level) - If clear, this page corresponds to a guest page table denoted by the gfn - field. - role.quadrant: - When role.has_4_byte_gpte=1, the guest uses 32-bit gptes while the host uses 64-bit - sptes. That means a guest page table contains more ptes than the host, - so multiple shadow pages are needed to shadow one guest page. - For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the - first or second 512-gpte block in the guest page table. For second-level - page tables, each 32-bit gpte is converted to two 64-bit sptes - (since each first-level guest page is shadowed by two first-level - shadow pages) so role.quadrant takes values in the range 0..3. Each - quadrant maps 1GB virtual address space. - role.access: - Inherited guest access permissions from the parent ptes in the form uwx. - Note execute permission is positive, not negative. - role.invalid: - The page is invalid and should not be used. It is a root page that is - currently pinned (by a cpu hardware register pointing to it); once it is - unpinned it will be destroyed. - role.has_4_byte_gpte: - Reflects the size of the guest PTE for which the page is valid, i.e. '0' - if direct map or 64-bit gptes are in use, '1' if 32-bit gptes are in use. - role.efer_nx: - Contains the value of efer.nx for which the page is valid. - role.cr0_wp: - Contains the value of cr0.wp for which the page is valid. 
- role.smep_andnot_wp: - Contains the value of cr4.smep && !cr0.wp for which the page is valid - (pages for which this is true are different from other pages; see the - treatment of cr0.wp=0 below). - role.smap_andnot_wp: - Contains the value of cr4.smap && !cr0.wp for which the page is valid - (pages for which this is true are different from other pages; see the - treatment of cr0.wp=0 below). - role.smm: - Is 1 if the page is valid in system management mode. This field - determines which of the kvm_memslots array was used to build this - shadow page; it is also used to go back from a struct kvm_mmu_page - to a memslot, through the kvm_memslots_for_spte_role macro and - __gfn_to_memslot. - role.ad_disabled: - Is 1 if the MMU instance cannot use A/D bits. EPT did not have A/D - bits before Haswell; shadow EPT page tables also cannot use A/D bits - if the L1 hypervisor does not enable them. - gfn: - Either the guest page table containing the translations shadowed by this - page, or the base page frame for linear translations. See role.direct. - spt: - A pageful of 64-bit sptes containing the translations for this page. - Accessed by both kvm and hardware. - The page pointed to by spt will have its page->private pointing back - at the shadow page structure. - sptes in spt point either at guest pages, or at lower-level shadow pages. - Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point - at __pa(sp2->spt). sp2 will point back at sp1 through parent_pte. - The spt array forms a DAG structure with the shadow page as a node, and - guest pages as leaves. - gfns: - An array of 512 guest frame numbers, one for each present pte. Used to - perform a reverse map from a pte to a gfn. When role.direct is set, any - element of this array can be calculated from the gfn field when used, in - this case, the array of gfns is not allocated. See role.direct and gfn. - root_count: - A counter keeping track of how many hardware registers (guest cr3 or - pdptrs) are now pointing at the page. While this counter is nonzero, the - page cannot be destroyed. See role.invalid. - parent_ptes: - The reverse mapping for the pte/ptes pointing at this page's spt. If - parent_ptes bit 0 is zero, only one spte points at this page and - parent_ptes points at this single spte, otherwise, there exists multiple - sptes pointing at this page and (parent_ptes & ~0x1) points at a data - structure with a list of parent sptes. - unsync: - If true, then the translations in this page may not match the guest's - translation. This is equivalent to the state of the tlb when a pte is - changed but before the tlb entry is flushed. Accordingly, unsync ptes - are synchronized when the guest executes invlpg or flushes its tlb by - other means. Valid for leaf pages. - unsync_children: - How many sptes in the page point at pages that are unsync (or have - unsynchronized children). - unsync_child_bitmap: - A bitmap indicating which sptes in spt point (directly or indirectly) at - pages that may be unsynchronized. Used to quickly locate all unsychronized - pages reachable from a given page. - clear_spte_count: - Only present on 32-bit hosts, where a 64-bit spte cannot be written - atomically. The reader uses this while running out of the MMU lock - to detect in-progress updates and retry them until the writer has - finished the write. - write_flooding_count: - A guest may write to a page table many times, causing a lot of - emulations if the page needs to be write-protected (see "Synchronized - and unsynchronized pages" below). 
Leaf pages can be unsynchronized - so that they do not trigger frequent emulation, but this is not - possible for non-leafs. This field counts the number of emulations - since the last time the page table was actually used; if emulation - is triggered too frequently on this page, KVM will unmap the page - to avoid emulation in the future. - -Reverse map -=========== - -The mmu maintains a reverse mapping whereby all ptes mapping a page can be -reached given its gfn. This is used, for example, when swapping out a page. - -Synchronized and unsynchronized pages -===================================== - -The guest uses two events to synchronize its tlb and page tables: tlb flushes -and page invalidations (invlpg). - -A tlb flush means that we need to synchronize all sptes reachable from the -guest's cr3. This is expensive, so we keep all guest page tables write -protected, and synchronize sptes to gptes when a gpte is written. - -A special case is when a guest page table is reachable from the current -guest cr3. In this case, the guest is obliged to issue an invlpg instruction -before using the translation. We take advantage of that by removing write -protection from the guest page, and allowing the guest to modify it freely. -We synchronize modified gptes when the guest invokes invlpg. This reduces -the amount of emulation we have to do when the guest modifies multiple gptes, -or when the a guest page is no longer used as a page table and is used for -random guest data. - -As a side effect we have to resynchronize all reachable unsynchronized shadow -pages on a tlb flush. - - -Reaction to events -================== - -- guest page fault (or npt page fault, or ept violation) - -This is the most complicated event. The cause of a page fault can be: - - - a true guest fault (the guest translation won't allow the access) (*) - - access to a missing translation - - access to a protected translation - - when logging dirty pages, memory is write protected - - synchronized shadow pages are write protected (*) - - access to untranslatable memory (mmio) - - (*) not applicable in direct mode - -Handling a page fault is performed as follows: - - - if the RSV bit of the error code is set, the page fault is caused by guest - accessing MMIO and cached MMIO information is available. - - - walk shadow page table - - check for valid generation number in the spte (see "Fast invalidation of - MMIO sptes" below) - - cache the information to vcpu->arch.mmio_gva, vcpu->arch.mmio_access and - vcpu->arch.mmio_gfn, and call the emulator - - - If both P bit and R/W bit of error code are set, this could possibly - be handled as a "fast page fault" (fixed without taking the MMU lock). See - the description in Documentation/virt/kvm/locking.rst. 
- - - if needed, walk the guest page tables to determine the guest translation - (gva->gpa or ngpa->gpa) - - - if permissions are insufficient, reflect the fault back to the guest - - - determine the host page - - - if this is an mmio request, there is no host page; cache the info to - vcpu->arch.mmio_gva, vcpu->arch.mmio_access and vcpu->arch.mmio_gfn - - - walk the shadow page table to find the spte for the translation, - instantiating missing intermediate page tables as necessary - - - If this is an mmio request, cache the mmio info to the spte and set some - reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask) - - - try to unsynchronize the page - - - if successful, we can let the guest continue and modify the gpte - - - emulate the instruction - - - if failed, unshadow the page and let the guest continue - - - update any translations that were modified by the instruction - -invlpg handling: - - - walk the shadow page hierarchy and drop affected translations - - try to reinstantiate the indicated translation in the hope that the - guest will use it in the near future - -Guest control register updates: - -- mov to cr3 - - - look up new shadow roots - - synchronize newly reachable shadow pages - -- mov to cr0/cr4/efer - - - set up mmu context for new paging mode - - look up new shadow roots - - synchronize newly reachable shadow pages - -Host translation updates: - - - mmu notifier called with updated hva - - look up affected sptes through reverse map - - drop (or update) translations - -Emulating cr0.wp -================ - -If tdp is not enabled, the host must keep cr0.wp=1 so page write protection -works for the guest kernel, not guest guest userspace. When the guest -cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0, -we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the -semantics require allowing any guest kernel access plus user read access). - -We handle this by mapping the permissions to two possible sptes, depending -on fault type: - -- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access, - disallows user access) -- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel - write access) - -(user write faults generate a #PF) - -In the first case there are two additional complications: - -- if CR4.SMEP is enabled: since we've turned the page into a kernel page, - the kernel may now execute it. We handle this by also setting spte.nx. - If we get a user fetch or read fault, we'll change spte.u=1 and - spte.nx=gpte.nx back. For this to work, KVM forces EFER.NX to 1 when - shadow paging is in use. -- if CR4.SMAP is disabled: since the page has been changed to a kernel - page, it can not be reused when CR4.SMAP is enabled. We set - CR4.SMAP && !CR0.WP into shadow page's role to avoid this case. Note, - here we do not care the case that CR4.SMAP is enabled since KVM will - directly inject #PF to guest due to failed permission check. - -To prevent an spte that was converted into a kernel page with cr0.wp=0 -from being written by the kernel after cr0.wp has changed to 1, we make -the value of cr0.wp part of the page role. This means that an spte created -with one value of cr0.wp cannot be used when cr0.wp has a different value - -it will simply be missed by the shadow page lookup code. A similar issue -exists when an spte created with cr0.wp=0 and cr4.smep=0 is used after -changing cr4.smep to 1. To avoid this, the value of !cr0.wp && cr4.smep -is also made a part of the page role. 
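The resulting spte shapes for a gpte with gpte.u=1, gpte.w=0 while the guest runs with cr0.wp=0 can be summarised with the following sketch. The structure and helper are illustrative, not actual KVM types::

    /* Illustrative only: how the two fault types map to spte permissions. */
    struct spte_bits { int user, writable, nx; };

    static struct spte_bits map_wp0_gpte(int kernel_write_fault, int cr4_smep,
                                         int gpte_nx)
    {
        struct spte_bits s;

        if (kernel_write_fault) {
            /* Kernel may write; user access is disallowed entirely. */
            s.user = 0;
            s.writable = 1;
            /* The page became a "kernel" page: keep it non-executable under SMEP. */
            s.nx = cr4_smep ? 1 : gpte_nx;
        } else {
            /* Read fault: anyone may read, nobody may write through this spte. */
            s.user = 1;
            s.writable = 0;
            s.nx = gpte_nx;
        }
        return s;
    }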
- -Large pages -=========== - -The mmu supports all combinations of large and small guest and host pages. -Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as -two separate 2M pages, on both guest and host, since the mmu always uses PAE -paging. - -To instantiate a large spte, four constraints must be satisfied: - -- the spte must point to a large host page -- the guest pte must be a large pte of at least equivalent size (if tdp is - enabled, there is no guest pte and this condition is satisfied) -- if the spte will be writeable, the large page frame may not overlap any - write-protected pages -- the guest page must be wholly contained by a single memory slot - -To check the last two conditions, the mmu maintains a ->disallow_lpage set of -arrays for each memory slot and large page size. Every write protected page -causes its disallow_lpage to be incremented, thus preventing instantiation of -a large spte. The frames at the end of an unaligned memory slot have -artificially inflated ->disallow_lpages so they can never be instantiated. - -Fast invalidation of MMIO sptes -=============================== - -As mentioned in "Reaction to events" above, kvm will cache MMIO -information in leaf sptes. When a new memslot is added or an existing -memslot is changed, this information may become stale and needs to be -invalidated. This also needs to hold the MMU lock while walking all -shadow pages, and is made more scalable with a similar technique. - -MMIO sptes have a few spare bits, which are used to store a -generation number. The global generation number is stored in -kvm_memslots(kvm)->generation, and increased whenever guest memory info -changes. - -When KVM finds an MMIO spte, it checks the generation number of the spte. -If the generation number of the spte does not equal the global generation -number, it will ignore the cached MMIO information and handle the page -fault through the slow path. - -Since only 18 bits are used to store generation-number on mmio spte, all -pages are zapped when there is an overflow. - -Unfortunately, a single memory access might access kvm_memslots(kvm) multiple -times, the last one happening when the generation number is retrieved and -stored into the MMIO spte. Thus, the MMIO spte might be created based on -out-of-date information, but with an up-to-date generation number. - -To avoid this, the generation number is incremented again after synchronize_srcu -returns; thus, bit 63 of kvm_memslots(kvm)->generation set to 1 only during a -memslot update, while some SRCU readers might be using the old copy. We do not -want to use an MMIO sptes created with an odd generation number, and we can do -this without losing a bit in the MMIO spte. The "update in-progress" bit of the -generation is not stored in MMIO spte, and is so is implicitly zero when the -generation is extracted out of the spte. If KVM is unlucky and creates an MMIO -spte while an update is in-progress, the next access to the spte will always be -a cache miss. For example, a subsequent access during the update window will -miss due to the in-progress flag diverging, while an access after the update -window closes will have a higher generation number (as compared to the spte). 
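A rough sketch of the staleness check described above follows. The bit widths, masks and helper names are illustrative assumptions, not the kernel's actual MMIO spte layout; only the behaviour (truncated comparison plus an always-miss during a memslot update) is taken from the text::

    #define MMIO_GEN_BITS           18
    #define MMIO_GEN_MASK           ((1u << MMIO_GEN_BITS) - 1)
    #define GEN_UPDATE_IN_PROGRESS  (1ull << 63)   /* as described in the text */

    static int mmio_info_is_stale(unsigned int spte_gen,
                                  unsigned long long memslots_gen)
    {
        unsigned int cur = memslots_gen & MMIO_GEN_MASK;
        int in_progress = !!(memslots_gen & GEN_UPDATE_IN_PROGRESS);

        /*
         * The in-progress bit is implicitly zero in spte_gen, so any spte
         * created while a memslot update is in flight compares as stale,
         * and the fault is handled through the slow path.
         */
        return in_progress || spte_gen != cur;
    }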
- - -Further reading -=============== - -- NPT presentation from KVM Forum 2008 - https://www.linux-kvm.org/images/c/c8/KvmForum2008%24kdf2008_21.pdf diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst deleted file mode 100644 index 9315fc385fb0..000000000000 --- a/Documentation/virt/kvm/msr.rst +++ /dev/null @@ -1,391 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -================= -KVM-specific MSRs -================= - -:Author: Glauber Costa , Red Hat Inc, 2010 - -KVM makes use of some custom MSRs to service some requests. - -Custom MSRs have a range reserved for them, that goes from -0x4b564d00 to 0x4b564dff. There are MSRs outside this area, -but they are deprecated and their use is discouraged. - -Custom MSR list ---------------- - -The current supported Custom MSR list is: - -MSR_KVM_WALL_CLOCK_NEW: - 0x4b564d00 - -data: - 4-byte alignment physical address of a memory area which must be - in guest RAM. This memory is expected to hold a copy of the following - structure:: - - struct pvclock_wall_clock { - u32 version; - u32 sec; - u32 nsec; - } __attribute__((__packed__)); - - whose data will be filled in by the hypervisor. The hypervisor is only - guaranteed to update this data at the moment of MSR write. - Users that want to reliably query this information more than once have - to write more than once to this MSR. Fields have the following meanings: - - version: - guest has to check version before and after grabbing - time information and check that they are both equal and even. - An odd version indicates an in-progress update. - - sec: - number of seconds for wallclock at time of boot. - - nsec: - number of nanoseconds for wallclock at time of boot. - - In order to get the current wallclock time, the system_time from - MSR_KVM_SYSTEM_TIME_NEW needs to be added. - - Note that although MSRs are per-CPU entities, the effect of this - particular MSR is global. - - Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid - leaf prior to usage. - -MSR_KVM_SYSTEM_TIME_NEW: - 0x4b564d01 - -data: - 4-byte aligned physical address of a memory area which must be in - guest RAM, plus an enable bit in bit 0. This memory is expected to hold - a copy of the following structure:: - - struct pvclock_vcpu_time_info { - u32 version; - u32 pad0; - u64 tsc_timestamp; - u64 system_time; - u32 tsc_to_system_mul; - s8 tsc_shift; - u8 flags; - u8 pad[2]; - } __attribute__((__packed__)); /* 32 bytes */ - - whose data will be filled in by the hypervisor periodically. Only one - write, or registration, is needed for each VCPU. The interval between - updates of this structure is arbitrary and implementation-dependent. - The hypervisor may update this structure at any time it sees fit until - anything with bit0 == 0 is written to it. - - Fields have the following meanings: - - version: - guest has to check version before and after grabbing - time information and check that they are both equal and even. - An odd version indicates an in-progress update. - - tsc_timestamp: - the tsc value at the current VCPU at the time - of the update of this structure. Guests can subtract this value - from current tsc to derive a notion of elapsed time since the - structure update. - - system_time: - a host notion of monotonic time, including sleep - time at the time this structure was last updated. Unit is - nanoseconds. 
- - tsc_to_system_mul: - multiplier to be used when converting - tsc-related quantity to nanoseconds - - tsc_shift: - shift to be used when converting tsc-related - quantity to nanoseconds. This shift will ensure that - multiplication with tsc_to_system_mul does not overflow. - A positive value denotes a left shift, a negative value - a right shift. - - The conversion from tsc to nanoseconds involves an additional - right shift by 32 bits. With this information, guests can - derive per-CPU time by doing:: - - time = (current_tsc - tsc_timestamp) - if (tsc_shift >= 0) - time <<= tsc_shift; - else - time >>= -tsc_shift; - time = (time * tsc_to_system_mul) >> 32 - time = time + system_time - - flags: - bits in this field indicate extended capabilities - coordinated between the guest and the hypervisor. Availability - of specific flags has to be checked in 0x40000001 cpuid leaf. - Current flags are: - - - +-----------+--------------+----------------------------------+ - | flag bit | cpuid bit | meaning | - +-----------+--------------+----------------------------------+ - | | | time measures taken across | - | 0 | 24 | multiple cpus are guaranteed to | - | | | be monotonic | - +-----------+--------------+----------------------------------+ - | | | guest vcpu has been paused by | - | 1 | N/A | the host | - | | | See 4.70 in api.txt | - +-----------+--------------+----------------------------------+ - - Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid - leaf prior to usage. - - -MSR_KVM_WALL_CLOCK: - 0x11 - -data and functioning: - same as MSR_KVM_WALL_CLOCK_NEW. Use that instead. - - This MSR falls outside the reserved KVM range and may be removed in the - future. Its usage is deprecated. - - Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid - leaf prior to usage. - -MSR_KVM_SYSTEM_TIME: - 0x12 - -data and functioning: - same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead. - - This MSR falls outside the reserved KVM range and may be removed in the - future. Its usage is deprecated. - - Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid - leaf prior to usage. - - The suggested algorithm for detecting kvmclock presence is then:: - - if (!kvm_para_available()) /* refer to cpuid.txt */ - return NON_PRESENT; - - flags = cpuid_eax(0x40000001); - if (flags & 3) { - msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW; - msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW; - return PRESENT; - } else if (flags & 0) { - msr_kvm_system_time = MSR_KVM_SYSTEM_TIME; - msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK; - return PRESENT; - } else - return NON_PRESENT; - -MSR_KVM_ASYNC_PF_EN: - 0x4b564d02 - -data: - Asynchronous page fault (APF) control MSR. - - Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area - which must be in guest RAM and must be zeroed. This memory is expected - to hold a copy of the following structure:: - - struct kvm_vcpu_pv_apf_data { - /* Used for 'page not present' events delivered via #PF */ - __u32 flags; - - /* Used for 'page ready' events delivered via interrupt notification */ - __u32 token; - - __u8 pad[56]; - __u32 enabled; - }; - - Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1 - when asynchronous page faults are enabled on the vcpu, 0 when disabled. - Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in - cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as - #PF vmexits. Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is - present in CPUID. 
Bit 3 enables interrupt based delivery of 'page ready' - events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in - CPUID. - - 'Page not present' events are currently always delivered as synthetic - #PF exception. During delivery of these events APF CR2 register contains - a token that will be used to notify the guest when missing page becomes - available. Also, to make it possible to distinguish between real #PF and - APF, first 4 bytes of 64 byte memory location ('flags') will be written - to by the hypervisor at the time of injection. Only first bit of 'flags' - is currently supported, when set, it indicates that the guest is dealing - with asynchronous 'page not present' event. If during a page fault APF - 'flags' is '0' it means that this is regular page fault. Guest is - supposed to clear 'flags' when it is done handling #PF exception so the - next event can be delivered. - - Note, since APF 'page not present' events use the same exception vector - as regular page fault, guest must reset 'flags' to '0' before it does - something that can generate normal page fault. - - Bytes 5-7 of 64 byte memory location ('token') will be written to by the - hypervisor at the time of APF 'page ready' event injection. The content - of these bytes is a token which was previously delivered as 'page not - present' event. The event indicates the page in now available. Guest is - supposed to write '0' to 'token' when it is done handling 'page ready' - event and to write 1' to MSR_KVM_ASYNC_PF_ACK after clearing the location; - writing to the MSR forces KVM to re-scan its queue and deliver the next - pending notification. - - Note, MSR_KVM_ASYNC_PF_INT MSR specifying the interrupt vector for 'page - ready' APF delivery needs to be written to before enabling APF mechanism - in MSR_KVM_ASYNC_PF_EN or interrupt #0 can get injected. The MSR is - available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID. - - Note, previously, 'page ready' events were delivered via the same #PF - exception as 'page not present' events but this is now deprecated. If - bit 3 (interrupt based delivery) is not set APF events are not delivered. - - If APF is disabled while there are outstanding APFs, they will - not be delivered. - - Currently 'page ready' APF events will be always delivered on the - same vcpu as 'page not present' event was, but guest should not rely on - that. - -MSR_KVM_STEAL_TIME: - 0x4b564d03 - -data: - 64-byte alignment physical address of a memory area which must be - in guest RAM, plus an enable bit in bit 0. This memory is expected to - hold a copy of the following structure:: - - struct kvm_steal_time { - __u64 steal; - __u32 version; - __u32 flags; - __u8 preempted; - __u8 u8_pad[3]; - __u32 pad[11]; - } - - whose data will be filled in by the hypervisor periodically. Only one - write, or registration, is needed for each VCPU. The interval between - updates of this structure is arbitrary and implementation-dependent. - The hypervisor may update this structure at any time it sees fit until - anything with bit0 == 0 is written to it. Guest is required to make sure - this structure is initialized to zero. - - Fields have the following meanings: - - version: - a sequence counter. In other words, guest has to check - this field before and after grabbing time information and make - sure they are both equal and even. An odd version indicates an - in-progress update. - - flags: - At this point, always zero. May be used to indicate - changes in this structure in the future. 
- - steal: - the amount of time in which this vCPU did not run, in - nanoseconds. Time during which the vcpu is idle, will not be - reported as steal time. - - preempted: - indicate the vCPU who owns this struct is running or - not. Non-zero values mean the vCPU has been preempted. Zero - means the vCPU is not preempted. NOTE, it is always zero if the - the hypervisor doesn't support this field. - -MSR_KVM_EOI_EN: - 0x4b564d04 - -data: - Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0 - when disabled. Bit 1 is reserved and must be zero. When PV end of - interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned - physical address of a 4 byte memory area which must be in guest RAM and - must be zeroed. - - The first, least significant bit of 4 byte memory location will be - written to by the hypervisor, typically at the time of interrupt - injection. Value of 1 means that guest can skip writing EOI to the apic - (using MSR or MMIO write); instead, it is sufficient to signal - EOI by clearing the bit in guest memory - this location will - later be polled by the hypervisor. - Value of 0 means that the EOI write is required. - - It is always safe for the guest to ignore the optimization and perform - the APIC EOI write anyway. - - Hypervisor is guaranteed to only modify this least - significant bit while in the current VCPU context, this means that - guest does not need to use either lock prefix or memory ordering - primitives to synchronise with the hypervisor. - - However, hypervisor can set and clear this memory bit at any time: - therefore to make sure hypervisor does not interrupt the - guest and clear the least significant bit in the memory area - in the window between guest testing it to detect - whether it can skip EOI apic write and between guest - clearing it to signal EOI to the hypervisor, - guest must both read the least significant bit in the memory area and - clear it using a single CPU instruction, such as test and clear, or - compare and exchange. - -MSR_KVM_POLL_CONTROL: - 0x4b564d05 - - Control host-side polling. - -data: - Bit 0 enables (1) or disables (0) host-side HLT polling logic. - - KVM guests can request the host not to poll on HLT, for example if - they are performing polling themselves. - -MSR_KVM_ASYNC_PF_INT: - 0x4b564d06 - -data: - Second asynchronous page fault (APF) control MSR. - - Bits 0-7: APIC vector for delivery of 'page ready' APF events. - Bits 8-63: Reserved - - Interrupt vector for asynchnonous 'page ready' notifications delivery. - The vector has to be set up before asynchronous page fault mechanism - is enabled in MSR_KVM_ASYNC_PF_EN. The MSR is only available if - KVM_FEATURE_ASYNC_PF_INT is present in CPUID. - -MSR_KVM_ASYNC_PF_ACK: - 0x4b564d07 - -data: - Asynchronous page fault (APF) acknowledgment. - - When the guest is done processing 'page ready' APF event and 'token' - field in 'struct kvm_vcpu_pv_apf_data' is cleared it is supposed to - write '1' to bit 0 of the MSR, this causes the host to re-scan its queue - and check if there are more notifications pending. The MSR is available - if KVM_FEATURE_ASYNC_PF_INT is present in CPUID. - -MSR_KVM_MIGRATION_CONTROL: - 0x4b564d08 - -data: - This MSR is available if KVM_FEATURE_MIGRATION_CONTROL is present in - CPUID. Bit 0 represents whether live migration of the guest is allowed. - - When a guest is started, bit 0 will be 0 if the guest has encrypted - memory and 1 if the guest does not have encrypted memory. 
If the - guest is communicating page encryption status to the host using the - ``KVM_HC_MAP_GPA_RANGE`` hypercall, it can set bit 0 in this MSR to - allow live migration of the guest. diff --git a/Documentation/virt/kvm/nested-vmx.rst b/Documentation/virt/kvm/nested-vmx.rst deleted file mode 100644 index ac2095d41f02..000000000000 --- a/Documentation/virt/kvm/nested-vmx.rst +++ /dev/null @@ -1,244 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -========== -Nested VMX -========== - -Overview ---------- - -On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions) -to easily and efficiently run guest operating systems. Normally, these guests -*cannot* themselves be hypervisors running their own guests, because in VMX, -guests cannot use VMX instructions. - -The "Nested VMX" feature adds this missing capability - of running guest -hypervisors (which use VMX) with their own nested guests. It does so by -allowing a guest to use VMX instructions, and correctly and efficiently -emulating them using the single level of VMX available in the hardware. - -We describe in much greater detail the theory behind the nested VMX feature, -its implementation and its performance characteristics, in the OSDI 2010 paper -"The Turtles Project: Design and Implementation of Nested Virtualization", -available at: - - https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf - - -Terminology ------------ - -Single-level virtualization has two levels - the host (KVM) and the guests. -In nested virtualization, we have three levels: The host (KVM), which we call -L0, the guest hypervisor, which we call L1, and its nested guest, which we -call L2. - - -Running nested VMX ------------------- - -The nested VMX feature is enabled by default since Linux kernel v4.20. For -older Linux kernel, it can be enabled by giving the "nested=1" option to the -kvm-intel module. - - -No modifications are required to user space (qemu). However, qemu's default -emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be -explicitly enabled, by giving qemu one of the following options: - - - cpu host (emulated CPU has all features of the real CPU) - - - cpu qemu64,+vmx (add just the vmx feature to a named CPU type) - - -ABIs ----- - -Nested VMX aims to present a standard and (eventually) fully-functional VMX -implementation for the a guest hypervisor to use. As such, the official -specification of the ABI that it provides is Intel's VMX specification, -namely volume 3B of their "Intel 64 and IA-32 Architectures Software -Developer's Manual". Not all of VMX's features are currently fully supported, -but the goal is to eventually support them all, starting with the VMX features -which are used in practice by popular hypervisors (KVM and others). - -As a VMX implementation, nested VMX presents a VMCS structure to L1. -As mandated by the spec, other than the two fields revision_id and abort, -this structure is *opaque* to its user, who is not supposed to know or care -about its internal structure. Rather, the structure is accessed through the -VMREAD and VMWRITE instructions. -Still, for debugging purposes, KVM developers might be interested to know the -internals of this structure; This is struct vmcs12 from arch/x86/kvm/vmx.c. - -The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we -also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02" is the VMCS -which L0 builds to actually run L2 - how this is done is explained in the -aforementioned paper. 
- -For convenience, we repeat the content of struct vmcs12 here. If the internals -of this structure changes, this can break live migration across KVM versions. -VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner -struct shadow_vmcs is ever changed. - -:: - - typedef u64 natural_width; - struct __packed vmcs12 { - /* According to the Intel spec, a VMCS region must start with - * these two user-visible fields */ - u32 revision_id; - u32 abort; - - u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */ - u32 padding[7]; /* room for future expansion */ - - u64 io_bitmap_a; - u64 io_bitmap_b; - u64 msr_bitmap; - u64 vm_exit_msr_store_addr; - u64 vm_exit_msr_load_addr; - u64 vm_entry_msr_load_addr; - u64 tsc_offset; - u64 virtual_apic_page_addr; - u64 apic_access_addr; - u64 ept_pointer; - u64 guest_physical_address; - u64 vmcs_link_pointer; - u64 guest_ia32_debugctl; - u64 guest_ia32_pat; - u64 guest_ia32_efer; - u64 guest_pdptr0; - u64 guest_pdptr1; - u64 guest_pdptr2; - u64 guest_pdptr3; - u64 host_ia32_pat; - u64 host_ia32_efer; - u64 padding64[8]; /* room for future expansion */ - natural_width cr0_guest_host_mask; - natural_width cr4_guest_host_mask; - natural_width cr0_read_shadow; - natural_width cr4_read_shadow; - natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */ - natural_width exit_qualification; - natural_width guest_linear_address; - natural_width guest_cr0; - natural_width guest_cr3; - natural_width guest_cr4; - natural_width guest_es_base; - natural_width guest_cs_base; - natural_width guest_ss_base; - natural_width guest_ds_base; - natural_width guest_fs_base; - natural_width guest_gs_base; - natural_width guest_ldtr_base; - natural_width guest_tr_base; - natural_width guest_gdtr_base; - natural_width guest_idtr_base; - natural_width guest_dr7; - natural_width guest_rsp; - natural_width guest_rip; - natural_width guest_rflags; - natural_width guest_pending_dbg_exceptions; - natural_width guest_sysenter_esp; - natural_width guest_sysenter_eip; - natural_width host_cr0; - natural_width host_cr3; - natural_width host_cr4; - natural_width host_fs_base; - natural_width host_gs_base; - natural_width host_tr_base; - natural_width host_gdtr_base; - natural_width host_idtr_base; - natural_width host_ia32_sysenter_esp; - natural_width host_ia32_sysenter_eip; - natural_width host_rsp; - natural_width host_rip; - natural_width paddingl[8]; /* room for future expansion */ - u32 pin_based_vm_exec_control; - u32 cpu_based_vm_exec_control; - u32 exception_bitmap; - u32 page_fault_error_code_mask; - u32 page_fault_error_code_match; - u32 cr3_target_count; - u32 vm_exit_controls; - u32 vm_exit_msr_store_count; - u32 vm_exit_msr_load_count; - u32 vm_entry_controls; - u32 vm_entry_msr_load_count; - u32 vm_entry_intr_info_field; - u32 vm_entry_exception_error_code; - u32 vm_entry_instruction_len; - u32 tpr_threshold; - u32 secondary_vm_exec_control; - u32 vm_instruction_error; - u32 vm_exit_reason; - u32 vm_exit_intr_info; - u32 vm_exit_intr_error_code; - u32 idt_vectoring_info_field; - u32 idt_vectoring_error_code; - u32 vm_exit_instruction_len; - u32 vmx_instruction_info; - u32 guest_es_limit; - u32 guest_cs_limit; - u32 guest_ss_limit; - u32 guest_ds_limit; - u32 guest_fs_limit; - u32 guest_gs_limit; - u32 guest_ldtr_limit; - u32 guest_tr_limit; - u32 guest_gdtr_limit; - u32 guest_idtr_limit; - u32 guest_es_ar_bytes; - u32 guest_cs_ar_bytes; - u32 guest_ss_ar_bytes; - u32 guest_ds_ar_bytes; - u32 guest_fs_ar_bytes; - u32 guest_gs_ar_bytes; 
- u32 guest_ldtr_ar_bytes; - u32 guest_tr_ar_bytes; - u32 guest_interruptibility_info; - u32 guest_activity_state; - u32 guest_sysenter_cs; - u32 host_ia32_sysenter_cs; - u32 padding32[8]; /* room for future expansion */ - u16 virtual_processor_id; - u16 guest_es_selector; - u16 guest_cs_selector; - u16 guest_ss_selector; - u16 guest_ds_selector; - u16 guest_fs_selector; - u16 guest_gs_selector; - u16 guest_ldtr_selector; - u16 guest_tr_selector; - u16 host_es_selector; - u16 host_cs_selector; - u16 host_ss_selector; - u16 host_ds_selector; - u16 host_fs_selector; - u16 host_gs_selector; - u16 host_tr_selector; - }; - - -Authors -------- - -These patches were written by: - - Abel Gordon, abelg il.ibm.com - - Nadav Har'El, nyh il.ibm.com - - Orit Wasserman, oritw il.ibm.com - - Ben-Ami Yassor, benami il.ibm.com - - Muli Ben-Yehuda, muli il.ibm.com - -With contributions by: - - Anthony Liguori, aliguori us.ibm.com - - Mike Day, mdday us.ibm.com - - Michael Factor, factor il.ibm.com - - Zvi Dubitzky, dubi il.ibm.com - -And valuable reviews by: - - Avi Kivity, avi redhat.com - - Gleb Natapov, gleb redhat.com - - Marcelo Tosatti, mtosatti redhat.com - - Kevin Tian, kevin.tian intel.com - - and others. diff --git a/Documentation/virt/kvm/running-nested-guests.rst b/Documentation/virt/kvm/running-nested-guests.rst deleted file mode 100644 index bd70c69468ae..000000000000 --- a/Documentation/virt/kvm/running-nested-guests.rst +++ /dev/null @@ -1,276 +0,0 @@ -============================== -Running nested guests with KVM -============================== - -A nested guest is the ability to run a guest inside another guest (it -can be KVM-based or a different hypervisor). The straightforward -example is a KVM guest that in turn runs on a KVM guest (the rest of -this document is built on this example):: - - .----------------. .----------------. - | | | | - | L2 | | L2 | - | (Nested Guest) | | (Nested Guest) | - | | | | - |----------------'--'----------------| - | | - | L1 (Guest Hypervisor) | - | KVM (/dev/kvm) | - | | - .------------------------------------------------------. - | L0 (Host Hypervisor) | - | KVM (/dev/kvm) | - |------------------------------------------------------| - | Hardware (with virtualization extensions) | - '------------------------------------------------------' - -Terminology: - -- L0 – level-0; the bare metal host, running KVM - -- L1 – level-1 guest; a VM running on L0; also called the "guest - hypervisor", as it itself is capable of running KVM. - -- L2 – level-2 guest; a VM running on L1, this is the "nested guest" - -.. note:: The above diagram is modelled after the x86 architecture; - s390x, ppc64 and other architectures are likely to have - a different design for nesting. - - For example, s390x always has an LPAR (LogicalPARtition) - hypervisor running on bare metal, adding another layer and - resulting in at least four levels in a nested setup — L0 (bare - metal, running the LPAR hypervisor), L1 (host hypervisor), L2 - (guest hypervisor), L3 (nested guest). - - This document will stick with the three-level terminology (L0, - L1, and L2) for all architectures; and will largely focus on - x86. - - -Use Cases ---------- - -There are several scenarios where nested KVM can be useful, to name a -few: - -- As a developer, you want to test your software on different operating - systems (OSes). Instead of renting multiple VMs from a Cloud - Provider, using nested KVM lets you rent a large enough "guest - hypervisor" (level-1 guest). 
This in turn allows you to create - multiple nested guests (level-2 guests), running different OSes, on - which you can develop and test your software. - -- Live migration of "guest hypervisors" and their nested guests, for - load balancing, disaster recovery, etc. - -- VM image creation tools (e.g. ``virt-install``, etc) often run - their own VM, and users expect these to work inside a VM. - -- Some OSes use virtualization internally for security (e.g. to let - applications run safely in isolation). - - -Enabling "nested" (x86) ------------------------ - -From Linux kernel v4.20 onwards, the ``nested`` KVM parameter is enabled -by default for Intel and AMD. (Though your Linux distribution might -override this default.) - -In case you are running a Linux kernel older than v4.20, to enable -nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``. To -persist this setting across reboots, you can add it in a config file, as -shown below: - -1. On the bare metal host (L0), list the kernel modules and ensure that - the KVM modules are loaded:: - - $ lsmod | grep -i kvm - kvm_intel 133627 0 - kvm 435079 1 kvm_intel - -2. Show information for ``kvm_intel`` module:: - - $ modinfo kvm_intel | grep -i nested - parm: nested:bool - -3. For the nested KVM configuration to persist across reboots, place the - below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it - doesn't exist):: - - $ cat /etc/modprobe.d/kvm_intel.conf - options kvm-intel nested=y - -4. Unload and re-load the KVM Intel module:: - - $ sudo rmmod kvm-intel - $ sudo modprobe kvm-intel - -5. Verify that the ``nested`` parameter for KVM is enabled:: - - $ cat /sys/module/kvm_intel/parameters/nested - Y - -For AMD hosts, the process is the same as above, except that the module -name is ``kvm-amd``. - - -Additional nested-related kernel parameters (x86) -------------------------------------------------- - -If your hardware is sufficiently advanced (Intel Haswell processor or -higher, which has newer hardware virt extensions), the following -additional features will also be enabled by default: "Shadow VMCS -(Virtual Machine Control Structure)", APIC Virtualization on your bare -metal host (L0). Parameters for Intel hosts:: - - $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs - Y - - $ cat /sys/module/kvm_intel/parameters/enable_apicv - Y - - $ cat /sys/module/kvm_intel/parameters/ept - Y - -.. note:: If you suspect your L2 (i.e. nested guest) is running slower, - ensure the above are enabled (particularly - ``enable_shadow_vmcs`` and ``ept``). - - -Starting a nested guest (x86) ------------------------------ - -Once your bare metal host (L0) is configured for nesting, you should be -able to start an L1 guest with:: - - $ qemu-kvm -cpu host [...] - -The above will pass through the host CPU's capabilities as-is to the -guest; or for better live migration compatibility, use a named CPU -model supported by QEMU, e.g.:: - - $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on - -The guest hypervisor will then be capable of running a -nested guest with accelerated KVM. - - -Enabling "nested" (s390x) -------------------------- - -1. On the host hypervisor (L0), enable the ``nested`` parameter on - s390x:: - - $ rmmod kvm - $ modprobe kvm nested=1 - -.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive - with the ``nested`` parameter — i.e. to be able to enable - ``nested``, the ``hpage`` parameter *must* be disabled. - -2.
The guest hypervisor (L1) must be provided with the ``sie`` CPU - feature — with QEMU, this can be done by using "host passthrough" - (via the command-line ``-cpu host``). - -3. Now the KVM module can be loaded in the L1 (guest hypervisor):: - - $ modprobe kvm - - -Live migration with nested KVM ------------------------------- - -Migrating an L1 guest, with a *live* nested guest in it, to another -bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for -Intel x86 systems, and even on older versions for s390x. - -On AMD systems, once an L1 guest has started an L2 guest, the L1 guest -should no longer be migrated or saved (refer to QEMU documentation on -"savevm"/"loadvm") until the L2 guest shuts down. Attempting to migrate -or save-and-load an L1 guest while an L2 guest is running will result in -undefined behavior. You might see a ``kernel BUG!`` entry in ``dmesg``, a -kernel 'oops', or an outright kernel panic. Such a migrated or loaded L1 -guest can no longer be considered stable or secure, and must be restarted. -Migrating an L1 guest merely configured to support nesting, while not -actually running L2 guests, is expected to function normally even on AMD -systems but may fail once guests are started. - -Migrating an L2 guest is always expected to succeed, so all the following -scenarios should work even on AMD systems: - -- Migrating a nested guest (L2) to another L1 guest on the *same* bare - metal host. - -- Migrating a nested guest (L2) to another L1 guest on a *different* - bare metal host. - -- Migrating a nested guest (L2) to a bare metal host. - -Reporting bugs from nested setups ------------------------------------ - -Debugging "nested" problems can involve sifting through log files across -L0, L1 and L2; this can result in tedious back-and-forth between the bug -reporter and the bug fixer. - -- Mention that you are in a "nested" setup. If you are running any kind - of "nesting" at all, say so. Unfortunately, this needs to be called - out because when reporting bugs, people tend to forget to even - *mention* that they're using nested virtualization. - -- Ensure you are actually running KVM on KVM. Sometimes people do not - have KVM enabled for their guest hypervisor (L1), which results in - them running with pure emulation, or what QEMU calls "TCG", while - they think they're running nested KVM. This confuses "nested Virt" - (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM). - A quick way to check this from inside L1 is sketched below.
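One minimal way to perform that check from inside L1 is to open ``/dev/kvm`` and issue the standard ``KVM_GET_API_VERSION`` ioctl; if this fails, QEMU in that guest can only fall back to TCG. This is only a sketch with minimal error handling, assuming a Linux L1 guest::

    /* Sketch: confirm KVM is actually usable inside L1 (as opposed to TCG). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/kvm.h>

    int main(void)
    {
        int fd = open("/dev/kvm", O_RDWR);

        if (fd < 0) {
            perror("open /dev/kvm");  /* no KVM here: QEMU would fall back to TCG */
            return 1;
        }
        /* KVM_GET_API_VERSION takes no argument and returns the API version (12). */
        printf("KVM API version: %d\n", ioctl(fd, KVM_GET_API_VERSION, 0));
        close(fd);
        return 0;
    }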
- -Information to collect (generic) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The following is not an exhaustive list, but a very good starting point: - - - Kernel, libvirt, and QEMU version from L0 - - - Kernel, libvirt and QEMU version from L1 - - - QEMU command-line of L1 -- when using libvirt, you'll find it here: - ``/var/log/libvirt/qemu/instance.log`` - - - QEMU command-line of L2 -- as above, when using libvirt, get the - complete libvirt-generated QEMU command-line - - - ``cat /proc/cpuinfo`` from L0 - - - ``cat /proc/cpuinfo`` from L1 - - - ``lscpu`` from L0 - - - ``lscpu`` from L1 - - - Full ``dmesg`` output from L0 - - - Full ``dmesg`` output from L1 - -x86-specific info to collect -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Both the below commands, ``x86info`` and ``dmidecode``, should be -available on most Linux distributions with the same name: - - - Output of: ``x86info -a`` from L0 - - - Output of: ``x86info -a`` from L1 - - - Output of: ``dmidecode`` from L0 - - - Output of: ``dmidecode`` from L1 - -s390x-specific info to collect -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Along with the earlier mentioned generic details, the below is -also recommended: - - - ``/proc/sysinfo`` from L1; this will also include the info from L0 diff --git a/Documentation/virt/kvm/s390-diag.rst b/Documentation/virt/kvm/s390-diag.rst deleted file mode 100644 index ca85f030eb0b..000000000000 --- a/Documentation/virt/kvm/s390-diag.rst +++ /dev/null @@ -1,119 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -============================= -The s390 DIAGNOSE call on KVM -============================= - -KVM on s390 supports the DIAGNOSE call for making hypercalls, both for -native hypercalls and for selected hypercalls found on other s390 -hypervisors. - -Note that bits are numbered as by the usual s390 convention (most significant -bit on the left). - - -General remarks ---------------- - -DIAGNOSE calls by the guest cause a mandatory intercept. This implies -all supported DIAGNOSE calls need to be handled by either KVM or its -userspace. - -All DIAGNOSE calls supported by KVM use the RS-a format:: - - -------------------------------------- - | '83' | R1 | R3 | B2 | D2 | - -------------------------------------- - 0 8 12 16 20 31 - -The second-operand address (obtained by the base/displacement calculation) -is not used to address data. Instead, bits 48-63 of this address specify -the function code, and bits 0-47 are ignored. - -The supported DIAGNOSE function codes vary by the userspace used. For -DIAGNOSE function codes not specific to KVM, please refer to the -documentation for the s390 hypervisors defining them. - - -DIAGNOSE function code 'X'500' - KVM virtio functions ------------------------------------------------------ - -If the function code specifies 0x500, various virtio-related functions -are performed. - -General register 1 contains the virtio subfunction code. Supported -virtio subfunctions depend on KVM's userspace. Generally, userspace -provides either s390-virtio (subcodes 0-2) or virtio-ccw (subcode 3). - -Upon completion of the DIAGNOSE instruction, general register 2 contains -the function's return code, which is either a return code or a subcode -specific value. - -Subcode 0 - s390-virtio notification and early console printk - Handled by userspace. - -Subcode 1 - s390-virtio reset - Handled by userspace. - -Subcode 2 - s390-virtio set status - Handled by userspace. - -Subcode 3 - virtio-ccw notification - Handled by either userspace or KVM (ioeventfd case).
- - General register 2 contains a subchannel-identification word denoting - the subchannel of the virtio-ccw proxy device to be notified. - - General register 3 contains the number of the virtqueue to be notified. - - General register 4 contains a 64bit identifier for KVM usage (the - kvm_io_bus cookie). If general register 4 does not contain a valid - identifier, it is ignored. - - After completion of the DIAGNOSE call, general register 2 may contain - a 64bit identifier (in the kvm_io_bus cookie case), or a negative - error value, if an internal error occurred. - - See also the virtio standard for a discussion of this hypercall. - - -DIAGNOSE function code 'X'501 - KVM breakpoint ----------------------------------------------- - -If the function code specifies 0x501, breakpoint functions may be performed. -This function code is handled by userspace. - -This diagnose function code has no subfunctions and uses no parameters. - - -DIAGNOSE function code 'X'9C - Voluntary Time Slice Yield ---------------------------------------------------------- - -General register 1 contains the target CPU address. - -In a guest of a hypervisor like LPAR, KVM or z/VM using shared host CPUs, -DIAGNOSE with function code 0x9c may improve system performance by -yielding the host CPU on which the guest CPU is running to be assigned -to another guest CPU, preferably the logical CPU containing the specified -target CPU. - - -DIAG 'X'9C forwarding -+++++++++++++++++++++ - -The guest may send a DIAGNOSE 0x9c in order to yield to a certain -other vcpu. An example is a Linux guest that tries to yield to the vcpu -that is currently holding a spinlock, but not running. - -However, on the host the real cpu backing the vcpu may itself not be -running. -Forwarding the DIAGNOSE 0x9c initially sent by the guest to yield to -the backing cpu will hopefully cause that cpu, and thus subsequently -the guest's vcpu, to be scheduled. - - -diag9c_forwarding_hz - KVM kernel parameter allowing to specify the maximum number of DIAGNOSE - 0x9c forwarding per second in the purpose of avoiding a DIAGNOSE 0x9c - forwarding storm. - A value of 0 turns the forwarding off. diff --git a/Documentation/virt/kvm/s390-pv-boot.rst b/Documentation/virt/kvm/s390-pv-boot.rst deleted file mode 100644 index 73a6083cb5e7..000000000000 --- a/Documentation/virt/kvm/s390-pv-boot.rst +++ /dev/null @@ -1,84 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -====================================== -s390 (IBM Z) Boot/IPL of Protected VMs -====================================== - -Summary -------- -The memory of Protected Virtual Machines (PVMs) is not accessible to -I/O or the hypervisor. In those cases where the hypervisor needs to -access the memory of a PVM, that memory must be made accessible. -Memory made accessible to the hypervisor will be encrypted. See -Documentation/virt/kvm/s390-pv.rst for details." - -On IPL (boot) a small plaintext bootloader is started, which provides -information about the encrypted components and necessary metadata to -KVM to decrypt the protected virtual machine. - -Based on this data, KVM will make the protected virtual machine known -to the Ultravisor (UV) and instruct it to secure the memory of the -PVM, decrypt the components and verify the data and address list -hashes, to ensure integrity. Afterwards KVM can run the PVM via the -SIE instruction which the UV will intercept and execute on KVM's -behalf. 
- -As the guest image is just like an opaque kernel image that does the -switch into PV mode itself, the user can load encrypted guest -executables and data via every available method (network, dasd, scsi, -direct kernel, ...) without the need to change the boot process. - - -Diag308 -------- -This diagnose instruction is the basic mechanism to handle IPL and -related operations for virtual machines. The VM can set and retrieve -IPL information blocks, that specify the IPL method/devices and -request VM memory and subsystem resets, as well as IPLs. - -For PVMs this concept has been extended with new subcodes: - -Subcode 8: Set an IPL Information Block of type 5 (information block -for PVMs) -Subcode 9: Store the saved block in guest memory -Subcode 10: Move into Protected Virtualization mode - -The new PV load-device-specific-parameters field specifies all data -that is necessary to move into PV mode. - -* PV Header origin -* PV Header length -* List of Components composed of - * AES-XTS Tweak prefix - * Origin - * Size - -The PV header contains the keys and hashes, which the UV will use to -decrypt and verify the PV, as well as control flags and a start PSW. - -The components are for instance an encrypted kernel, kernel parameters -and initrd. The components are decrypted by the UV. - -After the initial import of the encrypted data, all defined pages will -contain the guest content. All non-specified pages will start out as -zero pages on first access. - - -When running in protected virtualization mode, some subcodes will result in -exceptions or return error codes. - -Subcodes 4 and 7, which specify operations that do not clear the guest -memory, will result in specification exceptions. This is because the -UV will clear all memory when a secure VM is removed, and therefore -non-clearing IPL subcodes are not allowed. - -Subcodes 8, 9, 10 will result in specification exceptions. -Re-IPL into a protected mode is only possible via a detour into non -protected mode. - -Keys ----- -Every CEC will have a unique public key to enable tooling to build -encrypted images. -See `s390-tools `_ -for the tooling. diff --git a/Documentation/virt/kvm/s390-pv.rst b/Documentation/virt/kvm/s390-pv.rst deleted file mode 100644 index 8e41a3b63fa5..000000000000 --- a/Documentation/virt/kvm/s390-pv.rst +++ /dev/null @@ -1,116 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -========================================= -s390 (IBM Z) Ultravisor and Protected VMs -========================================= - -Summary -------- -Protected virtual machines (PVM) are KVM VMs that do not allow KVM to -access VM state like guest memory or guest registers. Instead, the -PVMs are mostly managed by a new entity called Ultravisor (UV). The UV -provides an API that can be used by PVMs and KVM to request management -actions. - -Each guest starts in non-protected mode and then may make a request to -transition into protected mode. On transition, KVM registers the guest -and its VCPUs with the Ultravisor and prepares everything for running -it. - -The Ultravisor will secure and decrypt the guest's boot memory -(i.e. kernel/initrd). It will safeguard state changes like VCPU -starts/stops and injected interrupts while the guest is running. - -As access to the guest's state, such as the SIE state description, is -normally needed to be able to run a VM, some changes have been made in -the behavior of the SIE instruction. A new format 4 state description -has been introduced, where some fields have different meanings for a -PVM. 
SIE exits are minimized as much as possible to improve speed and -reduce exposed guest state. - - -Interrupt injection -------------------- -Interrupt injection is safeguarded by the Ultravisor. As KVM doesn't -have access to the VCPUs' lowcores, injection is handled via the -format 4 state description. - -Machine check, external, IO and restart interruptions each can be -injected on SIE entry via a bit in the interrupt injection control -field (offset 0x54). If the guest cpu is not enabled for the interrupt -at the time of injection, a validity interception is recognized. The -format 4 state description contains fields in the interception data -block where data associated with the interrupt can be transported. - -Program and Service Call exceptions have another layer of -safeguarding; they can only be injected for instructions that have -been intercepted into KVM. The exceptions need to be a valid outcome -of an instruction emulation by KVM, e.g. we can never inject a -addressing exception as they are reported by SIE since KVM has no -access to the guest memory. - - -Mask notification interceptions -------------------------------- -KVM cannot intercept lctl(g) and lpsw(e) anymore in order to be -notified when a PVM enables a certain class of interrupt. As a -replacement, two new interception codes have been introduced: One -indicating that the contents of CRs 0, 6, or 14 have been changed, -indicating different interruption subclasses; and one indicating that -PSW bit 13 has been changed, indicating that a machine check -intervention was requested and those are now enabled. - -Instruction emulation ---------------------- -With the format 4 state description for PVMs, the SIE instruction already -interprets more instructions than it does with format 2. It is not able -to interpret every instruction, but needs to hand some tasks to KVM; -therefore, the SIE and the ultravisor safeguard emulation inputs and outputs. - -The control structures associated with SIE provide the Secure -Instruction Data Area (SIDA), the Interception Parameters (IP) and the -Secure Interception General Register Save Area. Guest GRs and most of -the instruction data, such as I/O data structures, are filtered. -Instruction data is copied to and from the SIDA when needed. Guest -GRs are put into / retrieved from the Secure Interception General -Register Save Area. - -Only GR values needed to emulate an instruction will be copied into this -save area and the real register numbers will be hidden. - -The Interception Parameters state description field still contains -the bytes of the instruction text, but with pre-set register values -instead of the actual ones. I.e. each instruction always uses the same -instruction text, in order not to leak guest instruction text. -This also implies that the register content that a guest had in r -may be in r from the hypervisor's point of view. - -The Secure Instruction Data Area contains instruction storage -data. Instruction data, i.e. data being referenced by an instruction -like the SCCB for sclp, is moved via the SIDA. When an instruction is -intercepted, the SIE will only allow data and program interrupts for -this instruction to be moved to the guest via the two data areas -discussed before. Other data is either ignored or results in validity -interceptions. - - -Instruction emulation interceptions ------------------------------------ -There are two types of SIE secure instruction intercepts: the normal -and the notification type. 
Normal secure instruction intercepts will -make the guest pending for instruction completion of the intercepted -instruction type, i.e. on SIE entry it is attempted to complete -emulation of the instruction with the data provided by KVM. That might -be a program exception or instruction completion. - -The notification type intercepts inform KVM about guest environment -changes due to guest instruction interpretation. Such an interception -is recognized, for example, for the store prefix instruction to provide -the new lowcore location. On SIE reentry, any KVM data in the data areas -is ignored and execution continues as if the guest instruction had -completed. For that reason KVM is not allowed to inject a program -interrupt. - -Links ------ -`KVM Forum 2019 presentation `_ diff --git a/Documentation/virt/kvm/s390/index.rst b/Documentation/virt/kvm/s390/index.rst new file mode 100644 index 000000000000..605f488f0cc5 --- /dev/null +++ b/Documentation/virt/kvm/s390/index.rst @@ -0,0 +1,12 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +KVM for s390 systems +==================== + +.. toctree:: + :maxdepth: 2 + + s390-diag + s390-pv + s390-pv-boot diff --git a/Documentation/virt/kvm/s390/s390-diag.rst b/Documentation/virt/kvm/s390/s390-diag.rst new file mode 100644 index 000000000000..ca85f030eb0b --- /dev/null +++ b/Documentation/virt/kvm/s390/s390-diag.rst @@ -0,0 +1,119 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================= +The s390 DIAGNOSE call on KVM +============================= + +KVM on s390 supports the DIAGNOSE call for making hypercalls, both for +native hypercalls and for selected hypercalls found on other s390 +hypervisors. + +Note that bits are numbered as by the usual s390 convention (most significant +bit on the left). + + +General remarks +--------------- + +DIAGNOSE calls by the guest cause a mandatory intercept. This implies +all supported DIAGNOSE calls need to be handled by either KVM or its +userspace. + +All DIAGNOSE calls supported by KVM use the RS-a format:: + + -------------------------------------- + | '83' | R1 | R3 | B2 | D2 | + -------------------------------------- + 0 8 12 16 20 31 + +The second-operand address (obtained by the base/displacement calculation) +is not used to address data. Instead, bits 48-63 of this address specify +the function code, and bits 0-47 are ignored. + +The supported DIAGNOSE function codes vary by the userspace used. For +DIAGNOSE function codes not specific to KVM, please refer to the +documentation for the s390 hypervisors defining them. + + +DIAGNOSE function code 'X'500' - KVM virtio functions +----------------------------------------------------- + +If the function code specifies 0x500, various virtio-related functions +are performed. + +General register 1 contains the virtio subfunction code. Supported +virtio subfunctions depend on KVM's userspace. Generally, userspace +provides either s390-virtio (subcodes 0-2) or virtio-ccw (subcode 3). + +Upon completion of the DIAGNOSE instruction, general register 2 contains +the function's return code, which is either a return code or a subcode +specific value. + +Subcode 0 - s390-virtio notification and early console printk + Handled by userspace. + +Subcode 1 - s390-virtio reset + Handled by userspace. + +Subcode 2 - s390-virtio set status + Handled by userspace. + +Subcode 3 - virtio-ccw notification + Handled by either userspace or KVM (ioeventfd case). 
+ + General register 2 contains a subchannel-identification word denoting + the subchannel of the virtio-ccw proxy device to be notified. + + General register 3 contains the number of the virtqueue to be notified. + + General register 4 contains a 64bit identifier for KVM usage (the + kvm_io_bus cookie). If general register 4 does not contain a valid + identifier, it is ignored. + + After completion of the DIAGNOSE call, general register 2 may contain + a 64bit identifier (in the kvm_io_bus cookie case), or a negative + error value, if an internal error occurred. + + See also the virtio standard for a discussion of this hypercall. + + +DIAGNOSE function code 'X'501 - KVM breakpoint +---------------------------------------------- + +If the function code specifies 0x501, breakpoint functions may be performed. +This function code is handled by userspace. + +This diagnose function code has no subfunctions and uses no parameters. + + +DIAGNOSE function code 'X'9C - Voluntary Time Slice Yield +--------------------------------------------------------- + +General register 1 contains the target CPU address. + +In a guest of a hypervisor like LPAR, KVM or z/VM using shared host CPUs, +DIAGNOSE with function code 0x9c may improve system performance by +yielding the host CPU on which the guest CPU is running to be assigned +to another guest CPU, preferably the logical CPU containing the specified +target CPU. + + +DIAG 'X'9C forwarding ++++++++++++++++++++++ + +The guest may send a DIAGNOSE 0x9c in order to yield to a certain +other vcpu. An example is a Linux guest that tries to yield to the vcpu +that is currently holding a spinlock, but not running. + +However, on the host the real cpu backing the vcpu may itself not be +running. +Forwarding the DIAGNOSE 0x9c initially sent by the guest to yield to +the backing cpu will hopefully cause that cpu, and thus subsequently +the guest's vcpu, to be scheduled. + + +diag9c_forwarding_hz + KVM kernel parameter allowing to specify the maximum number of DIAGNOSE + 0x9c forwarding per second in the purpose of avoiding a DIAGNOSE 0x9c + forwarding storm. + A value of 0 turns the forwarding off. diff --git a/Documentation/virt/kvm/s390/s390-pv-boot.rst b/Documentation/virt/kvm/s390/s390-pv-boot.rst new file mode 100644 index 000000000000..73a6083cb5e7 --- /dev/null +++ b/Documentation/virt/kvm/s390/s390-pv-boot.rst @@ -0,0 +1,84 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================== +s390 (IBM Z) Boot/IPL of Protected VMs +====================================== + +Summary +------- +The memory of Protected Virtual Machines (PVMs) is not accessible to +I/O or the hypervisor. In those cases where the hypervisor needs to +access the memory of a PVM, that memory must be made accessible. +Memory made accessible to the hypervisor will be encrypted. See +Documentation/virt/kvm/s390-pv.rst for details." + +On IPL (boot) a small plaintext bootloader is started, which provides +information about the encrypted components and necessary metadata to +KVM to decrypt the protected virtual machine. + +Based on this data, KVM will make the protected virtual machine known +to the Ultravisor (UV) and instruct it to secure the memory of the +PVM, decrypt the components and verify the data and address list +hashes, to ensure integrity. Afterwards KVM can run the PVM via the +SIE instruction which the UV will intercept and execute on KVM's +behalf. 
+ +As the guest image is just like an opaque kernel image that does the +switch into PV mode itself, the user can load encrypted guest +executables and data via every available method (network, dasd, scsi, +direct kernel, ...) without the need to change the boot process. + + +Diag308 +------- +This diagnose instruction is the basic mechanism to handle IPL and +related operations for virtual machines. The VM can set and retrieve +IPL information blocks, that specify the IPL method/devices and +request VM memory and subsystem resets, as well as IPLs. + +For PVMs this concept has been extended with new subcodes: + +Subcode 8: Set an IPL Information Block of type 5 (information block +for PVMs) +Subcode 9: Store the saved block in guest memory +Subcode 10: Move into Protected Virtualization mode + +The new PV load-device-specific-parameters field specifies all data +that is necessary to move into PV mode. + +* PV Header origin +* PV Header length +* List of Components composed of + * AES-XTS Tweak prefix + * Origin + * Size + +The PV header contains the keys and hashes, which the UV will use to +decrypt and verify the PV, as well as control flags and a start PSW. + +The components are for instance an encrypted kernel, kernel parameters +and initrd. The components are decrypted by the UV. + +After the initial import of the encrypted data, all defined pages will +contain the guest content. All non-specified pages will start out as +zero pages on first access. + + +When running in protected virtualization mode, some subcodes will result in +exceptions or return error codes. + +Subcodes 4 and 7, which specify operations that do not clear the guest +memory, will result in specification exceptions. This is because the +UV will clear all memory when a secure VM is removed, and therefore +non-clearing IPL subcodes are not allowed. + +Subcodes 8, 9, 10 will result in specification exceptions. +Re-IPL into a protected mode is only possible via a detour into non +protected mode. + +Keys +---- +Every CEC will have a unique public key to enable tooling to build +encrypted images. +See `s390-tools `_ +for the tooling. diff --git a/Documentation/virt/kvm/s390/s390-pv.rst b/Documentation/virt/kvm/s390/s390-pv.rst new file mode 100644 index 000000000000..8e41a3b63fa5 --- /dev/null +++ b/Documentation/virt/kvm/s390/s390-pv.rst @@ -0,0 +1,116 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +s390 (IBM Z) Ultravisor and Protected VMs +========================================= + +Summary +------- +Protected virtual machines (PVM) are KVM VMs that do not allow KVM to +access VM state like guest memory or guest registers. Instead, the +PVMs are mostly managed by a new entity called Ultravisor (UV). The UV +provides an API that can be used by PVMs and KVM to request management +actions. + +Each guest starts in non-protected mode and then may make a request to +transition into protected mode. On transition, KVM registers the guest +and its VCPUs with the Ultravisor and prepares everything for running +it. + +The Ultravisor will secure and decrypt the guest's boot memory +(i.e. kernel/initrd). It will safeguard state changes like VCPU +starts/stops and injected interrupts while the guest is running. + +As access to the guest's state, such as the SIE state description, is +normally needed to be able to run a VM, some changes have been made in +the behavior of the SIE instruction. 
A new format 4 state description +has been introduced, where some fields have different meanings for a +PVM. SIE exits are minimized as much as possible to improve speed and +reduce exposed guest state. + + +Interrupt injection +------------------- +Interrupt injection is safeguarded by the Ultravisor. As KVM doesn't +have access to the VCPUs' lowcores, injection is handled via the +format 4 state description. + +Machine check, external, IO and restart interruptions each can be +injected on SIE entry via a bit in the interrupt injection control +field (offset 0x54). If the guest cpu is not enabled for the interrupt +at the time of injection, a validity interception is recognized. The +format 4 state description contains fields in the interception data +block where data associated with the interrupt can be transported. + +Program and Service Call exceptions have another layer of +safeguarding; they can only be injected for instructions that have +been intercepted into KVM. The exceptions need to be a valid outcome +of an instruction emulation by KVM, e.g. we can never inject a +addressing exception as they are reported by SIE since KVM has no +access to the guest memory. + + +Mask notification interceptions +------------------------------- +KVM cannot intercept lctl(g) and lpsw(e) anymore in order to be +notified when a PVM enables a certain class of interrupt. As a +replacement, two new interception codes have been introduced: One +indicating that the contents of CRs 0, 6, or 14 have been changed, +indicating different interruption subclasses; and one indicating that +PSW bit 13 has been changed, indicating that a machine check +intervention was requested and those are now enabled. + +Instruction emulation +--------------------- +With the format 4 state description for PVMs, the SIE instruction already +interprets more instructions than it does with format 2. It is not able +to interpret every instruction, but needs to hand some tasks to KVM; +therefore, the SIE and the ultravisor safeguard emulation inputs and outputs. + +The control structures associated with SIE provide the Secure +Instruction Data Area (SIDA), the Interception Parameters (IP) and the +Secure Interception General Register Save Area. Guest GRs and most of +the instruction data, such as I/O data structures, are filtered. +Instruction data is copied to and from the SIDA when needed. Guest +GRs are put into / retrieved from the Secure Interception General +Register Save Area. + +Only GR values needed to emulate an instruction will be copied into this +save area and the real register numbers will be hidden. + +The Interception Parameters state description field still contains +the bytes of the instruction text, but with pre-set register values +instead of the actual ones. I.e. each instruction always uses the same +instruction text, in order not to leak guest instruction text. +This also implies that the register content that a guest had in r +may be in r from the hypervisor's point of view. + +The Secure Instruction Data Area contains instruction storage +data. Instruction data, i.e. data being referenced by an instruction +like the SCCB for sclp, is moved via the SIDA. When an instruction is +intercepted, the SIE will only allow data and program interrupts for +this instruction to be moved to the guest via the two data areas +discussed before. Other data is either ignored or results in validity +interceptions. 
+ + +Instruction emulation interceptions +----------------------------------- +There are two types of SIE secure instruction intercepts: the normal +and the notification type. Normal secure instruction intercepts will +make the guest pending for instruction completion of the intercepted +instruction type, i.e. on SIE entry it is attempted to complete +emulation of the instruction with the data provided by KVM. That might +be a program exception or instruction completion. + +The notification type intercepts inform KVM about guest environment +changes due to guest instruction interpretation. Such an interception +is recognized, for example, for the store prefix instruction to provide +the new lowcore location. On SIE reentry, any KVM data in the data areas +is ignored and execution continues as if the guest instruction had +completed. For that reason KVM is not allowed to inject a program +interrupt. + +Links +----- +`KVM Forum 2019 presentation `_ diff --git a/Documentation/virt/kvm/timekeeping.rst b/Documentation/virt/kvm/timekeeping.rst deleted file mode 100644 index 21ae7efa29ba..000000000000 --- a/Documentation/virt/kvm/timekeeping.rst +++ /dev/null @@ -1,645 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -====================================================== -Timekeeping Virtualization for X86-Based Architectures -====================================================== - -:Author: Zachary Amsden -:Copyright: (c) 2010, Red Hat. All rights reserved. - -.. Contents - - 1) Overview - 2) Timing Devices - 3) TSC Hardware - 4) Virtualization Problems - -1. Overview -=========== - -One of the most complicated parts of the X86 platform, and specifically, -the virtualization of this platform is the plethora of timing devices available -and the complexity of emulating those devices. In addition, virtualization of -time introduces a new set of challenges because it introduces a multiplexed -division of time beyond the control of the guest CPU. - -First, we will describe the various timekeeping hardware available, then -present some of the problems which arise and solutions available, giving -specific recommendations for certain classes of KVM guests. - -The purpose of this document is to collect data and information relevant to -timekeeping which may be difficult to find elsewhere, specifically, -information relevant to KVM and hardware-based virtualization. - -2. Timing Devices -================= - -First we discuss the basic hardware devices available. TSC and the related -KVM clock are special enough to warrant a full exposition and are described in -the following section. - -2.1. i8254 - PIT ----------------- - -One of the first timer devices available is the programmable interrupt timer, -or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three -channels which can be programmed to deliver periodic or one-shot interrupts. -These three channels can be configured in different modes and have individual -counters. Channel 1 and 2 were not available for general use in the original -IBM PC, and historically were connected to control RAM refresh and the PC -speaker. Now the PIT is typically integrated as part of an emulated chipset -and a separate physical PIT is not used. - -The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done -using single or multiple byte access to the I/O ports. There are 6 modes -available, but not all modes are available to all timers, as only timer 2 -has a connected gate input, required for modes 1 and 5. 
The gate line is -controlled by port 61h, bit 0, as illustrated in the following diagram:: - - -------------- ---------------- - | | | | - | 1.1932 MHz|---------->| CLOCK OUT | ---------> IRQ 0 - | Clock | | | | - -------------- | +->| GATE TIMER 0 | - | ---------------- - | - | ---------------- - | | | - |------>| CLOCK OUT | ---------> 66.3 KHZ DRAM - | | | (aka /dev/null) - | +->| GATE TIMER 1 | - | ---------------- - | - | ---------------- - | | | - |------>| CLOCK OUT | ---------> Port 61h, bit 5 - | | | - Port 61h, bit 0 -------->| GATE TIMER 2 | \_.---- ____ - ---------------- _| )--|LPF|---Speaker - / *---- \___/ - Port 61h, bit 1 ---------------------------------/ - -The timer modes are now described. - -Mode 0: Single Timeout. - This is a one-shot software timeout that counts down - when the gate is high (always true for timers 0 and 1). When the count - reaches zero, the output goes high. - -Mode 1: Triggered One-shot. - The output is initially set high. When the gate - line is set high, a countdown is initiated (which does not stop if the gate is - lowered), during which the output is set low. When the count reaches zero, - the output goes high. - -Mode 2: Rate Generator. - The output is initially set high. When the countdown - reaches 1, the output goes low for one count and then returns high. The value - is reloaded and the countdown automatically resumes. If the gate line goes - low, the count is halted. If the output is low when the gate is lowered, the - output automatically goes high (this only affects timer 2). - -Mode 3: Square Wave. - This generates a high / low square wave. The count - determines the length of the pulse, which alternates between high and low - when zero is reached. The count only proceeds when gate is high and is - automatically reloaded on reaching zero. The count is decremented twice at - each clock to generate a full high / low cycle at the full periodic rate. - If the count is even, the clock remains high for N/2 counts and low for N/2 - counts; if the clock is odd, the clock is high for (N+1)/2 counts and low - for (N-1)/2 counts. Only even values are latched by the counter, so odd - values are not observed when reading. This is the intended mode for timer 2, - which generates sine-like tones by low-pass filtering the square wave output. - -Mode 4: Software Strobe. - After programming this mode and loading the counter, - the output remains high until the counter reaches zero. Then the output - goes low for 1 clock cycle and returns high. The counter is not reloaded. - Counting only occurs when gate is high. - -Mode 5: Hardware Strobe. - After programming and loading the counter, the - output remains high. When the gate is raised, a countdown is initiated - (which does not stop if the gate is lowered). When the counter reaches zero, - the output goes low for 1 clock cycle and then returns high. The counter is - not reloaded. - -In addition to normal binary counting, the PIT supports BCD counting. The -command port, 0x43 is used to set the counter and mode for each of the three -timers. - -PIT commands, issued to port 0x43, using the following bit encoding:: - - Bit 7-4: Command (See table below) - Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined) - Bit 0 : Binary (0) / BCD (1) - -Command table:: - - 0000 - Latch Timer 0 count for port 0x40 - sample and hold the count to be read in port 0x40; - additional commands ignored until counter is read; - mode bits ignored. 
- - 0001 - Set Timer 0 LSB mode for port 0x40 - set timer to read LSB only and force MSB to zero; - mode bits set timer mode - - 0010 - Set Timer 0 MSB mode for port 0x40 - set timer to read MSB only and force LSB to zero; - mode bits set timer mode - - 0011 - Set Timer 0 16-bit mode for port 0x40 - set timer to read / write LSB first, then MSB; - mode bits set timer mode - - 0100 - Latch Timer 1 count for port 0x41 - as described above - 0101 - Set Timer 1 LSB mode for port 0x41 - as described above - 0110 - Set Timer 1 MSB mode for port 0x41 - as described above - 0111 - Set Timer 1 16-bit mode for port 0x41 - as described above - - 1000 - Latch Timer 2 count for port 0x42 - as described above - 1001 - Set Timer 2 LSB mode for port 0x42 - as described above - 1010 - Set Timer 2 MSB mode for port 0x42 - as described above - 1011 - Set Timer 2 16-bit mode for port 0x42 as described above - - 1101 - General counter latch - Latch combination of counters into corresponding ports - Bit 3 = Counter 2 - Bit 2 = Counter 1 - Bit 1 = Counter 0 - Bit 0 = Unused - - 1110 - Latch timer status - Latch combination of counter mode into corresponding ports - Bit 3 = Counter 2 - Bit 2 = Counter 1 - Bit 1 = Counter 0 - - The output of ports 0x40-0x42 following this command will be: - - Bit 7 = Output pin - Bit 6 = Count loaded (0 if timer has expired) - Bit 5-4 = Read / Write mode - 01 = MSB only - 10 = LSB only - 11 = LSB / MSB (16-bit) - Bit 3-1 = Mode - Bit 0 = Binary (0) / BCD mode (1) - -2.2. RTC --------- - -The second device which was available in the original PC was the MC146818 real -time clock. The original device is now obsolete, and usually emulated by the -system chipset, sometimes by an HPET and some frankenstein IRQ routing. - -The RTC is accessed through CMOS variables, which uses an index register to -control which bytes are read. Since there is only one index register, read -of the CMOS and read of the RTC require lock protection (in addition, it is -dangerous to allow userspace utilities such as hwclock to have direct RTC -access, as they could corrupt kernel reads and writes of CMOS memory). - -The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt -can function as a periodic timer, an additional once a day alarm, and can issue -interrupts after an update of the CMOS registers by the MC146818 is complete. -The type of interrupt is signalled in the RTC status registers. - -The RTC will update the current time fields by battery power even while the -system is off. The current time fields should not be read while an update is -in progress, as indicated in the status register. - -The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be -programmed to a 32kHz divider if the RTC is to count seconds. 
- -This is the RAM map originally used for the RTC/CMOS:: - - Location Size Description - ------------------------------------------ - 00h byte Current second (BCD) - 01h byte Seconds alarm (BCD) - 02h byte Current minute (BCD) - 03h byte Minutes alarm (BCD) - 04h byte Current hour (BCD) - 05h byte Hours alarm (BCD) - 06h byte Current day of week (BCD) - 07h byte Current day of month (BCD) - 08h byte Current month (BCD) - 09h byte Current year (BCD) - 0Ah byte Register A - bit 7 = Update in progress - bit 6-4 = Divider for clock - 000 = 4.194 MHz - 001 = 1.049 MHz - 010 = 32 kHz - 10X = test modes - 110 = reset / disable - 111 = reset / disable - bit 3-0 = Rate selection for periodic interrupt - 000 = periodic timer disabled - 001 = 3.90625 uS - 010 = 7.8125 uS - 011 = .122070 mS - 100 = .244141 mS - ... - 1101 = 125 mS - 1110 = 250 mS - 1111 = 500 mS - 0Bh byte Register B - bit 7 = Run (0) / Halt (1) - bit 6 = Periodic interrupt enable - bit 5 = Alarm interrupt enable - bit 4 = Update-ended interrupt enable - bit 3 = Square wave interrupt enable - bit 2 = BCD calendar (0) / Binary (1) - bit 1 = 12-hour mode (0) / 24-hour mode (1) - bit 0 = 0 (DST off) / 1 (DST enabled) - OCh byte Register C (read only) - bit 7 = interrupt request flag (IRQF) - bit 6 = periodic interrupt flag (PF) - bit 5 = alarm interrupt flag (AF) - bit 4 = update interrupt flag (UF) - bit 3-0 = reserved - ODh byte Register D (read only) - bit 7 = RTC has power - bit 6-0 = reserved - 32h byte Current century BCD (*) - (*) location vendor specific and now determined from ACPI global tables - -2.3. APIC ---------- - -On Pentium and later processors, an on-board timer is available to each CPU -as part of the Advanced Programmable Interrupt Controller. The APIC is -accessed through memory-mapped registers and provides interrupt service to each -CPU, used for IPIs and local timer interrupts. - -Although in theory the APIC is a safe and stable source for local interrupts, -in practice, many bugs and glitches have occurred due to the special nature of -the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect -the use of the APIC and that workarounds may be required. In addition, some of -these workarounds pose unique constraints for virtualization - requiring either -extra overhead incurred from extra reads of memory-mapped I/O or additional -functionality that may be more computationally expensive to implement. - -Since the APIC is documented quite well in the Intel and AMD manuals, we will -avoid repetition of the detail here. It should be pointed out that the APIC -timer is programmed through the LVT (local vector timer) register, is capable -of one-shot or periodic operation, and is based on the bus clock divided down -by the programmable divider register. - -2.4. HPET ---------- - -HPET is quite complex, and was originally intended to replace the PIT / RTC -support of the X86 PC. It remains to be seen whether that will be the case, as -the de facto standard of PC hardware is to emulate these older devices. Some -systems designated as legacy free may support only the HPET as a hardware timer -device. - -The HPET spec is rather loose and vague, requiring at least 3 hardware timers, -but allowing implementation freedom to support many more. It also imposes no -fixed rate on the timer frequency, but does impose some extremal values on -frequency, error and slew. 
- -In general, the HPET is recommended as a high precision (compared to PIT /RTC) -time source which is independent of local variation (as there is only one HPET -in any given system). The HPET is also memory-mapped, and its presence is -indicated through ACPI tables by the BIOS. - -Detailed specification of the HPET is beyond the current scope of this -document, as it is also very well documented elsewhere. - -2.5. Offboard Timers --------------------- - -Several cards, both proprietary (watchdog boards) and commonplace (e1000) have -timing chips built into the cards which may have registers which are accessible -to kernel or user drivers. To the author's knowledge, using these to generate -a clocksource for a Linux or other kernel has not yet been attempted and is in -general frowned upon as not playing by the agreed rules of the game. Such a -timer device would require additional support to be virtualized properly and is -not considered important at this time as no known operating system does this. - -3. TSC Hardware -=============== - -The TSC or time stamp counter is relatively simple in theory; it counts -instruction cycles issued by the processor, which can be used as a measure of -time. In practice, due to a number of problems, it is the most complicated -timekeeping device to use. - -The TSC is represented internally as a 64-bit MSR which can be read with the -RDMSR, RDTSC, or RDTSCP (when available) instructions. In the past, hardware -limitations made it possible to write the TSC, but generally on old hardware it -was only possible to write the low 32-bits of the 64-bit counter, and the upper -32-bits of the counter were cleared. Now, however, on Intel processors family -0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction -has been lifted and all 64-bits are writable. On AMD systems, the ability to -write the TSC MSR is not an architectural guarantee. - -The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by -means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access. - -Some vendors have implemented an additional instruction, RDTSCP, which returns -atomically not just the TSC, but an indicator which corresponds to the -processor number. This can be used to index into an array of TSC variables to -determine offset information in SMP systems where TSCs are not synchronized. -The presence of this instruction must be determined by consulting CPUID feature -bits. - -Both VMX and SVM provide extension fields in the virtualization hardware which -allows the guest visible TSC to be offset by a constant. Newer implementations -promise to allow the TSC to additionally be scaled, but this hardware is not -yet widely available. - -3.1. TSC synchronization ------------------------- - -The TSC is a CPU-local clock in most implementations. This means, on SMP -platforms, the TSCs of different CPUs may start at different times depending -on when the CPUs are powered on. Generally, CPUs on the same die will share -the same clock, however, this is not always the case. - -The BIOS may attempt to resynchronize the TSCs during the poweron process and -the operating system or other system software may attempt to do this as well. -Several hardware limitations make the problem worse - if it is not possible to -write the full 64-bits of the TSC, it may be impossible to match the TSC in -newly arriving CPUs to that of the rest of the system, resulting in -unsynchronized TSCs. 
This may be done by BIOS or system software, but in -practice, getting a perfectly synchronized TSC will not be possible unless all -values are read from the same clock, which generally only is possible on single -socket systems or those with special hardware support. - -3.2. TSC and CPU hotplug ------------------------- - -As touched on already, CPUs which arrive later than the boot time of the system -may not have a TSC value that is synchronized with the rest of the system. -Either system software, BIOS, or SMM code may actually try to establish the TSC -to a value matching the rest of the system, but a perfect match is usually not -a guarantee. This can have the effect of bringing a system from a state where -TSC is synchronized back to a state where TSC synchronization flaws, however -small, may be exposed to the OS and any virtualization environment. - -3.3. TSC and multi-socket / NUMA --------------------------------- - -Multi-socket systems, especially large multi-socket systems are likely to have -individual clocksources rather than a single, universally distributed clock. -Since these clocks are driven by different crystals, they will not have -perfectly matched frequency, and temperature and electrical variations will -cause the CPU clocks, and thus the TSCs to drift over time. Depending on the -exact clock and bus design, the drift may or may not be fixed in absolute -error, and may accumulate over time. - -In addition, very large systems may deliberately slew the clocks of individual -cores. This technique, known as spread-spectrum clocking, reduces EMI at the -clock frequency and harmonics of it, which may be required to pass FCC -standards for telecommunications and computer equipment. - -It is recommended not to trust the TSCs to remain synchronized on NUMA or -multiple socket systems for these reasons. - -3.4. TSC and C-states ---------------------- - -C-states, or idling states of the processor, especially C1E and deeper sleep -states may be problematic for TSC as well. The TSC may stop advancing in such -a state, resulting in a TSC which is behind that of other CPUs when execution -is resumed. Such CPUs must be detected and flagged by the operating system -based on CPU and chipset identifications. - -The TSC in such a case may be corrected by catching it up to a known external -clocksource. - -3.5. TSC frequency change / P-states ------------------------------------- - -To make things slightly more interesting, some CPUs may change frequency. They -may or may not run the TSC at the same rate, and because the frequency change -may be staggered or slewed, at some points in time, the TSC rate may not be -known other than falling within a range of values. In this case, the TSC will -not be a stable time source, and must be calibrated against a known, stable, -external clock to be a usable source of time. - -Whether the TSC runs at a constant rate or scales with the P-state is model -dependent and must be determined by inspecting CPUID, chipset or vendor -specific MSR fields. - -In addition, some vendors have known bugs where the P-state is actually -compensated for properly during normal operation, but when the processor is -inactive, the P-state may be raised temporarily to service cache misses from -other processors. In such cases, the TSC on halted CPUs could advance faster -than that of non-halted processors. AMD Turion processors are known to have -this problem. - -3.6. 
TSC and STPCLK / T-states ------------------------------- - -External signals given to the processor may also have the effect of stopping -the TSC. This is typically done for thermal emergency power control to prevent -an overheating condition, and typically, there is no way to detect that this -condition has happened. - -3.7. TSC virtualization - VMX ------------------------------ - -VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP -instructions, which is enough for full virtualization of TSC in any manner. In -addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET -field specified in the VMCS. Special instructions must be used to read and -write the VMCS field. - -3.8. TSC virtualization - SVM ------------------------------ - -SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP -instructions, which is enough for full virtualization of TSC in any manner. In -addition, SVM allows passing through the host TSC plus an additional offset -field specified in the SVM control block. - -3.9. TSC feature bits in Linux ------------------------------- - -In summary, there is no way to guarantee the TSC remains in perfect -synchronization unless it is explicitly guaranteed by the architecture. Even -if so, the TSCs in multi-sockets or NUMA systems may still run independently -despite being locally consistent. - -The following feature bits are used by Linux to signal various TSC attributes, -but they can only be taken to be meaningful for UP or single node systems. - -========================= ======================================= -X86_FEATURE_TSC The TSC is available in hardware -X86_FEATURE_RDTSCP The RDTSCP instruction is available -X86_FEATURE_CONSTANT_TSC The TSC rate is unchanged with P-states -X86_FEATURE_NONSTOP_TSC The TSC does not stop in C-states -X86_FEATURE_TSC_RELIABLE TSC sync checks are skipped (VMware) -========================= ======================================= - -4. Virtualization Problems -========================== - -Timekeeping is especially problematic for virtualization because a number of -challenges arise. The most obvious problem is that time is now shared between -the host and, potentially, a number of virtual machines. Thus the virtual -operating system does not run with 100% usage of the CPU, despite the fact that -it may very well make that assumption. It may expect it to remain true to very -exacting bounds when interrupt sources are disabled, but in reality only its -virtual interrupt sources are disabled, and the machine may still be preempted -at any time. This causes problems as the passage of real time, the injection -of machine interrupts and the associated clock sources are no longer completely -synchronized with real time. - -This same problem can occur on native hardware to a degree, as SMM mode may -steal cycles from the naturally on X86 systems when SMM mode is used by the -BIOS, but not in such an extreme fashion. However, the fact that SMM mode may -cause similar problems to virtualization makes it a good justification for -solving many of these problems on bare metal. - -4.1. Interrupt clocking ------------------------ - -One of the most immediate problems that occurs with legacy operating systems -is that the system timekeeping routines are often designed to keep track of -time by counting periodic interrupts. 
These interrupts may come from the PIT -or the RTC, but the problem is the same: the host virtualization engine may not -be able to deliver the proper number of interrupts per second, and so guest -time may fall behind. This is especially problematic if a high interrupt rate -is selected, such as 1000 HZ, which is unfortunately the default for many Linux -guests. - -There are three approaches to solving this problem; first, it may be possible -to simply ignore it. Guests which have a separate time source for tracking -'wall clock' or 'real time' may not need any adjustment of their interrupts to -maintain proper time. If this is not sufficient, it may be necessary to inject -additional interrupts into the guest in order to increase the effective -interrupt rate. This approach leads to complications in extreme conditions, -where host load or guest lag is too much to compensate for, and thus another -solution to the problem has risen: the guest may need to become aware of lost -ticks and compensate for them internally. Although promising in theory, the -implementation of this policy in Linux has been extremely error prone, and a -number of buggy variants of lost tick compensation are distributed across -commonly used Linux systems. - -Windows uses periodic RTC clocking as a means of keeping time internally, and -thus requires interrupt slewing to keep proper time. It does use a low enough -rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in -practice. - -4.2. TSC sampling and serialization ------------------------------------ - -As the highest precision time source available, the cycle counter of the CPU -has aroused much interest from developers. As explained above, this timer has -many problems unique to its nature as a local, potentially unstable and -potentially unsynchronized source. One issue which is not unique to the TSC, -but is highlighted because of its very precise nature is sampling delay. By -definition, the counter, once read is already old. However, it is also -possible for the counter to be read ahead of the actual use of the result. -This is a consequence of the superscalar execution of the instruction stream, -which may execute instructions out of order. Such execution is called -non-serialized. Forcing serialized execution is necessary for precise -measurement with the TSC, and requires a serializing instruction, such as CPUID -or an MSR read. - -Since CPUID may actually be virtualized by a trap and emulate mechanism, this -serialization can pose a performance issue for hardware virtualization. An -accurate time stamp counter reading may therefore not always be available, and -it may be necessary for an implementation to guard against "backwards" reads of -the TSC as seen from other CPUs, even in an otherwise perfectly synchronized -system. - -4.3. Timespec aliasing ----------------------- - -Additionally, this lack of serialization from the TSC poses another challenge -when using results of the TSC when measured against another time source. As -the TSC is much higher precision, many possible values of the TSC may be read -while another clock is still expressing the same value. - -That is, you may read (T,T+10) while external clock C maintains the same value. -Due to non-serialized reads, you may actually end up with a range which -fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but -calibrated against an external value may have a range of valid values. 
-Re-calibrating this computation may actually cause time, as computed after the -calibration, to go backwards, compared with time computed before the -calibration. - -This problem is particularly pronounced with an internal time source in Linux, -the kernel time, which is expressed in the theoretically high resolution -timespec - but which advances in much larger granularity intervals, sometimes -at the rate of jiffies, and possibly in catchup modes, at a much larger step. - -This aliasing requires care in the computation and recalibration of kvmclock -and any other values derived from TSC computation (such as TSC virtualization -itself). - -4.4. Migration --------------- - -Migration of a virtual machine raises problems for timekeeping in two ways. -First, the migration itself may take time, during which interrupts cannot be -delivered, and after which, the guest time may need to be caught up. NTP may -be able to help to some degree here, as the clock correction required is -typically small enough to fall in the NTP-correctable window. - -An additional concern is that timers based off the TSC (or HPET, if the raw bus -clock is exposed) may now be running at different rates, requiring compensation -in some way in the hypervisor by virtualizing these timers. In addition, -migrating to a faster machine may preclude the use of a passthrough TSC, as a -faster clock cannot be made visible to a guest without the potential of time -advancing faster than usual. A slower clock is less of a problem, as it can -always be caught up to the original rate. KVM clock avoids these problems by -simply storing multipliers and offsets against the TSC for the guest to convert -back into nanosecond resolution values. - -4.5. Scheduling ---------------- - -Since scheduling may be based on precise timing and firing of interrupts, the -scheduling algorithms of an operating system may be adversely affected by -virtualization. In theory, the effect is random and should be universally -distributed, but in contrived as well as real scenarios (guest device access, -causes of virtualization exits, possible context switch), this may not always -be the case. The effect of this has not been well studied. - -In an attempt to work around this, several implementations have provided a -paravirtualized scheduler clock, which reveals the true amount of CPU time for -which a virtual machine has been running. - -4.6. Watchdogs --------------- - -Watchdog timers, such as the lock detector in Linux may fire accidentally when -running under hardware virtualization due to timer interrupts being delayed or -misinterpretation of the passage of real time. Usually, these warnings are -spurious and can be ignored, but in some circumstances it may be necessary to -disable such detection. - -4.7. Delays and precision timing --------------------------------- - -Precise timing and delays may not be possible in a virtualized system. This -can happen if the system is controlling physical hardware, or issues delays to -compensate for slower I/O to and from devices. The first issue is not solvable -in general for a virtualized system; hardware control software can't be -adequately virtualized without a full real-time operating system, which would -require an RT aware virtualization platform. - -The second issue may cause performance problems, but this is unlikely to be a -significant issue. In many cases these delays may be eliminated through -configuration or paravirtualization. - -4.8. 
Covert channels and leaks ------------------------------- - -In addition to the above problems, time information will inevitably leak to the -guest about the host in anything but a perfect implementation of virtualized -time. This may allow the guest to infer the presence of a hypervisor (as in a -red-pill type detection), and it may allow information to leak between guests -by using CPU utilization itself as a signalling channel. Preventing such -problems would require completely isolated virtual time which may not track -real time any longer. This may be useful in certain security or QA contexts, -but in general isn't recommended for real-world deployment scenarios. diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst new file mode 100644 index 000000000000..1c6847fff304 --- /dev/null +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst @@ -0,0 +1,445 @@ +====================================== +Secure Encrypted Virtualization (SEV) +====================================== + +Overview +======== + +Secure Encrypted Virtualization (SEV) is a feature found on AMD processors. + +SEV is an extension to the AMD-V architecture which supports running +virtual machines (VMs) under the control of a hypervisor. When enabled, +the memory contents of a VM will be transparently encrypted with a key +unique to that VM. + +The hypervisor can determine the SEV support through the CPUID +instruction. The CPUID function 0x8000001f reports information related +to SEV:: + + 0x8000001f[eax]: + Bit[1] indicates support for SEV + ... + [ecx]: + Bits[31:0] Number of encrypted guests supported simultaneously + +If support for SEV is present, MSR 0xc001_0010 (MSR_AMD64_SYSCFG) and MSR 0xc001_0015 +(MSR_K7_HWCR) can be used to determine if it can be enabled:: + + 0xc001_0010: + Bit[23] 1 = memory encryption can be enabled + 0 = memory encryption can not be enabled + + 0xc001_0015: + Bit[0] 1 = memory encryption can be enabled + 0 = memory encryption can not be enabled + +When SEV support is available, it can be enabled in a specific VM by +setting the SEV bit before executing VMRUN.:: + + VMCB[0x90]: + Bit[1] 1 = SEV is enabled + 0 = SEV is disabled + +SEV hardware uses ASIDs to associate a memory encryption key with a VM. +Hence, the ASID for the SEV-enabled guests must be from 1 to a maximum value +defined in the CPUID 0x8000001f[ecx] field. + +SEV Key Management +================== + +The SEV guest key management is handled by a separate processor called the AMD +Secure Processor (AMD-SP). Firmware running inside the AMD-SP provides a secure +key management interface to perform common hypervisor activities such as +encrypting bootstrap code, snapshot, migrating and debugging the guest. For more +information, see the SEV Key Management spec [api-spec]_ + +The main ioctl to access SEV is KVM_MEMORY_ENCRYPT_OP. If the argument +to KVM_MEMORY_ENCRYPT_OP is NULL, the ioctl returns 0 if SEV is enabled +and ``ENOTTY` if it is disabled (on some older versions of Linux, +the ioctl runs normally even with a NULL argument, and therefore will +likely return ``EFAULT``). If non-NULL, the argument to KVM_MEMORY_ENCRYPT_OP +must be a struct kvm_sev_cmd:: + + struct kvm_sev_cmd { + __u32 id; + __u64 data; + __u32 error; + __u32 sev_fd; + }; + + +The ``id`` field contains the subcommand, and the ``data`` field points to +another struct containing arguments specific to command. 
The ``sev_fd`` +should point to a file descriptor that is opened on the ``/dev/sev`` +device, if needed (see individual commands). + +On output, ``error`` is zero on success, or an error code. Error codes +are defined in ````. + +KVM implements the following commands to support common lifecycle events of SEV +guests, such as launching, running, snapshotting, migrating and decommissioning. + +1. KVM_SEV_INIT +--------------- + +The KVM_SEV_INIT command is used by the hypervisor to initialize the SEV platform +context. In a typical workflow, this command should be the first command issued. + +The firmware can be initialized either by using its own non-volatile storage or +the OS can manage the NV storage for the firmware using the module parameter +``init_ex_path``. The file specified by ``init_ex_path`` must exist. To create +a new NV storage file allocate the file with 32KB bytes of 0xFF as required by +the SEV spec. + +Returns: 0 on success, -negative on error + +2. KVM_SEV_LAUNCH_START +----------------------- + +The KVM_SEV_LAUNCH_START command is used for creating the memory encryption +context. To create the encryption context, user must provide a guest policy, +the owner's public Diffie-Hellman (PDH) key and session information. + +Parameters: struct kvm_sev_launch_start (in/out) + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_launch_start { + __u32 handle; /* if zero then firmware creates a new handle */ + __u32 policy; /* guest's policy */ + + __u64 dh_uaddr; /* userspace address pointing to the guest owner's PDH key */ + __u32 dh_len; + + __u64 session_addr; /* userspace address which points to the guest session information */ + __u32 session_len; + }; + +On success, the 'handle' field contains a new handle and on error, a negative value. + +KVM_SEV_LAUNCH_START requires the ``sev_fd`` field to be valid. + +For more details, see SEV spec Section 6.2. + +3. KVM_SEV_LAUNCH_UPDATE_DATA +----------------------------- + +The KVM_SEV_LAUNCH_UPDATE_DATA is used for encrypting a memory region. It also +calculates a measurement of the memory contents. The measurement is a signature +of the memory contents that can be sent to the guest owner as an attestation +that the memory was encrypted correctly by the firmware. + +Parameters (in): struct kvm_sev_launch_update_data + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_launch_update { + __u64 uaddr; /* userspace address to be encrypted (must be 16-byte aligned) */ + __u32 len; /* length of the data to be encrypted (must be 16-byte aligned) */ + }; + +For more details, see SEV spec Section 6.3. + +4. KVM_SEV_LAUNCH_MEASURE +------------------------- + +The KVM_SEV_LAUNCH_MEASURE command is used to retrieve the measurement of the +data encrypted by the KVM_SEV_LAUNCH_UPDATE_DATA command. The guest owner may +wait to provide the guest with confidential information until it can verify the +measurement. Since the guest owner knows the initial contents of the guest at +boot, the measurement can be verified by comparing it to what the guest owner +expects. + +If len is zero on entry, the measurement blob length is written to len and +uaddr is unused. + +Parameters (in): struct kvm_sev_launch_measure + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_launch_measure { + __u64 uaddr; /* where to copy the measurement */ + __u32 len; /* length of measurement blob */ + }; + +For more details on the measurement verification flow, see SEV spec Section 6.4. + +5. 
KVM_SEV_LAUNCH_FINISH +------------------------ + +After completion of the launch flow, the KVM_SEV_LAUNCH_FINISH command can be +issued to make the guest ready for the execution. + +Returns: 0 on success, -negative on error + +6. KVM_SEV_GUEST_STATUS +----------------------- + +The KVM_SEV_GUEST_STATUS command is used to retrieve status information about a +SEV-enabled guest. + +Parameters (out): struct kvm_sev_guest_status + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_guest_status { + __u32 handle; /* guest handle */ + __u32 policy; /* guest policy */ + __u8 state; /* guest state (see enum below) */ + }; + +SEV guest state: + +:: + + enum { + SEV_STATE_INVALID = 0; + SEV_STATE_LAUNCHING, /* guest is currently being launched */ + SEV_STATE_SECRET, /* guest is being launched and ready to accept the ciphertext data */ + SEV_STATE_RUNNING, /* guest is fully launched and running */ + SEV_STATE_RECEIVING, /* guest is being migrated in from another SEV machine */ + SEV_STATE_SENDING /* guest is getting migrated out to another SEV machine */ + }; + +7. KVM_SEV_DBG_DECRYPT +---------------------- + +The KVM_SEV_DEBUG_DECRYPT command can be used by the hypervisor to request the +firmware to decrypt the data at the given memory region. + +Parameters (in): struct kvm_sev_dbg + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_dbg { + __u64 src_uaddr; /* userspace address of data to decrypt */ + __u64 dst_uaddr; /* userspace address of destination */ + __u32 len; /* length of memory region to decrypt */ + }; + +The command returns an error if the guest policy does not allow debugging. + +8. KVM_SEV_DBG_ENCRYPT +---------------------- + +The KVM_SEV_DEBUG_ENCRYPT command can be used by the hypervisor to request the +firmware to encrypt the data at the given memory region. + +Parameters (in): struct kvm_sev_dbg + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_dbg { + __u64 src_uaddr; /* userspace address of data to encrypt */ + __u64 dst_uaddr; /* userspace address of destination */ + __u32 len; /* length of memory region to encrypt */ + }; + +The command returns an error if the guest policy does not allow debugging. + +9. KVM_SEV_LAUNCH_SECRET +------------------------ + +The KVM_SEV_LAUNCH_SECRET command can be used by the hypervisor to inject secret +data after the measurement has been validated by the guest owner. + +Parameters (in): struct kvm_sev_launch_secret + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_launch_secret { + __u64 hdr_uaddr; /* userspace address containing the packet header */ + __u32 hdr_len; + + __u64 guest_uaddr; /* the guest memory region where the secret should be injected */ + __u32 guest_len; + + __u64 trans_uaddr; /* the hypervisor memory region which contains the secret */ + __u32 trans_len; + }; + +10. KVM_SEV_GET_ATTESTATION_REPORT +---------------------------------- + +The KVM_SEV_GET_ATTESTATION_REPORT command can be used by the hypervisor to query the attestation +report containing the SHA-256 digest of the guest memory and VMSA passed through the KVM_SEV_LAUNCH +commands and signed with the PEK. The digest returned by the command should match the digest +used by the guest owner with the KVM_SEV_LAUNCH_MEASURE. + +If len is zero on entry, the measurement blob length is written to len and +uaddr is unused. 
+ +Parameters (in): struct kvm_sev_attestation + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_attestation_report { + __u8 mnonce[16]; /* A random mnonce that will be placed in the report */ + + __u64 uaddr; /* userspace address where the report should be copied */ + __u32 len; + }; + +11. KVM_SEV_SEND_START +---------------------- + +The KVM_SEV_SEND_START command can be used by the hypervisor to create an +outgoing guest encryption context. + +If session_len is zero on entry, the length of the guest session information is +written to session_len and all other fields are not used. + +Parameters (in): struct kvm_sev_send_start + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_send_start { + __u32 policy; /* guest policy */ + + __u64 pdh_cert_uaddr; /* platform Diffie-Hellman certificate */ + __u32 pdh_cert_len; + + __u64 plat_certs_uaddr; /* platform certificate chain */ + __u32 plat_certs_len; + + __u64 amd_certs_uaddr; /* AMD certificate */ + __u32 amd_certs_len; + + __u64 session_uaddr; /* Guest session information */ + __u32 session_len; + }; + +12. KVM_SEV_SEND_UPDATE_DATA +---------------------------- + +The KVM_SEV_SEND_UPDATE_DATA command can be used by the hypervisor to encrypt the +outgoing guest memory region with the encryption context creating using +KVM_SEV_SEND_START. + +If hdr_len or trans_len are zero on entry, the length of the packet header and +transport region are written to hdr_len and trans_len respectively, and all +other fields are not used. + +Parameters (in): struct kvm_sev_send_update_data + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_launch_send_update_data { + __u64 hdr_uaddr; /* userspace address containing the packet header */ + __u32 hdr_len; + + __u64 guest_uaddr; /* the source memory region to be encrypted */ + __u32 guest_len; + + __u64 trans_uaddr; /* the destination memory region */ + __u32 trans_len; + }; + +13. KVM_SEV_SEND_FINISH +------------------------ + +After completion of the migration flow, the KVM_SEV_SEND_FINISH command can be +issued by the hypervisor to delete the encryption context. + +Returns: 0 on success, -negative on error + +14. KVM_SEV_SEND_CANCEL +------------------------ + +After completion of SEND_START, but before SEND_FINISH, the source VMM can issue the +SEND_CANCEL command to stop a migration. This is necessary so that a cancelled +migration can restart with a new target later. + +Returns: 0 on success, -negative on error + +15. KVM_SEV_RECEIVE_START +------------------------- + +The KVM_SEV_RECEIVE_START command is used for creating the memory encryption +context for an incoming SEV guest. To create the encryption context, the user must +provide a guest policy, the platform public Diffie-Hellman (PDH) key and session +information. + +Parameters: struct kvm_sev_receive_start (in/out) + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_receive_start { + __u32 handle; /* if zero then firmware creates a new handle */ + __u32 policy; /* guest's policy */ + + __u64 pdh_uaddr; /* userspace address pointing to the PDH key */ + __u32 pdh_len; + + __u64 session_uaddr; /* userspace address which points to the guest session information */ + __u32 session_len; + }; + +On success, the 'handle' field contains a new handle and on error, a negative value. + +For more details, see SEV spec Section 6.12. + +16. 
KVM_SEV_RECEIVE_UPDATE_DATA +------------------------------- + +The KVM_SEV_RECEIVE_UPDATE_DATA command can be used by the hypervisor to copy +the incoming buffers into the guest memory region with encryption context +created during the KVM_SEV_RECEIVE_START. + +Parameters (in): struct kvm_sev_receive_update_data + +Returns: 0 on success, -negative on error + +:: + + struct kvm_sev_launch_receive_update_data { + __u64 hdr_uaddr; /* userspace address containing the packet header */ + __u32 hdr_len; + + __u64 guest_uaddr; /* the destination guest memory region */ + __u32 guest_len; + + __u64 trans_uaddr; /* the incoming buffer memory region */ + __u32 trans_len; + }; + +17. KVM_SEV_RECEIVE_FINISH +-------------------------- + +After completion of the migration flow, the KVM_SEV_RECEIVE_FINISH command can be +issued by the hypervisor to make the guest ready for execution. + +Returns: 0 on success, -negative on error + +References +========== + + +See [white-paper]_, [api-spec]_, [amd-apm]_ and [kvm-forum]_ for more info. + +.. [white-paper] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf +.. [api-spec] https://support.amd.com/TechDocs/55766_SEV-KM_API_Specification.pdf +.. [amd-apm] https://support.amd.com/TechDocs/24593.pdf (section 15.34) +.. [kvm-forum] https://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf diff --git a/Documentation/virt/kvm/x86/cpuid.rst b/Documentation/virt/kvm/x86/cpuid.rst new file mode 100644 index 000000000000..bda3e3e737d7 --- /dev/null +++ b/Documentation/virt/kvm/x86/cpuid.rst @@ -0,0 +1,124 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============== +KVM CPUID bits +============== + +:Author: Glauber Costa + +A guest running on a kvm host, can check some of its features using +cpuid. This is not always guaranteed to work, since userspace can +mask-out some, or even all KVM-related cpuid features before launching +a guest. + +KVM cpuid functions are: + +function: KVM_CPUID_SIGNATURE (0x40000000) + +returns:: + + eax = 0x40000001 + ebx = 0x4b4d564b + ecx = 0x564b4d56 + edx = 0x4d + +Note that this value in ebx, ecx and edx corresponds to the string "KVMKVMKVM". +The value in eax corresponds to the maximum cpuid function present in this leaf, +and will be updated if more functions are added in the future. +Note also that old hosts set eax value to 0x0. This should +be interpreted as if the value was 0x40000001. +This function queries the presence of KVM cpuid leafs. 
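+
+For illustration only, a guest could probe for this leaf from C roughly
+as follows; this snippet is not part of the KVM API documentation, it
+simply executes CPUID and compares the signature registers against the
+"KVMKVMKVM" string described above (guests normally also check the
+hypervisor bit, CPUID.1:ECX bit 31, before trusting the 0x4000xxxx
+leaf range)::
+
+    #include <stdint.h>
+    #include <string.h>
+
+    static void cpuid(uint32_t leaf, uint32_t *a, uint32_t *b,
+                      uint32_t *c, uint32_t *d)
+    {
+            asm volatile("cpuid"
+                         : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
+                         : "a"(leaf), "c"(0));
+    }
+
+    static int running_on_kvm(void)
+    {
+            uint32_t eax, sig[3];
+
+            cpuid(0x40000000, &eax, &sig[0], &sig[1], &sig[2]);
+            /* ebx/ecx/edx spell "KVMKVMKVM\0\0\0" on a KVM host. */
+            return memcmp(sig, "KVMKVMKVM\0\0\0", 12) == 0;
+    }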
+ +function: define KVM_CPUID_FEATURES (0x40000001) + +returns:: + + ebx, ecx + eax = an OR'ed group of (1 << flag) + +where ``flag`` is defined as below: + +================================== =========== ================================ +flag value meaning +================================== =========== ================================ +KVM_FEATURE_CLOCKSOURCE 0 kvmclock available at msrs + 0x11 and 0x12 + +KVM_FEATURE_NOP_IO_DELAY 1 not necessary to perform delays + on PIO operations + +KVM_FEATURE_MMU_OP 2 deprecated + +KVM_FEATURE_CLOCKSOURCE2 3 kvmclock available at msrs + 0x4b564d00 and 0x4b564d01 + +KVM_FEATURE_ASYNC_PF 4 async pf can be enabled by + writing to msr 0x4b564d02 + +KVM_FEATURE_STEAL_TIME 5 steal time can be enabled by + writing to msr 0x4b564d03 + +KVM_FEATURE_PV_EOI 6 paravirtualized end of interrupt + handler can be enabled by + writing to msr 0x4b564d04 + +KVM_FEATURE_PV_UNHALT 7 guest checks this feature bit + before enabling paravirtualized + spinlock support + +KVM_FEATURE_PV_TLB_FLUSH 9 guest checks this feature bit + before enabling paravirtualized + tlb flush + +KVM_FEATURE_ASYNC_PF_VMEXIT 10 paravirtualized async PF VM EXIT + can be enabled by setting bit 2 + when writing to msr 0x4b564d02 + +KVM_FEATURE_PV_SEND_IPI 11 guest checks this feature bit + before enabling paravirtualized + send IPIs + +KVM_FEATURE_POLL_CONTROL 12 host-side polling on HLT can + be disabled by writing + to msr 0x4b564d05. + +KVM_FEATURE_PV_SCHED_YIELD 13 guest checks this feature bit + before using paravirtualized + sched yield. + +KVM_FEATURE_ASYNC_PF_INT 14 guest checks this feature bit + before using the second async + pf control msr 0x4b564d06 and + async pf acknowledgment msr + 0x4b564d07. + +KVM_FEATURE_MSI_EXT_DEST_ID 15 guest checks this feature bit + before using extended destination + ID bits in MSI address bits 11-5. + +KVM_FEATURE_HC_MAP_GPA_RANGE 16 guest checks this feature bit before + using the map gpa range hypercall + to notify the page state change + +KVM_FEATURE_MIGRATION_CONTROL 17 guest checks this feature bit before + using MSR_KVM_MIGRATION_CONTROL + +KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24 host will warn if no guest-side + per-cpu warps are expected in + kvmclock +================================== =========== ================================ + +:: + + edx = an OR'ed group of (1 << flag) + +Where ``flag`` here is defined as below: + +================== ============ ================================= +flag value meaning +================== ============ ================================= +KVM_HINTS_REALTIME 0 guest checks this feature bit to + determine that vCPUs are never + preempted for an unlimited time + allowing optimizations +================== ============ ================================= diff --git a/Documentation/virt/kvm/x86/halt-polling.rst b/Documentation/virt/kvm/x86/halt-polling.rst new file mode 100644 index 000000000000..4922e4a15f18 --- /dev/null +++ b/Documentation/virt/kvm/x86/halt-polling.rst @@ -0,0 +1,140 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=========================== +The KVM halt polling system +=========================== + +The KVM halt polling system provides a feature within KVM whereby the latency +of a guest can, under some circumstances, be reduced by polling in the host +for some time period after the guest has elected to no longer run by cedeing. 
+That is, when a guest vcpu has ceded, or in the case of powerpc when all of the +vcpus of a single vcore have ceded, the host kernel polls for wakeup conditions +before giving up the cpu to the scheduler in order to let something else run. + +Polling provides a latency advantage in cases where the guest can be run again +very quickly by at least saving us a trip through the scheduler, normally on +the order of a few micro-seconds, although performance benefits are workload +dependant. In the event that no wakeup source arrives during the polling +interval or some other task on the runqueue is runnable the scheduler is +invoked. Thus halt polling is especially useful on workloads with very short +wakeup periods where the time spent halt polling is minimised and the time +savings of not invoking the scheduler are distinguishable. + +The generic halt polling code is implemented in: + + virt/kvm/kvm_main.c: kvm_vcpu_block() + +The powerpc kvm-hv specific case is implemented in: + + arch/powerpc/kvm/book3s_hv.c: kvmppc_vcore_blocked() + +Halt Polling Interval +===================== + +The maximum time for which to poll before invoking the scheduler, referred to +as the halt polling interval, is increased and decreased based on the perceived +effectiveness of the polling in an attempt to limit pointless polling. +This value is stored in either the vcpu struct: + + kvm_vcpu->halt_poll_ns + +or in the case of powerpc kvm-hv, in the vcore struct: + + kvmppc_vcore->halt_poll_ns + +Thus this is a per vcpu (or vcore) value. + +During polling if a wakeup source is received within the halt polling interval, +the interval is left unchanged. In the event that a wakeup source isn't +received during the polling interval (and thus schedule is invoked) there are +two options, either the polling interval and total block time[0] were less than +the global max polling interval (see module params below), or the total block +time was greater than the global max polling interval. + +In the event that both the polling interval and total block time were less than +the global max polling interval then the polling interval can be increased in +the hope that next time during the longer polling interval the wake up source +will be received while the host is polling and the latency benefits will be +received. The polling interval is grown in the function grow_halt_poll_ns() and +is multiplied by the module parameters halt_poll_ns_grow and +halt_poll_ns_grow_start. + +In the event that the total block time was greater than the global max polling +interval then the host will never poll for long enough (limited by the global +max) to wakeup during the polling interval so it may as well be shrunk in order +to avoid pointless polling. The polling interval is shrunk in the function +shrink_halt_poll_ns() and is divided by the module parameter +halt_poll_ns_shrink, or set to 0 iff halt_poll_ns_shrink == 0. + +It is worth noting that this adjustment process attempts to hone in on some +steady state polling interval but will only really do a good job for wakeups +which come at an approximately constant rate, otherwise there will be constant +adjustment of the polling interval. + +[0] total block time: + the time between when the halt polling function is + invoked and a wakeup source received (irrespective of + whether the scheduler is invoked within that function). 
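+
+As a rough illustration, the grow/shrink behaviour described above
+amounts to something like the following simplified sketch; the real
+grow_halt_poll_ns() and shrink_halt_poll_ns() in virt/kvm/kvm_main.c
+handle additional corner cases::
+
+    /* Simplified sketch of the adjustment policy, not the kernel code. */
+    static unsigned int grow_poll_ns(unsigned int old, unsigned int grow,
+                                     unsigned int grow_start, unsigned int max)
+    {
+            /* Start from halt_poll_ns_grow_start when polling was off,
+             * otherwise multiply by halt_poll_ns_grow ... */
+            unsigned int val = old ? old * grow : grow_start;
+
+            /* ... but never exceed the global halt_poll_ns ceiling. */
+            return val > max ? max : val;
+    }
+
+    static unsigned int shrink_poll_ns(unsigned int old, unsigned int shrink)
+    {
+            /* Divide by halt_poll_ns_shrink, or stop polling entirely
+             * when the parameter is 0. */
+            return shrink ? old / shrink : 0;
+    }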
+ +Module Parameters +================= + +The kvm module has 3 tuneable module parameters to adjust the global max +polling interval as well as the rate at which the polling interval is grown and +shrunk. These variables are defined in include/linux/kvm_host.h and as module +parameters in virt/kvm/kvm_main.c, or arch/powerpc/kvm/book3s_hv.c in the +powerpc kvm-hv case. + ++-----------------------+---------------------------+-------------------------+ +|Module Parameter | Description | Default Value | ++-----------------------+---------------------------+-------------------------+ +|halt_poll_ns | The global max polling | KVM_HALT_POLL_NS_DEFAULT| +| | interval which defines | | +| | the ceiling value of the | | +| | polling interval for | (per arch value) | +| | each vcpu. | | ++-----------------------+---------------------------+-------------------------+ +|halt_poll_ns_grow | The value by which the | 2 | +| | halt polling interval is | | +| | multiplied in the | | +| | grow_halt_poll_ns() | | +| | function. | | ++-----------------------+---------------------------+-------------------------+ +|halt_poll_ns_grow_start| The initial value to grow | 10000 | +| | to from zero in the | | +| | grow_halt_poll_ns() | | +| | function. | | ++-----------------------+---------------------------+-------------------------+ +|halt_poll_ns_shrink | The value by which the | 0 | +| | halt polling interval is | | +| | divided in the | | +| | shrink_halt_poll_ns() | | +| | function. | | ++-----------------------+---------------------------+-------------------------+ + +These module parameters can be set from the debugfs files in: + + /sys/module/kvm/parameters/ + +Note: that these module parameters are system wide values and are not able to + be tuned on a per vm basis. + +Further Notes +============= + +- Care should be taken when setting the halt_poll_ns module parameter as a large value + has the potential to drive the cpu usage to 100% on a machine which would be almost + entirely idle otherwise. This is because even if a guest has wakeups during which very + little work is done and which are quite far apart, if the period is shorter than the + global max polling interval (halt_poll_ns) then the host will always poll for the + entire block time and thus cpu utilisation will go to 100%. + +- Halt polling essentially presents a trade off between power usage and latency and + the module parameters should be used to tune the affinity for this. Idle cpu time is + essentially converted to host kernel time with the aim of decreasing latency when + entering the guest. + +- Halt polling will only be conducted by the host when no other tasks are runnable on + that cpu, otherwise the polling will cease immediately and schedule will be invoked to + allow that other task to run. Thus this doesn't allow a guest to denial of service the + cpu. diff --git a/Documentation/virt/kvm/x86/hypercalls.rst b/Documentation/virt/kvm/x86/hypercalls.rst new file mode 100644 index 000000000000..e56fa8b9cfca --- /dev/null +++ b/Documentation/virt/kvm/x86/hypercalls.rst @@ -0,0 +1,192 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +Linux KVM Hypercall +=================== + +X86: + KVM Hypercalls have a three-byte sequence of either the vmcall or the vmmcall + instruction. The hypervisor can replace it with instructions that are + guaranteed to be supported. + + Up to four arguments may be passed in rbx, rcx, rdx, and rsi respectively. 
+ The hypercall number should be placed in rax and the return value will be + placed in rax. No other registers will be clobbered unless explicitly stated + by the particular hypercall. + +S390: + R2-R7 are used for parameters 1-6. In addition, R1 is used for hypercall + number. The return value is written to R2. + + S390 uses diagnose instruction as hypercall (0x500) along with hypercall + number in R1. + + For further information on the S390 diagnose call as supported by KVM, + refer to Documentation/virt/kvm/s390-diag.rst. + +PowerPC: + It uses R3-R10 and hypercall number in R11. R4-R11 are used as output registers. + Return value is placed in R3. + + KVM hypercalls uses 4 byte opcode, that are patched with 'hypercall-instructions' + property inside the device tree's /hypervisor node. + For more information refer to Documentation/virt/kvm/ppc-pv.rst + +MIPS: + KVM hypercalls use the HYPCALL instruction with code 0 and the hypercall + number in $2 (v0). Up to four arguments may be placed in $4-$7 (a0-a3) and + the return value is placed in $2 (v0). + +KVM Hypercalls Documentation +============================ + +The template for each hypercall is: +1. Hypercall name. +2. Architecture(s) +3. Status (deprecated, obsolete, active) +4. Purpose + +1. KVM_HC_VAPIC_POLL_IRQ +------------------------ + +:Architecture: x86 +:Status: active +:Purpose: Trigger guest exit so that the host can check for pending + interrupts on reentry. + +2. KVM_HC_MMU_OP +---------------- + +:Architecture: x86 +:Status: deprecated. +:Purpose: Support MMU operations such as writing to PTE, + flushing TLB, release PT. + +3. KVM_HC_FEATURES +------------------ + +:Architecture: PPC +:Status: active +:Purpose: Expose hypercall availability to the guest. On x86 platforms, cpuid + used to enumerate which hypercalls are available. On PPC, either + device tree based lookup ( which is also what EPAPR dictates) + OR KVM specific enumeration mechanism (which is this hypercall) + can be used. + +4. KVM_HC_PPC_MAP_MAGIC_PAGE +---------------------------- + +:Architecture: PPC +:Status: active +:Purpose: To enable communication between the hypervisor and guest there is a + shared page that contains parts of supervisor visible register state. + The guest can map this shared page to access its supervisor register + through memory using this hypercall. + +5. KVM_HC_KICK_CPU +------------------ + +:Architecture: x86 +:Status: active +:Purpose: Hypercall used to wakeup a vcpu from HLT state +:Usage example: + A vcpu of a paravirtualized guest that is busywaiting in guest + kernel mode for an event to occur (ex: a spinlock to become available) can + execute HLT instruction once it has busy-waited for more than a threshold + time-interval. Execution of HLT instruction would cause the hypervisor to put + the vcpu to sleep until occurrence of an appropriate event. Another vcpu of the + same guest can wakeup the sleeping vcpu by issuing KVM_HC_KICK_CPU hypercall, + specifying APIC ID (a1) of the vcpu to be woken up. An additional argument (a0) + is used in the hypercall for future use. + + +6. KVM_HC_CLOCK_PAIRING +----------------------- +:Architecture: x86 +:Status: active +:Purpose: Hypercall used to synchronize host and guest clocks. + +Usage: + +a0: guest physical address where host copies +"struct kvm_clock_offset" structure. + +a1: clock_type, ATM only KVM_CLOCK_PAIRING_WALLCLOCK (0) +is supported (corresponding to the host's CLOCK_REALTIME clock). 
+ + :: + + struct kvm_clock_pairing { + __s64 sec; + __s64 nsec; + __u64 tsc; + __u32 flags; + __u32 pad[9]; + }; + + Where: + * sec: seconds from clock_type clock. + * nsec: nanoseconds from clock_type clock. + * tsc: guest TSC value used to calculate sec/nsec pair + * flags: flags, unused (0) at the moment. + +The hypercall lets a guest compute a precise timestamp across +host and guest. The guest can use the returned TSC value to +compute the CLOCK_REALTIME for its clock, at the same instant. + +Returns KVM_EOPNOTSUPP if the host does not use TSC clocksource, +or if clock type is different than KVM_CLOCK_PAIRING_WALLCLOCK. + +6. KVM_HC_SEND_IPI +------------------ + +:Architecture: x86 +:Status: active +:Purpose: Send IPIs to multiple vCPUs. + +- a0: lower part of the bitmap of destination APIC IDs +- a1: higher part of the bitmap of destination APIC IDs +- a2: the lowest APIC ID in bitmap +- a3: APIC ICR + +The hypercall lets a guest send multicast IPIs, with at most 128 +128 destinations per hypercall in 64-bit mode and 64 vCPUs per +hypercall in 32-bit mode. The destinations are represented by a +bitmap contained in the first two arguments (a0 and a1). Bit 0 of +a0 corresponds to the APIC ID in the third argument (a2), bit 1 +corresponds to the APIC ID a2+1, and so on. + +Returns the number of CPUs to which the IPIs were delivered successfully. + +7. KVM_HC_SCHED_YIELD +--------------------- + +:Architecture: x86 +:Status: active +:Purpose: Hypercall used to yield if the IPI target vCPU is preempted + +a0: destination APIC ID + +:Usage example: When sending a call-function IPI-many to vCPUs, yield if + any of the IPI target vCPUs was preempted. + +8. KVM_HC_MAP_GPA_RANGE +------------------------- +:Architecture: x86 +:Status: active +:Purpose: Request KVM to map a GPA range with the specified attributes. + +a0: the guest physical address of the start page +a1: the number of (4kb) pages (must be contiguous in GPA space) +a2: attributes + + Where 'attributes' : + * bits 3:0 - preferred page size encoding 0 = 4kb, 1 = 2mb, 2 = 1gb, etc... + * bit 4 - plaintext = 0, encrypted = 1 + * bits 63:5 - reserved (must be zero) + +**Implementation note**: this hypercall is implemented in userspace via +the KVM_CAP_EXIT_HYPERCALL capability. Userspace must enable that capability +before advertising KVM_FEATURE_HC_MAP_GPA_RANGE in the guest CPUID. In +addition, if the guest supports KVM_FEATURE_MIGRATION_CONTROL, userspace +must also set up an MSR filter to process writes to MSR_KVM_MIGRATION_CONTROL. diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst new file mode 100644 index 000000000000..55ede8e070b6 --- /dev/null +++ b/Documentation/virt/kvm/x86/index.rst @@ -0,0 +1,18 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================== +KVM for x86 systems +=================== + +.. toctree:: + :maxdepth: 2 + + amd-memory-encryption + cpuid + halt-polling + hypercalls + mmu + msr + nested-vmx + running-nested-guests + timekeeping diff --git a/Documentation/virt/kvm/x86/mmu.rst b/Documentation/virt/kvm/x86/mmu.rst new file mode 100644 index 000000000000..5b1ebad24c77 --- /dev/null +++ b/Documentation/virt/kvm/x86/mmu.rst @@ -0,0 +1,480 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================== +The x86 kvm shadow mmu +====================== + +The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible +for presenting a standard x86 mmu to the guest, while translating guest +physical addresses to host physical addresses. 
+ +The mmu code attempts to satisfy the following requirements: + +- correctness: + the guest should not be able to determine that it is running + on an emulated mmu except for timing (we attempt to comply + with the specification, not emulate the characteristics of + a particular implementation such as tlb size) +- security: + the guest must not be able to touch host memory not assigned + to it +- performance: + minimize the performance penalty imposed by the mmu +- scaling: + need to scale to large memory and large vcpu guests +- hardware: + support the full range of x86 virtualization hardware +- integration: + Linux memory management code must be in control of guest memory + so that swapping, page migration, page merging, transparent + hugepages, and similar features work without change +- dirty tracking: + report writes to guest memory to enable live migration + and framebuffer-based displays +- footprint: + keep the amount of pinned kernel memory low (most memory + should be shrinkable) +- reliability: + avoid multipage or GFP_ATOMIC allocations + +Acronyms +======== + +==== ==================================================================== +pfn host page frame number +hpa host physical address +hva host virtual address +gfn guest frame number +gpa guest physical address +gva guest virtual address +ngpa nested guest physical address +ngva nested guest virtual address +pte page table entry (used also to refer generically to paging structure + entries) +gpte guest pte (referring to gfns) +spte shadow pte (referring to pfns) +tdp two dimensional paging (vendor neutral term for NPT and EPT) +==== ==================================================================== + +Virtual and real hardware supported +=================================== + +The mmu supports first-generation mmu hardware, which allows an atomic switch +of the current paging mode and cr3 during guest entry, as well as +two-dimensional paging (AMD's NPT and Intel's EPT). The emulated hardware +it exposes is the traditional 2/3/4 level x86 mmu, with support for global +pages, pae, pse, pse36, cr0.wp, and 1GB pages. Emulated hardware also +able to expose NPT capable hardware on NPT capable hosts. + +Translation +=========== + +The primary job of the mmu is to program the processor's mmu to translate +addresses for the guest. Different translations are required at different +times: + +- when guest paging is disabled, we translate guest physical addresses to + host physical addresses (gpa->hpa) +- when guest paging is enabled, we translate guest virtual addresses, to + guest physical addresses, to host physical addresses (gva->gpa->hpa) +- when the guest launches a guest of its own, we translate nested guest + virtual addresses, to nested guest physical addresses, to guest physical + addresses, to host physical addresses (ngva->ngpa->gpa->hpa) + +The primary challenge is to encode between 1 and 3 translations into hardware +that support only 1 (traditional) and 2 (tdp) translations. When the +number of required translations matches the hardware, the mmu operates in +direct mode; otherwise it operates in shadow mode (see below). + +Memory +====== + +Guest memory (gpa) is part of the user address space of the process that is +using kvm. Userspace defines the translation between guest addresses and user +addresses (gpa->hva); note that two gpas may alias to the same hva, but not +vice versa. + +These hvas may be backed using any method available to the host: anonymous +memory, file backed memory, and device memory. 
Memory might be paged by the +host at any time. + +Events +====== + +The mmu is driven by events, some from the guest, some from the host. + +Guest generated events: + +- writes to control registers (especially cr3) +- invlpg/invlpga instruction execution +- access to missing or protected translations + +Host generated events: + +- changes in the gpa->hpa translation (either through gpa->hva changes or + through hva->hpa changes) +- memory pressure (the shrinker) + +Shadow pages +============ + +The principal data structure is the shadow page, 'struct kvm_mmu_page'. A +shadow page contains 512 sptes, which can be either leaf or nonleaf sptes. A +shadow page may contain a mix of leaf and nonleaf sptes. + +A nonleaf spte allows the hardware mmu to reach the leaf pages and +is not related to a translation directly. It points to other shadow pages. + +A leaf spte corresponds to either one or two translations encoded into +one paging structure entry. These are always the lowest level of the +translation stack, with optional higher level translations left to NPT/EPT. +Leaf ptes point at guest pages. + +The following table shows translations encoded by leaf ptes, with higher-level +translations in parentheses: + + Non-nested guests:: + + nonpaging: gpa->hpa + paging: gva->gpa->hpa + paging, tdp: (gva->)gpa->hpa + + Nested guests:: + + non-tdp: ngva->gpa->hpa (*) + tdp: (ngva->)ngpa->gpa->hpa + + (*) the guest hypervisor will encode the ngva->gpa translation into its page + tables if npt is not present + +Shadow pages contain the following information: + role.level: + The level in the shadow paging hierarchy that this shadow page belongs to. + 1=4k sptes, 2=2M sptes, 3=1G sptes, etc. + role.direct: + If set, leaf sptes reachable from this page are for a linear range. + Examples include real mode translation, large guest pages backed by small + host pages, and gpa->hpa translations when NPT or EPT is active. + The linear range starts at (gfn << PAGE_SHIFT) and its size is determined + by role.level (2MB for first level, 1GB for second level, 0.5TB for third + level, 256TB for fourth level) + If clear, this page corresponds to a guest page table denoted by the gfn + field. + role.quadrant: + When role.has_4_byte_gpte=1, the guest uses 32-bit gptes while the host uses 64-bit + sptes. That means a guest page table contains more ptes than the host, + so multiple shadow pages are needed to shadow one guest page. + For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the + first or second 512-gpte block in the guest page table. For second-level + page tables, each 32-bit gpte is converted to two 64-bit sptes + (since each first-level guest page is shadowed by two first-level + shadow pages) so role.quadrant takes values in the range 0..3. Each + quadrant maps 1GB virtual address space. + role.access: + Inherited guest access permissions from the parent ptes in the form uwx. + Note execute permission is positive, not negative. + role.invalid: + The page is invalid and should not be used. It is a root page that is + currently pinned (by a cpu hardware register pointing to it); once it is + unpinned it will be destroyed. + role.has_4_byte_gpte: + Reflects the size of the guest PTE for which the page is valid, i.e. '0' + if direct map or 64-bit gptes are in use, '1' if 32-bit gptes are in use. + role.efer_nx: + Contains the value of efer.nx for which the page is valid. + role.cr0_wp: + Contains the value of cr0.wp for which the page is valid. 
+ role.smep_andnot_wp: + Contains the value of cr4.smep && !cr0.wp for which the page is valid + (pages for which this is true are different from other pages; see the + treatment of cr0.wp=0 below). + role.smap_andnot_wp: + Contains the value of cr4.smap && !cr0.wp for which the page is valid + (pages for which this is true are different from other pages; see the + treatment of cr0.wp=0 below). + role.smm: + Is 1 if the page is valid in system management mode. This field + determines which of the kvm_memslots array was used to build this + shadow page; it is also used to go back from a struct kvm_mmu_page + to a memslot, through the kvm_memslots_for_spte_role macro and + __gfn_to_memslot. + role.ad_disabled: + Is 1 if the MMU instance cannot use A/D bits. EPT did not have A/D + bits before Haswell; shadow EPT page tables also cannot use A/D bits + if the L1 hypervisor does not enable them. + gfn: + Either the guest page table containing the translations shadowed by this + page, or the base page frame for linear translations. See role.direct. + spt: + A pageful of 64-bit sptes containing the translations for this page. + Accessed by both kvm and hardware. + The page pointed to by spt will have its page->private pointing back + at the shadow page structure. + sptes in spt point either at guest pages, or at lower-level shadow pages. + Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point + at __pa(sp2->spt). sp2 will point back at sp1 through parent_pte. + The spt array forms a DAG structure with the shadow page as a node, and + guest pages as leaves. + gfns: + An array of 512 guest frame numbers, one for each present pte. Used to + perform a reverse map from a pte to a gfn. When role.direct is set, any + element of this array can be calculated from the gfn field when used, in + this case, the array of gfns is not allocated. See role.direct and gfn. + root_count: + A counter keeping track of how many hardware registers (guest cr3 or + pdptrs) are now pointing at the page. While this counter is nonzero, the + page cannot be destroyed. See role.invalid. + parent_ptes: + The reverse mapping for the pte/ptes pointing at this page's spt. If + parent_ptes bit 0 is zero, only one spte points at this page and + parent_ptes points at this single spte, otherwise, there exists multiple + sptes pointing at this page and (parent_ptes & ~0x1) points at a data + structure with a list of parent sptes. + unsync: + If true, then the translations in this page may not match the guest's + translation. This is equivalent to the state of the tlb when a pte is + changed but before the tlb entry is flushed. Accordingly, unsync ptes + are synchronized when the guest executes invlpg or flushes its tlb by + other means. Valid for leaf pages. + unsync_children: + How many sptes in the page point at pages that are unsync (or have + unsynchronized children). + unsync_child_bitmap: + A bitmap indicating which sptes in spt point (directly or indirectly) at + pages that may be unsynchronized. Used to quickly locate all unsychronized + pages reachable from a given page. + clear_spte_count: + Only present on 32-bit hosts, where a 64-bit spte cannot be written + atomically. The reader uses this while running out of the MMU lock + to detect in-progress updates and retry them until the writer has + finished the write. + write_flooding_count: + A guest may write to a page table many times, causing a lot of + emulations if the page needs to be write-protected (see "Synchronized + and unsynchronized pages" below). 
Leaf pages can be unsynchronized + so that they do not trigger frequent emulation, but this is not + possible for non-leafs. This field counts the number of emulations + since the last time the page table was actually used; if emulation + is triggered too frequently on this page, KVM will unmap the page + to avoid emulation in the future. + +Reverse map +=========== + +The mmu maintains a reverse mapping whereby all ptes mapping a page can be +reached given its gfn. This is used, for example, when swapping out a page. + +Synchronized and unsynchronized pages +===================================== + +The guest uses two events to synchronize its tlb and page tables: tlb flushes +and page invalidations (invlpg). + +A tlb flush means that we need to synchronize all sptes reachable from the +guest's cr3. This is expensive, so we keep all guest page tables write +protected, and synchronize sptes to gptes when a gpte is written. + +A special case is when a guest page table is reachable from the current +guest cr3. In this case, the guest is obliged to issue an invlpg instruction +before using the translation. We take advantage of that by removing write +protection from the guest page, and allowing the guest to modify it freely. +We synchronize modified gptes when the guest invokes invlpg. This reduces +the amount of emulation we have to do when the guest modifies multiple gptes, +or when the a guest page is no longer used as a page table and is used for +random guest data. + +As a side effect we have to resynchronize all reachable unsynchronized shadow +pages on a tlb flush. + + +Reaction to events +================== + +- guest page fault (or npt page fault, or ept violation) + +This is the most complicated event. The cause of a page fault can be: + + - a true guest fault (the guest translation won't allow the access) (*) + - access to a missing translation + - access to a protected translation + - when logging dirty pages, memory is write protected + - synchronized shadow pages are write protected (*) + - access to untranslatable memory (mmio) + + (*) not applicable in direct mode + +Handling a page fault is performed as follows: + + - if the RSV bit of the error code is set, the page fault is caused by guest + accessing MMIO and cached MMIO information is available. + + - walk shadow page table + - check for valid generation number in the spte (see "Fast invalidation of + MMIO sptes" below) + - cache the information to vcpu->arch.mmio_gva, vcpu->arch.mmio_access and + vcpu->arch.mmio_gfn, and call the emulator + + - If both P bit and R/W bit of error code are set, this could possibly + be handled as a "fast page fault" (fixed without taking the MMU lock). See + the description in Documentation/virt/kvm/locking.rst. 
+ + - if needed, walk the guest page tables to determine the guest translation + (gva->gpa or ngpa->gpa) + + - if permissions are insufficient, reflect the fault back to the guest + + - determine the host page + + - if this is an mmio request, there is no host page; cache the info to + vcpu->arch.mmio_gva, vcpu->arch.mmio_access and vcpu->arch.mmio_gfn + + - walk the shadow page table to find the spte for the translation, + instantiating missing intermediate page tables as necessary + + - If this is an mmio request, cache the mmio info to the spte and set some + reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask) + + - try to unsynchronize the page + + - if successful, we can let the guest continue and modify the gpte + + - emulate the instruction + + - if failed, unshadow the page and let the guest continue + + - update any translations that were modified by the instruction + +invlpg handling: + + - walk the shadow page hierarchy and drop affected translations + - try to reinstantiate the indicated translation in the hope that the + guest will use it in the near future + +Guest control register updates: + +- mov to cr3 + + - look up new shadow roots + - synchronize newly reachable shadow pages + +- mov to cr0/cr4/efer + + - set up mmu context for new paging mode + - look up new shadow roots + - synchronize newly reachable shadow pages + +Host translation updates: + + - mmu notifier called with updated hva + - look up affected sptes through reverse map + - drop (or update) translations + +Emulating cr0.wp +================ + +If tdp is not enabled, the host must keep cr0.wp=1 so page write protection +works for the guest kernel, not guest guest userspace. When the guest +cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0, +we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the +semantics require allowing any guest kernel access plus user read access). + +We handle this by mapping the permissions to two possible sptes, depending +on fault type: + +- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access, + disallows user access) +- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel + write access) + +(user write faults generate a #PF) + +In the first case there are two additional complications: + +- if CR4.SMEP is enabled: since we've turned the page into a kernel page, + the kernel may now execute it. We handle this by also setting spte.nx. + If we get a user fetch or read fault, we'll change spte.u=1 and + spte.nx=gpte.nx back. For this to work, KVM forces EFER.NX to 1 when + shadow paging is in use. +- if CR4.SMAP is disabled: since the page has been changed to a kernel + page, it can not be reused when CR4.SMAP is enabled. We set + CR4.SMAP && !CR0.WP into shadow page's role to avoid this case. Note, + here we do not care the case that CR4.SMAP is enabled since KVM will + directly inject #PF to guest due to failed permission check. + +To prevent an spte that was converted into a kernel page with cr0.wp=0 +from being written by the kernel after cr0.wp has changed to 1, we make +the value of cr0.wp part of the page role. This means that an spte created +with one value of cr0.wp cannot be used when cr0.wp has a different value - +it will simply be missed by the shadow page lookup code. A similar issue +exists when an spte created with cr0.wp=0 and cr4.smep=0 is used after +changing cr4.smep to 1. To avoid this, the value of !cr0.wp && cr4.smep +is also made a part of the page role. 
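+
+As an illustration only (a simplified sketch rather than KVM's actual
+code, with made-up helper names), the choice between the two spte
+mappings above can be written as a function of the faulting access::
+
+    #include <stdbool.h>
+
+    struct spte_perms {
+        bool user;    /* spte.u */
+        bool write;   /* spte.w */
+        bool nx;      /* spte.nx */
+    };
+
+    /*
+     * The guest runs with cr0.wp=0 and the gpte is user-readable but
+     * read-only (gpte.u=1, gpte.w=0): pick spte permissions from the
+     * type of the faulting access.  User write faults simply take a #PF.
+     */
+    static struct spte_perms wp0_spte_perms(bool kernel_write_fault,
+                                            bool cr4_smep, bool gpte_nx)
+    {
+        struct spte_perms p = { .nx = gpte_nx };
+
+        if (kernel_write_fault) {
+            /* Kernel write: full kernel access, no user access. */
+            p.user = false;
+            p.write = true;
+            /*
+             * The page has become a "kernel" page; with SMEP enabled it
+             * must not be executable until a later fetch or read fault
+             * restores spte.u=1 and spte.nx=gpte.nx.
+             */
+            if (cr4_smep)
+                p.nx = true;
+        } else {
+            /* Read fault: full read access, no write access. */
+            p.user = true;
+            p.write = false;
+        }
+        return p;
+    }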
+ +Large pages +=========== + +The mmu supports all combinations of large and small guest and host pages. +Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as +two separate 2M pages, on both guest and host, since the mmu always uses PAE +paging. + +To instantiate a large spte, four constraints must be satisfied: + +- the spte must point to a large host page +- the guest pte must be a large pte of at least equivalent size (if tdp is + enabled, there is no guest pte and this condition is satisfied) +- if the spte will be writeable, the large page frame may not overlap any + write-protected pages +- the guest page must be wholly contained by a single memory slot + +To check the last two conditions, the mmu maintains a ->disallow_lpage set of +arrays for each memory slot and large page size. Every write protected page +causes its disallow_lpage to be incremented, thus preventing instantiation of +a large spte. The frames at the end of an unaligned memory slot have +artificially inflated ->disallow_lpages so they can never be instantiated. + +Fast invalidation of MMIO sptes +=============================== + +As mentioned in "Reaction to events" above, kvm will cache MMIO +information in leaf sptes. When a new memslot is added or an existing +memslot is changed, this information may become stale and needs to be +invalidated. This also needs to hold the MMU lock while walking all +shadow pages, and is made more scalable with a similar technique. + +MMIO sptes have a few spare bits, which are used to store a +generation number. The global generation number is stored in +kvm_memslots(kvm)->generation, and increased whenever guest memory info +changes. + +When KVM finds an MMIO spte, it checks the generation number of the spte. +If the generation number of the spte does not equal the global generation +number, it will ignore the cached MMIO information and handle the page +fault through the slow path. + +Since only 18 bits are used to store generation-number on mmio spte, all +pages are zapped when there is an overflow. + +Unfortunately, a single memory access might access kvm_memslots(kvm) multiple +times, the last one happening when the generation number is retrieved and +stored into the MMIO spte. Thus, the MMIO spte might be created based on +out-of-date information, but with an up-to-date generation number. + +To avoid this, the generation number is incremented again after synchronize_srcu +returns; thus, bit 63 of kvm_memslots(kvm)->generation set to 1 only during a +memslot update, while some SRCU readers might be using the old copy. We do not +want to use an MMIO sptes created with an odd generation number, and we can do +this without losing a bit in the MMIO spte. The "update in-progress" bit of the +generation is not stored in MMIO spte, and is so is implicitly zero when the +generation is extracted out of the spte. If KVM is unlucky and creates an MMIO +spte while an update is in-progress, the next access to the spte will always be +a cache miss. For example, a subsequent access during the update window will +miss due to the in-progress flag diverging, while an access after the update +window closes will have a higher generation number (as compared to the spte). 
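+
+The sketch below illustrates the idea of caching the generation in spare
+spte bits and re-checking it on the next fault.  The bit positions are
+hypothetical and the "update in-progress" handling described above is
+omitted; the real encoding lives in KVM's MMU code::
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    #define MMIO_GEN_BITS   18
+    #define MMIO_GEN_MASK   ((1ull << MMIO_GEN_BITS) - 1)
+    #define MMIO_GEN_SHIFT  43  /* placeholder position of the spare bits */
+
+    /* Build an MMIO spte that caches the current memslots generation. */
+    static uint64_t make_mmio_spte(uint64_t mmio_info, uint64_t memslots_gen)
+    {
+        return mmio_info |
+               ((memslots_gen & MMIO_GEN_MASK) << MMIO_GEN_SHIFT);
+    }
+
+    /*
+     * On a fault against an MMIO spte, the cached information may only be
+     * used if the generation still matches; otherwise the fault takes the
+     * slow path, which walks the memslots and rebuilds the spte.
+     */
+    static bool mmio_spte_is_fresh(uint64_t spte, uint64_t memslots_gen)
+    {
+        uint64_t cached = (spte >> MMIO_GEN_SHIFT) & MMIO_GEN_MASK;
+
+        return cached == (memslots_gen & MMIO_GEN_MASK);
+    }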
+ + +Further reading +=============== + +- NPT presentation from KVM Forum 2008 + https://www.linux-kvm.org/images/c/c8/KvmForum2008%24kdf2008_21.pdf diff --git a/Documentation/virt/kvm/x86/msr.rst b/Documentation/virt/kvm/x86/msr.rst new file mode 100644 index 000000000000..9315fc385fb0 --- /dev/null +++ b/Documentation/virt/kvm/x86/msr.rst @@ -0,0 +1,391 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +KVM-specific MSRs +================= + +:Author: Glauber Costa , Red Hat Inc, 2010 + +KVM makes use of some custom MSRs to service some requests. + +Custom MSRs have a range reserved for them, that goes from +0x4b564d00 to 0x4b564dff. There are MSRs outside this area, +but they are deprecated and their use is discouraged. + +Custom MSR list +--------------- + +The current supported Custom MSR list is: + +MSR_KVM_WALL_CLOCK_NEW: + 0x4b564d00 + +data: + 4-byte alignment physical address of a memory area which must be + in guest RAM. This memory is expected to hold a copy of the following + structure:: + + struct pvclock_wall_clock { + u32 version; + u32 sec; + u32 nsec; + } __attribute__((__packed__)); + + whose data will be filled in by the hypervisor. The hypervisor is only + guaranteed to update this data at the moment of MSR write. + Users that want to reliably query this information more than once have + to write more than once to this MSR. Fields have the following meanings: + + version: + guest has to check version before and after grabbing + time information and check that they are both equal and even. + An odd version indicates an in-progress update. + + sec: + number of seconds for wallclock at time of boot. + + nsec: + number of nanoseconds for wallclock at time of boot. + + In order to get the current wallclock time, the system_time from + MSR_KVM_SYSTEM_TIME_NEW needs to be added. + + Note that although MSRs are per-CPU entities, the effect of this + particular MSR is global. + + Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid + leaf prior to usage. + +MSR_KVM_SYSTEM_TIME_NEW: + 0x4b564d01 + +data: + 4-byte aligned physical address of a memory area which must be in + guest RAM, plus an enable bit in bit 0. This memory is expected to hold + a copy of the following structure:: + + struct pvclock_vcpu_time_info { + u32 version; + u32 pad0; + u64 tsc_timestamp; + u64 system_time; + u32 tsc_to_system_mul; + s8 tsc_shift; + u8 flags; + u8 pad[2]; + } __attribute__((__packed__)); /* 32 bytes */ + + whose data will be filled in by the hypervisor periodically. Only one + write, or registration, is needed for each VCPU. The interval between + updates of this structure is arbitrary and implementation-dependent. + The hypervisor may update this structure at any time it sees fit until + anything with bit0 == 0 is written to it. + + Fields have the following meanings: + + version: + guest has to check version before and after grabbing + time information and check that they are both equal and even. + An odd version indicates an in-progress update. + + tsc_timestamp: + the tsc value at the current VCPU at the time + of the update of this structure. Guests can subtract this value + from current tsc to derive a notion of elapsed time since the + structure update. + + system_time: + a host notion of monotonic time, including sleep + time at the time this structure was last updated. Unit is + nanoseconds. 
+ + tsc_to_system_mul: + multiplier to be used when converting + tsc-related quantity to nanoseconds + + tsc_shift: + shift to be used when converting tsc-related + quantity to nanoseconds. This shift will ensure that + multiplication with tsc_to_system_mul does not overflow. + A positive value denotes a left shift, a negative value + a right shift. + + The conversion from tsc to nanoseconds involves an additional + right shift by 32 bits. With this information, guests can + derive per-CPU time by doing:: + + time = (current_tsc - tsc_timestamp) + if (tsc_shift >= 0) + time <<= tsc_shift; + else + time >>= -tsc_shift; + time = (time * tsc_to_system_mul) >> 32 + time = time + system_time + + flags: + bits in this field indicate extended capabilities + coordinated between the guest and the hypervisor. Availability + of specific flags has to be checked in 0x40000001 cpuid leaf. + Current flags are: + + + +-----------+--------------+----------------------------------+ + | flag bit | cpuid bit | meaning | + +-----------+--------------+----------------------------------+ + | | | time measures taken across | + | 0 | 24 | multiple cpus are guaranteed to | + | | | be monotonic | + +-----------+--------------+----------------------------------+ + | | | guest vcpu has been paused by | + | 1 | N/A | the host | + | | | See 4.70 in api.txt | + +-----------+--------------+----------------------------------+ + + Availability of this MSR must be checked via bit 3 in 0x4000001 cpuid + leaf prior to usage. + + +MSR_KVM_WALL_CLOCK: + 0x11 + +data and functioning: + same as MSR_KVM_WALL_CLOCK_NEW. Use that instead. + + This MSR falls outside the reserved KVM range and may be removed in the + future. Its usage is deprecated. + + Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid + leaf prior to usage. + +MSR_KVM_SYSTEM_TIME: + 0x12 + +data and functioning: + same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead. + + This MSR falls outside the reserved KVM range and may be removed in the + future. Its usage is deprecated. + + Availability of this MSR must be checked via bit 0 in 0x4000001 cpuid + leaf prior to usage. + + The suggested algorithm for detecting kvmclock presence is then:: + + if (!kvm_para_available()) /* refer to cpuid.txt */ + return NON_PRESENT; + + flags = cpuid_eax(0x40000001); + if (flags & 3) { + msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW; + msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW; + return PRESENT; + } else if (flags & 0) { + msr_kvm_system_time = MSR_KVM_SYSTEM_TIME; + msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK; + return PRESENT; + } else + return NON_PRESENT; + +MSR_KVM_ASYNC_PF_EN: + 0x4b564d02 + +data: + Asynchronous page fault (APF) control MSR. + + Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area + which must be in guest RAM and must be zeroed. This memory is expected + to hold a copy of the following structure:: + + struct kvm_vcpu_pv_apf_data { + /* Used for 'page not present' events delivered via #PF */ + __u32 flags; + + /* Used for 'page ready' events delivered via interrupt notification */ + __u32 token; + + __u8 pad[56]; + __u32 enabled; + }; + + Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1 + when asynchronous page faults are enabled on the vcpu, 0 when disabled. + Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in + cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as + #PF vmexits. Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is + present in CPUID. 
Bit 3 enables interrupt based delivery of 'page ready' + events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in + CPUID. + + 'Page not present' events are currently always delivered as synthetic + #PF exception. During delivery of these events APF CR2 register contains + a token that will be used to notify the guest when missing page becomes + available. Also, to make it possible to distinguish between real #PF and + APF, first 4 bytes of 64 byte memory location ('flags') will be written + to by the hypervisor at the time of injection. Only first bit of 'flags' + is currently supported, when set, it indicates that the guest is dealing + with asynchronous 'page not present' event. If during a page fault APF + 'flags' is '0' it means that this is regular page fault. Guest is + supposed to clear 'flags' when it is done handling #PF exception so the + next event can be delivered. + + Note, since APF 'page not present' events use the same exception vector + as regular page fault, guest must reset 'flags' to '0' before it does + something that can generate normal page fault. + + Bytes 5-7 of 64 byte memory location ('token') will be written to by the + hypervisor at the time of APF 'page ready' event injection. The content + of these bytes is a token which was previously delivered as 'page not + present' event. The event indicates the page in now available. Guest is + supposed to write '0' to 'token' when it is done handling 'page ready' + event and to write 1' to MSR_KVM_ASYNC_PF_ACK after clearing the location; + writing to the MSR forces KVM to re-scan its queue and deliver the next + pending notification. + + Note, MSR_KVM_ASYNC_PF_INT MSR specifying the interrupt vector for 'page + ready' APF delivery needs to be written to before enabling APF mechanism + in MSR_KVM_ASYNC_PF_EN or interrupt #0 can get injected. The MSR is + available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID. + + Note, previously, 'page ready' events were delivered via the same #PF + exception as 'page not present' events but this is now deprecated. If + bit 3 (interrupt based delivery) is not set APF events are not delivered. + + If APF is disabled while there are outstanding APFs, they will + not be delivered. + + Currently 'page ready' APF events will be always delivered on the + same vcpu as 'page not present' event was, but guest should not rely on + that. + +MSR_KVM_STEAL_TIME: + 0x4b564d03 + +data: + 64-byte alignment physical address of a memory area which must be + in guest RAM, plus an enable bit in bit 0. This memory is expected to + hold a copy of the following structure:: + + struct kvm_steal_time { + __u64 steal; + __u32 version; + __u32 flags; + __u8 preempted; + __u8 u8_pad[3]; + __u32 pad[11]; + } + + whose data will be filled in by the hypervisor periodically. Only one + write, or registration, is needed for each VCPU. The interval between + updates of this structure is arbitrary and implementation-dependent. + The hypervisor may update this structure at any time it sees fit until + anything with bit0 == 0 is written to it. Guest is required to make sure + this structure is initialized to zero. + + Fields have the following meanings: + + version: + a sequence counter. In other words, guest has to check + this field before and after grabbing time information and make + sure they are both equal and even. An odd version indicates an + in-progress update. + + flags: + At this point, always zero. May be used to indicate + changes in this structure in the future. 
+ + steal: + the amount of time in which this vCPU did not run, in + nanoseconds. Time during which the vcpu is idle, will not be + reported as steal time. + + preempted: + indicate the vCPU who owns this struct is running or + not. Non-zero values mean the vCPU has been preempted. Zero + means the vCPU is not preempted. NOTE, it is always zero if the + the hypervisor doesn't support this field. + +MSR_KVM_EOI_EN: + 0x4b564d04 + +data: + Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0 + when disabled. Bit 1 is reserved and must be zero. When PV end of + interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned + physical address of a 4 byte memory area which must be in guest RAM and + must be zeroed. + + The first, least significant bit of 4 byte memory location will be + written to by the hypervisor, typically at the time of interrupt + injection. Value of 1 means that guest can skip writing EOI to the apic + (using MSR or MMIO write); instead, it is sufficient to signal + EOI by clearing the bit in guest memory - this location will + later be polled by the hypervisor. + Value of 0 means that the EOI write is required. + + It is always safe for the guest to ignore the optimization and perform + the APIC EOI write anyway. + + Hypervisor is guaranteed to only modify this least + significant bit while in the current VCPU context, this means that + guest does not need to use either lock prefix or memory ordering + primitives to synchronise with the hypervisor. + + However, hypervisor can set and clear this memory bit at any time: + therefore to make sure hypervisor does not interrupt the + guest and clear the least significant bit in the memory area + in the window between guest testing it to detect + whether it can skip EOI apic write and between guest + clearing it to signal EOI to the hypervisor, + guest must both read the least significant bit in the memory area and + clear it using a single CPU instruction, such as test and clear, or + compare and exchange. + +MSR_KVM_POLL_CONTROL: + 0x4b564d05 + + Control host-side polling. + +data: + Bit 0 enables (1) or disables (0) host-side HLT polling logic. + + KVM guests can request the host not to poll on HLT, for example if + they are performing polling themselves. + +MSR_KVM_ASYNC_PF_INT: + 0x4b564d06 + +data: + Second asynchronous page fault (APF) control MSR. + + Bits 0-7: APIC vector for delivery of 'page ready' APF events. + Bits 8-63: Reserved + + Interrupt vector for asynchnonous 'page ready' notifications delivery. + The vector has to be set up before asynchronous page fault mechanism + is enabled in MSR_KVM_ASYNC_PF_EN. The MSR is only available if + KVM_FEATURE_ASYNC_PF_INT is present in CPUID. + +MSR_KVM_ASYNC_PF_ACK: + 0x4b564d07 + +data: + Asynchronous page fault (APF) acknowledgment. + + When the guest is done processing 'page ready' APF event and 'token' + field in 'struct kvm_vcpu_pv_apf_data' is cleared it is supposed to + write '1' to bit 0 of the MSR, this causes the host to re-scan its queue + and check if there are more notifications pending. The MSR is available + if KVM_FEATURE_ASYNC_PF_INT is present in CPUID. + +MSR_KVM_MIGRATION_CONTROL: + 0x4b564d08 + +data: + This MSR is available if KVM_FEATURE_MIGRATION_CONTROL is present in + CPUID. Bit 0 represents whether live migration of the guest is allowed. + + When a guest is started, bit 0 will be 0 if the guest has encrypted + memory and 1 if the guest does not have encrypted memory. 
If the + guest is communicating page encryption status to the host using the + ``KVM_HC_MAP_GPA_RANGE`` hypercall, it can set bit 0 in this MSR to + allow live migration of the guest. diff --git a/Documentation/virt/kvm/x86/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst new file mode 100644 index 000000000000..ac2095d41f02 --- /dev/null +++ b/Documentation/virt/kvm/x86/nested-vmx.rst @@ -0,0 +1,244 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========== +Nested VMX +========== + +Overview +--------- + +On Intel processors, KVM uses Intel's VMX (Virtual-Machine eXtensions) +to easily and efficiently run guest operating systems. Normally, these guests +*cannot* themselves be hypervisors running their own guests, because in VMX, +guests cannot use VMX instructions. + +The "Nested VMX" feature adds this missing capability - of running guest +hypervisors (which use VMX) with their own nested guests. It does so by +allowing a guest to use VMX instructions, and correctly and efficiently +emulating them using the single level of VMX available in the hardware. + +We describe in much greater detail the theory behind the nested VMX feature, +its implementation and its performance characteristics, in the OSDI 2010 paper +"The Turtles Project: Design and Implementation of Nested Virtualization", +available at: + + https://www.usenix.org/events/osdi10/tech/full_papers/Ben-Yehuda.pdf + + +Terminology +----------- + +Single-level virtualization has two levels - the host (KVM) and the guests. +In nested virtualization, we have three levels: The host (KVM), which we call +L0, the guest hypervisor, which we call L1, and its nested guest, which we +call L2. + + +Running nested VMX +------------------ + +The nested VMX feature is enabled by default since Linux kernel v4.20. For +older Linux kernel, it can be enabled by giving the "nested=1" option to the +kvm-intel module. + + +No modifications are required to user space (qemu). However, qemu's default +emulated CPU type (qemu64) does not list the "VMX" CPU feature, so it must be +explicitly enabled, by giving qemu one of the following options: + + - cpu host (emulated CPU has all features of the real CPU) + + - cpu qemu64,+vmx (add just the vmx feature to a named CPU type) + + +ABIs +---- + +Nested VMX aims to present a standard and (eventually) fully-functional VMX +implementation for the a guest hypervisor to use. As such, the official +specification of the ABI that it provides is Intel's VMX specification, +namely volume 3B of their "Intel 64 and IA-32 Architectures Software +Developer's Manual". Not all of VMX's features are currently fully supported, +but the goal is to eventually support them all, starting with the VMX features +which are used in practice by popular hypervisors (KVM and others). + +As a VMX implementation, nested VMX presents a VMCS structure to L1. +As mandated by the spec, other than the two fields revision_id and abort, +this structure is *opaque* to its user, who is not supposed to know or care +about its internal structure. Rather, the structure is accessed through the +VMREAD and VMWRITE instructions. +Still, for debugging purposes, KVM developers might be interested to know the +internals of this structure; This is struct vmcs12 from arch/x86/kvm/vmx.c. + +The name "vmcs12" refers to the VMCS that L1 builds for L2. In the code we +also have "vmcs01", the VMCS that L0 built for L1, and "vmcs02" is the VMCS +which L0 builds to actually run L2 - how this is done is explained in the +aforementioned paper. 
+ +For convenience, we repeat the content of struct vmcs12 here. If the internals +of this structure changes, this can break live migration across KVM versions. +VMCS12_REVISION (from vmx.c) should be changed if struct vmcs12 or its inner +struct shadow_vmcs is ever changed. + +:: + + typedef u64 natural_width; + struct __packed vmcs12 { + /* According to the Intel spec, a VMCS region must start with + * these two user-visible fields */ + u32 revision_id; + u32 abort; + + u32 launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */ + u32 padding[7]; /* room for future expansion */ + + u64 io_bitmap_a; + u64 io_bitmap_b; + u64 msr_bitmap; + u64 vm_exit_msr_store_addr; + u64 vm_exit_msr_load_addr; + u64 vm_entry_msr_load_addr; + u64 tsc_offset; + u64 virtual_apic_page_addr; + u64 apic_access_addr; + u64 ept_pointer; + u64 guest_physical_address; + u64 vmcs_link_pointer; + u64 guest_ia32_debugctl; + u64 guest_ia32_pat; + u64 guest_ia32_efer; + u64 guest_pdptr0; + u64 guest_pdptr1; + u64 guest_pdptr2; + u64 guest_pdptr3; + u64 host_ia32_pat; + u64 host_ia32_efer; + u64 padding64[8]; /* room for future expansion */ + natural_width cr0_guest_host_mask; + natural_width cr4_guest_host_mask; + natural_width cr0_read_shadow; + natural_width cr4_read_shadow; + natural_width dead_space[4]; /* Last remnants of cr3_target_value[0-3]. */ + natural_width exit_qualification; + natural_width guest_linear_address; + natural_width guest_cr0; + natural_width guest_cr3; + natural_width guest_cr4; + natural_width guest_es_base; + natural_width guest_cs_base; + natural_width guest_ss_base; + natural_width guest_ds_base; + natural_width guest_fs_base; + natural_width guest_gs_base; + natural_width guest_ldtr_base; + natural_width guest_tr_base; + natural_width guest_gdtr_base; + natural_width guest_idtr_base; + natural_width guest_dr7; + natural_width guest_rsp; + natural_width guest_rip; + natural_width guest_rflags; + natural_width guest_pending_dbg_exceptions; + natural_width guest_sysenter_esp; + natural_width guest_sysenter_eip; + natural_width host_cr0; + natural_width host_cr3; + natural_width host_cr4; + natural_width host_fs_base; + natural_width host_gs_base; + natural_width host_tr_base; + natural_width host_gdtr_base; + natural_width host_idtr_base; + natural_width host_ia32_sysenter_esp; + natural_width host_ia32_sysenter_eip; + natural_width host_rsp; + natural_width host_rip; + natural_width paddingl[8]; /* room for future expansion */ + u32 pin_based_vm_exec_control; + u32 cpu_based_vm_exec_control; + u32 exception_bitmap; + u32 page_fault_error_code_mask; + u32 page_fault_error_code_match; + u32 cr3_target_count; + u32 vm_exit_controls; + u32 vm_exit_msr_store_count; + u32 vm_exit_msr_load_count; + u32 vm_entry_controls; + u32 vm_entry_msr_load_count; + u32 vm_entry_intr_info_field; + u32 vm_entry_exception_error_code; + u32 vm_entry_instruction_len; + u32 tpr_threshold; + u32 secondary_vm_exec_control; + u32 vm_instruction_error; + u32 vm_exit_reason; + u32 vm_exit_intr_info; + u32 vm_exit_intr_error_code; + u32 idt_vectoring_info_field; + u32 idt_vectoring_error_code; + u32 vm_exit_instruction_len; + u32 vmx_instruction_info; + u32 guest_es_limit; + u32 guest_cs_limit; + u32 guest_ss_limit; + u32 guest_ds_limit; + u32 guest_fs_limit; + u32 guest_gs_limit; + u32 guest_ldtr_limit; + u32 guest_tr_limit; + u32 guest_gdtr_limit; + u32 guest_idtr_limit; + u32 guest_es_ar_bytes; + u32 guest_cs_ar_bytes; + u32 guest_ss_ar_bytes; + u32 guest_ds_ar_bytes; + u32 guest_fs_ar_bytes; + u32 guest_gs_ar_bytes; 
+ u32 guest_ldtr_ar_bytes; + u32 guest_tr_ar_bytes; + u32 guest_interruptibility_info; + u32 guest_activity_state; + u32 guest_sysenter_cs; + u32 host_ia32_sysenter_cs; + u32 padding32[8]; /* room for future expansion */ + u16 virtual_processor_id; + u16 guest_es_selector; + u16 guest_cs_selector; + u16 guest_ss_selector; + u16 guest_ds_selector; + u16 guest_fs_selector; + u16 guest_gs_selector; + u16 guest_ldtr_selector; + u16 guest_tr_selector; + u16 host_es_selector; + u16 host_cs_selector; + u16 host_ss_selector; + u16 host_ds_selector; + u16 host_fs_selector; + u16 host_gs_selector; + u16 host_tr_selector; + }; + + +Authors +------- + +These patches were written by: + - Abel Gordon, abelg il.ibm.com + - Nadav Har'El, nyh il.ibm.com + - Orit Wasserman, oritw il.ibm.com + - Ben-Ami Yassor, benami il.ibm.com + - Muli Ben-Yehuda, muli il.ibm.com + +With contributions by: + - Anthony Liguori, aliguori us.ibm.com + - Mike Day, mdday us.ibm.com + - Michael Factor, factor il.ibm.com + - Zvi Dubitzky, dubi il.ibm.com + +And valuable reviews by: + - Avi Kivity, avi redhat.com + - Gleb Natapov, gleb redhat.com + - Marcelo Tosatti, mtosatti redhat.com + - Kevin Tian, kevin.tian intel.com + - and others. diff --git a/Documentation/virt/kvm/x86/running-nested-guests.rst b/Documentation/virt/kvm/x86/running-nested-guests.rst new file mode 100644 index 000000000000..bd70c69468ae --- /dev/null +++ b/Documentation/virt/kvm/x86/running-nested-guests.rst @@ -0,0 +1,276 @@ +============================== +Running nested guests with KVM +============================== + +A nested guest is the ability to run a guest inside another guest (it +can be KVM-based or a different hypervisor). The straightforward +example is a KVM guest that in turn runs on a KVM guest (the rest of +this document is built on this example):: + + .----------------. .----------------. + | | | | + | L2 | | L2 | + | (Nested Guest) | | (Nested Guest) | + | | | | + |----------------'--'----------------| + | | + | L1 (Guest Hypervisor) | + | KVM (/dev/kvm) | + | | + .------------------------------------------------------. + | L0 (Host Hypervisor) | + | KVM (/dev/kvm) | + |------------------------------------------------------| + | Hardware (with virtualization extensions) | + '------------------------------------------------------' + +Terminology: + +- L0 – level-0; the bare metal host, running KVM + +- L1 – level-1 guest; a VM running on L0; also called the "guest + hypervisor", as it itself is capable of running KVM. + +- L2 – level-2 guest; a VM running on L1, this is the "nested guest" + +.. note:: The above diagram is modelled after the x86 architecture; + s390x, ppc64 and other architectures are likely to have + a different design for nesting. + + For example, s390x always has an LPAR (LogicalPARtition) + hypervisor running on bare metal, adding another layer and + resulting in at least four levels in a nested setup — L0 (bare + metal, running the LPAR hypervisor), L1 (host hypervisor), L2 + (guest hypervisor), L3 (nested guest). + + This document will stick with the three-level terminology (L0, + L1, and L2) for all architectures; and will largely focus on + x86. + + +Use Cases +--------- + +There are several scenarios where nested KVM can be useful, to name a +few: + +- As a developer, you want to test your software on different operating + systems (OSes). Instead of renting multiple VMs from a Cloud + Provider, using nested KVM lets you rent a large enough "guest + hypervisor" (level-1 guest). 
This in turn allows you to create + multiple nested guests (level-2 guests), running different OSes, on + which you can develop and test your software. + +- Live migration of "guest hypervisors" and their nested guests, for + load balancing, disaster recovery, etc. + +- VM image creation tools (e.g. ``virt-install``, etc) often run + their own VM, and users expect these to work inside a VM. + +- Some OSes use virtualization internally for security (e.g. to let + applications run safely in isolation). + + +Enabling "nested" (x86) +----------------------- + +From Linux kernel v4.20 onwards, the ``nested`` KVM parameter is enabled +by default for Intel and AMD. (Though your Linux distribution might +override this default.) + +In case you are running a Linux kernel older than v4.19, to enable +nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``. To +persist this setting across reboots, you can add it in a config file, as +shown below: + +1. On the bare metal host (L0), list the kernel modules and ensure that + the KVM modules:: + + $ lsmod | grep -i kvm + kvm_intel 133627 0 + kvm 435079 1 kvm_intel + +2. Show information for ``kvm_intel`` module:: + + $ modinfo kvm_intel | grep -i nested + parm: nested:bool + +3. For the nested KVM configuration to persist across reboots, place the + below in ``/etc/modprobed/kvm_intel.conf`` (create the file if it + doesn't exist):: + + $ cat /etc/modprobe.d/kvm_intel.conf + options kvm-intel nested=y + +4. Unload and re-load the KVM Intel module:: + + $ sudo rmmod kvm-intel + $ sudo modprobe kvm-intel + +5. Verify if the ``nested`` parameter for KVM is enabled:: + + $ cat /sys/module/kvm_intel/parameters/nested + Y + +For AMD hosts, the process is the same as above, except that the module +name is ``kvm-amd``. + + +Additional nested-related kernel parameters (x86) +------------------------------------------------- + +If your hardware is sufficiently advanced (Intel Haswell processor or +higher, which has newer hardware virt extensions), the following +additional features will also be enabled by default: "Shadow VMCS +(Virtual Machine Control Structure)", APIC Virtualization on your bare +metal host (L0). Parameters for Intel hosts:: + + $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs + Y + + $ cat /sys/module/kvm_intel/parameters/enable_apicv + Y + + $ cat /sys/module/kvm_intel/parameters/ept + Y + +.. note:: If you suspect your L2 (i.e. nested guest) is running slower, + ensure the above are enabled (particularly + ``enable_shadow_vmcs`` and ``ept``). + + +Starting a nested guest (x86) +----------------------------- + +Once your bare metal host (L0) is configured for nesting, you should be +able to start an L1 guest with:: + + $ qemu-kvm -cpu host [...] + +The above will pass through the host CPU's capabilities as-is to the +gues); or for better live migration compatibility, use a named CPU +model supported by QEMU. e.g.:: + + $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on + +then the guest hypervisor will subsequently be capable of running a +nested guest with accelerated KVM. + + +Enabling "nested" (s390x) +------------------------- + +1. On the host hypervisor (L0), enable the ``nested`` parameter on + s390x:: + + $ rmmod kvm + $ modprobe kvm nested=1 + +.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive + with the ``nested`` paramter — i.e. to be able to enable + ``nested``, the ``hpage`` parameter *must* be disabled. + +2. 
The guest hypervisor (L1) must be provided with the ``sie`` CPU + feature — with QEMU, this can be done by using "host passthrough" + (via the command-line ``-cpu host``). + +3. Now the KVM module can be loaded in the L1 (guest hypervisor):: + + $ modprobe kvm + + +Live migration with nested KVM +------------------------------ + +Migrating an L1 guest, with a *live* nested guest in it, to another +bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for +Intel x86 systems, and even on older versions for s390x. + +On AMD systems, once an L1 guest has started an L2 guest, the L1 guest +should no longer be migrated or saved (refer to QEMU documentation on +"savevm"/"loadvm") until the L2 guest shuts down. Attempting to migrate +or save-and-load an L1 guest while an L2 guest is running will result in +undefined behavior. You might see a ``kernel BUG!`` entry in ``dmesg``, a +kernel 'oops', or an outright kernel panic. Such a migrated or loaded L1 +guest can no longer be considered stable or secure, and must be restarted. +Migrating an L1 guest merely configured to support nesting, while not +actually running L2 guests, is expected to function normally even on AMD +systems but may fail once guests are started. + +Migrating an L2 guest is always expected to succeed, so all the following +scenarios should work even on AMD systems: + +- Migrating a nested guest (L2) to another L1 guest on the *same* bare + metal host. + +- Migrating a nested guest (L2) to another L1 guest on a *different* + bare metal host. + +- Migrating a nested guest (L2) to a bare metal host. + +Reporting bugs from nested setups +----------------------------------- + +Debugging "nested" problems can involve sifting through log files across +L0, L1 and L2; this can result in tedious back-n-forth between the bug +reporter and the bug fixer. + +- Mention that you are in a "nested" setup. If you are running any kind + of "nesting" at all, say so. Unfortunately, this needs to be called + out because when reporting bugs, people tend to forget to even + *mention* that they're using nested virtualization. + +- Ensure you are actually running KVM on KVM. Sometimes people do not + have KVM enabled for their guest hypervisor (L1), which results in + them running with pure emulation or what QEMU calls it as "TCG", but + they think they're running nested KVM. Thus confusing "nested Virt" + (which could also mean, QEMU on KVM) with "nested KVM" (KVM on KVM). 
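+
+- A quick way to confirm that L1 really has working KVM (as opposed to
+  TCG) is to open ``/dev/kvm`` from inside L1.  The minimal check below
+  is illustrative only; it uses nothing beyond the stable
+  ``KVM_GET_API_VERSION`` ioctl::
+
+      #include <fcntl.h>
+      #include <stdio.h>
+      #include <sys/ioctl.h>
+      #include <unistd.h>
+      #include <linux/kvm.h>
+
+      int main(void)
+      {
+          int kvm = open("/dev/kvm", O_RDWR);
+
+          if (kvm < 0) {
+              /* No usable KVM here: QEMU would fall back to TCG. */
+              perror("open /dev/kvm");
+              return 1;
+          }
+
+          /* Any usable KVM reports API version 12. */
+          printf("KVM API version: %d\n",
+                 ioctl(kvm, KVM_GET_API_VERSION, 0));
+          close(kvm);
+          return 0;
+      }
+
+  Checking the L1 QEMU command line for ``-accel kvm`` (or
+  ``-enable-kvm``) is a reasonable cross-check.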
+ +Information to collect (generic) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The following is not an exhaustive list, but a very good starting point: + + - Kernel, libvirt, and QEMU version from L0 + + - Kernel, libvirt and QEMU version from L1 + + - QEMU command-line of L1 -- when using libvirt, you'll find it here: + ``/var/log/libvirt/qemu/instance.log`` + + - QEMU command-line of L2 -- as above, when using libvirt, get the + complete libvirt-generated QEMU command-line + + - ``cat /sys/cpuinfo`` from L0 + + - ``cat /sys/cpuinfo`` from L1 + + - ``lscpu`` from L0 + + - ``lscpu`` from L1 + + - Full ``dmesg`` output from L0 + + - Full ``dmesg`` output from L1 + +x86-specific info to collect +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Both the below commands, ``x86info`` and ``dmidecode``, should be +available on most Linux distributions with the same name: + + - Output of: ``x86info -a`` from L0 + + - Output of: ``x86info -a`` from L1 + + - Output of: ``dmidecode`` from L0 + + - Output of: ``dmidecode`` from L1 + +s390x-specific info to collect +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Along with the earlier mentioned generic details, the below is +also recommended: + + - ``/proc/sysinfo`` from L1; this will also include the info from L0 diff --git a/Documentation/virt/kvm/x86/timekeeping.rst b/Documentation/virt/kvm/x86/timekeeping.rst new file mode 100644 index 000000000000..21ae7efa29ba --- /dev/null +++ b/Documentation/virt/kvm/x86/timekeeping.rst @@ -0,0 +1,645 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================================== +Timekeeping Virtualization for X86-Based Architectures +====================================================== + +:Author: Zachary Amsden +:Copyright: (c) 2010, Red Hat. All rights reserved. + +.. Contents + + 1) Overview + 2) Timing Devices + 3) TSC Hardware + 4) Virtualization Problems + +1. Overview +=========== + +One of the most complicated parts of the X86 platform, and specifically, +the virtualization of this platform is the plethora of timing devices available +and the complexity of emulating those devices. In addition, virtualization of +time introduces a new set of challenges because it introduces a multiplexed +division of time beyond the control of the guest CPU. + +First, we will describe the various timekeeping hardware available, then +present some of the problems which arise and solutions available, giving +specific recommendations for certain classes of KVM guests. + +The purpose of this document is to collect data and information relevant to +timekeeping which may be difficult to find elsewhere, specifically, +information relevant to KVM and hardware-based virtualization. + +2. Timing Devices +================= + +First we discuss the basic hardware devices available. TSC and the related +KVM clock are special enough to warrant a full exposition and are described in +the following section. + +2.1. i8254 - PIT +---------------- + +One of the first timer devices available is the programmable interrupt timer, +or PIT. The PIT has a fixed frequency 1.193182 MHz base clock and three +channels which can be programmed to deliver periodic or one-shot interrupts. +These three channels can be configured in different modes and have individual +counters. Channel 1 and 2 were not available for general use in the original +IBM PC, and historically were connected to control RAM refresh and the PC +speaker. Now the PIT is typically integrated as part of an emulated chipset +and a separate physical PIT is not used. 
+ +The PIT uses I/O ports 0x40 - 0x43. Access to the 16-bit counters is done +using single or multiple byte access to the I/O ports. There are 6 modes +available, but not all modes are available to all timers, as only timer 2 +has a connected gate input, required for modes 1 and 5. The gate line is +controlled by port 61h, bit 0, as illustrated in the following diagram:: + + -------------- ---------------- + | | | | + | 1.1932 MHz|---------->| CLOCK OUT | ---------> IRQ 0 + | Clock | | | | + -------------- | +->| GATE TIMER 0 | + | ---------------- + | + | ---------------- + | | | + |------>| CLOCK OUT | ---------> 66.3 KHZ DRAM + | | | (aka /dev/null) + | +->| GATE TIMER 1 | + | ---------------- + | + | ---------------- + | | | + |------>| CLOCK OUT | ---------> Port 61h, bit 5 + | | | + Port 61h, bit 0 -------->| GATE TIMER 2 | \_.---- ____ + ---------------- _| )--|LPF|---Speaker + / *---- \___/ + Port 61h, bit 1 ---------------------------------/ + +The timer modes are now described. + +Mode 0: Single Timeout. + This is a one-shot software timeout that counts down + when the gate is high (always true for timers 0 and 1). When the count + reaches zero, the output goes high. + +Mode 1: Triggered One-shot. + The output is initially set high. When the gate + line is set high, a countdown is initiated (which does not stop if the gate is + lowered), during which the output is set low. When the count reaches zero, + the output goes high. + +Mode 2: Rate Generator. + The output is initially set high. When the countdown + reaches 1, the output goes low for one count and then returns high. The value + is reloaded and the countdown automatically resumes. If the gate line goes + low, the count is halted. If the output is low when the gate is lowered, the + output automatically goes high (this only affects timer 2). + +Mode 3: Square Wave. + This generates a high / low square wave. The count + determines the length of the pulse, which alternates between high and low + when zero is reached. The count only proceeds when gate is high and is + automatically reloaded on reaching zero. The count is decremented twice at + each clock to generate a full high / low cycle at the full periodic rate. + If the count is even, the clock remains high for N/2 counts and low for N/2 + counts; if the clock is odd, the clock is high for (N+1)/2 counts and low + for (N-1)/2 counts. Only even values are latched by the counter, so odd + values are not observed when reading. This is the intended mode for timer 2, + which generates sine-like tones by low-pass filtering the square wave output. + +Mode 4: Software Strobe. + After programming this mode and loading the counter, + the output remains high until the counter reaches zero. Then the output + goes low for 1 clock cycle and returns high. The counter is not reloaded. + Counting only occurs when gate is high. + +Mode 5: Hardware Strobe. + After programming and loading the counter, the + output remains high. When the gate is raised, a countdown is initiated + (which does not stop if the gate is lowered). When the counter reaches zero, + the output goes low for 1 clock cycle and then returns high. The counter is + not reloaded. + +In addition to normal binary counting, the PIT supports BCD counting. The +command port, 0x43 is used to set the counter and mode for each of the three +timers. 
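+
+As a concrete sketch of what this looks like in practice, the fragment below
+programs counter 0 as a mode 2 rate generator: one command byte on port 0x43
+(the encoding is detailed in the command table that follows) and then the
+16-bit reload value on port 0x40.  It is illustrative only -- it needs ring 0
+or ``ioperm()`` to actually run, and the helper name and 100 Hz divisor are
+just examples::
+
+    #include <stdint.h>
+
+    /* Minimal x86 port-output helper (GCC/Clang inline asm). */
+    static inline void outb(uint8_t val, uint16_t port)
+    {
+            __asm__ volatile("outb %0, %1" : : "a"(val), "Nd"(port));
+    }
+
+    #define PIT_CH0_DATA 0x40
+    #define PIT_CMD      0x43
+
+    /*
+     * 0x34 = counter 0, 16-bit (LSB then MSB) access, mode 2 (rate
+     * generator), binary counting.
+     */
+    static void pit_set_periodic(uint16_t reload)
+    {
+            outb(0x34, PIT_CMD);
+            outb(reload & 0xff, PIT_CH0_DATA);         /* LSB first ... */
+            outb((reload >> 8) & 0xff, PIT_CH0_DATA);  /* ... then MSB  */
+    }
+
+    /* Example: ~100 Hz tick, 1193182 / 100 ~= 11932. */
+    /* pit_set_periodic(11932); */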
+ +PIT commands, issued to port 0x43, using the following bit encoding:: + + Bit 7-4: Command (See table below) + Bit 3-1: Mode (000 = Mode 0, 101 = Mode 5, 11X = undefined) + Bit 0 : Binary (0) / BCD (1) + +Command table:: + + 0000 - Latch Timer 0 count for port 0x40 + sample and hold the count to be read in port 0x40; + additional commands ignored until counter is read; + mode bits ignored. + + 0001 - Set Timer 0 LSB mode for port 0x40 + set timer to read LSB only and force MSB to zero; + mode bits set timer mode + + 0010 - Set Timer 0 MSB mode for port 0x40 + set timer to read MSB only and force LSB to zero; + mode bits set timer mode + + 0011 - Set Timer 0 16-bit mode for port 0x40 + set timer to read / write LSB first, then MSB; + mode bits set timer mode + + 0100 - Latch Timer 1 count for port 0x41 - as described above + 0101 - Set Timer 1 LSB mode for port 0x41 - as described above + 0110 - Set Timer 1 MSB mode for port 0x41 - as described above + 0111 - Set Timer 1 16-bit mode for port 0x41 - as described above + + 1000 - Latch Timer 2 count for port 0x42 - as described above + 1001 - Set Timer 2 LSB mode for port 0x42 - as described above + 1010 - Set Timer 2 MSB mode for port 0x42 - as described above + 1011 - Set Timer 2 16-bit mode for port 0x42 as described above + + 1101 - General counter latch + Latch combination of counters into corresponding ports + Bit 3 = Counter 2 + Bit 2 = Counter 1 + Bit 1 = Counter 0 + Bit 0 = Unused + + 1110 - Latch timer status + Latch combination of counter mode into corresponding ports + Bit 3 = Counter 2 + Bit 2 = Counter 1 + Bit 1 = Counter 0 + + The output of ports 0x40-0x42 following this command will be: + + Bit 7 = Output pin + Bit 6 = Count loaded (0 if timer has expired) + Bit 5-4 = Read / Write mode + 01 = MSB only + 10 = LSB only + 11 = LSB / MSB (16-bit) + Bit 3-1 = Mode + Bit 0 = Binary (0) / BCD mode (1) + +2.2. RTC +-------- + +The second device which was available in the original PC was the MC146818 real +time clock. The original device is now obsolete, and usually emulated by the +system chipset, sometimes by an HPET and some frankenstein IRQ routing. + +The RTC is accessed through CMOS variables, which uses an index register to +control which bytes are read. Since there is only one index register, read +of the CMOS and read of the RTC require lock protection (in addition, it is +dangerous to allow userspace utilities such as hwclock to have direct RTC +access, as they could corrupt kernel reads and writes of CMOS memory). + +The RTC generates an interrupt which is usually routed to IRQ 8. The interrupt +can function as a periodic timer, an additional once a day alarm, and can issue +interrupts after an update of the CMOS registers by the MC146818 is complete. +The type of interrupt is signalled in the RTC status registers. + +The RTC will update the current time fields by battery power even while the +system is off. The current time fields should not be read while an update is +in progress, as indicated in the status register. + +The clock uses a 32.768kHz crystal, so bits 6-4 of register A should be +programmed to a 32kHz divider if the RTC is to count seconds. 
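+
+A minimal sketch of a "safe" read of the seconds field, honoring the
+update-in-progress bit, is shown below.  The index/data ports (0x70/0x71)
+and register numbers are the conventional ones (see the RAM map that
+follows); real code would additionally need the lock protection discussed
+above, and the helper names are purely illustrative::
+
+    #include <stdint.h>
+
+    static inline void outb(uint8_t val, uint16_t port)
+    {
+            __asm__ volatile("outb %0, %1" : : "a"(val), "Nd"(port));
+    }
+
+    static inline uint8_t inb(uint16_t port)
+    {
+            uint8_t val;
+
+            __asm__ volatile("inb %1, %0" : "=a"(val) : "Nd"(port));
+            return val;
+    }
+
+    #define CMOS_INDEX   0x70
+    #define CMOS_DATA    0x71
+    #define RTC_SECONDS  0x00
+    #define RTC_REG_A    0x0a
+    #define RTC_UIP      0x80   /* register A bit 7: update in progress */
+
+    static uint8_t cmos_read(uint8_t reg)
+    {
+            outb(reg, CMOS_INDEX);
+            return inb(CMOS_DATA);
+    }
+
+    /* Wait out an in-progress update, then read the seconds field. */
+    static unsigned int rtc_read_seconds(void)
+    {
+            uint8_t bcd;
+
+            while (cmos_read(RTC_REG_A) & RTC_UIP)
+                    ;
+            bcd = cmos_read(RTC_SECONDS);
+            return (bcd >> 4) * 10 + (bcd & 0x0f);   /* BCD is the default format */
+    }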
+ +This is the RAM map originally used for the RTC/CMOS:: + + Location Size Description + ------------------------------------------ + 00h byte Current second (BCD) + 01h byte Seconds alarm (BCD) + 02h byte Current minute (BCD) + 03h byte Minutes alarm (BCD) + 04h byte Current hour (BCD) + 05h byte Hours alarm (BCD) + 06h byte Current day of week (BCD) + 07h byte Current day of month (BCD) + 08h byte Current month (BCD) + 09h byte Current year (BCD) + 0Ah byte Register A + bit 7 = Update in progress + bit 6-4 = Divider for clock + 000 = 4.194 MHz + 001 = 1.049 MHz + 010 = 32 kHz + 10X = test modes + 110 = reset / disable + 111 = reset / disable + bit 3-0 = Rate selection for periodic interrupt + 000 = periodic timer disabled + 001 = 3.90625 uS + 010 = 7.8125 uS + 011 = .122070 mS + 100 = .244141 mS + ... + 1101 = 125 mS + 1110 = 250 mS + 1111 = 500 mS + 0Bh byte Register B + bit 7 = Run (0) / Halt (1) + bit 6 = Periodic interrupt enable + bit 5 = Alarm interrupt enable + bit 4 = Update-ended interrupt enable + bit 3 = Square wave interrupt enable + bit 2 = BCD calendar (0) / Binary (1) + bit 1 = 12-hour mode (0) / 24-hour mode (1) + bit 0 = 0 (DST off) / 1 (DST enabled) + OCh byte Register C (read only) + bit 7 = interrupt request flag (IRQF) + bit 6 = periodic interrupt flag (PF) + bit 5 = alarm interrupt flag (AF) + bit 4 = update interrupt flag (UF) + bit 3-0 = reserved + ODh byte Register D (read only) + bit 7 = RTC has power + bit 6-0 = reserved + 32h byte Current century BCD (*) + (*) location vendor specific and now determined from ACPI global tables + +2.3. APIC +--------- + +On Pentium and later processors, an on-board timer is available to each CPU +as part of the Advanced Programmable Interrupt Controller. The APIC is +accessed through memory-mapped registers and provides interrupt service to each +CPU, used for IPIs and local timer interrupts. + +Although in theory the APIC is a safe and stable source for local interrupts, +in practice, many bugs and glitches have occurred due to the special nature of +the APIC CPU-local memory-mapped hardware. Beware that CPU errata may affect +the use of the APIC and that workarounds may be required. In addition, some of +these workarounds pose unique constraints for virtualization - requiring either +extra overhead incurred from extra reads of memory-mapped I/O or additional +functionality that may be more computationally expensive to implement. + +Since the APIC is documented quite well in the Intel and AMD manuals, we will +avoid repetition of the detail here. It should be pointed out that the APIC +timer is programmed through the LVT (local vector timer) register, is capable +of one-shot or periodic operation, and is based on the bus clock divided down +by the programmable divider register. + +2.4. HPET +--------- + +HPET is quite complex, and was originally intended to replace the PIT / RTC +support of the X86 PC. It remains to be seen whether that will be the case, as +the de facto standard of PC hardware is to emulate these older devices. Some +systems designated as legacy free may support only the HPET as a hardware timer +device. + +The HPET spec is rather loose and vague, requiring at least 3 hardware timers, +but allowing implementation freedom to support many more. It also imposes no +fixed rate on the timer frequency, but does impose some extremal values on +frequency, error and slew. 
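+
+Because the tick period is implementation-defined, software reads it from
+the general capabilities register rather than assuming a rate.  A rough
+sketch of reading the period and the free-running main counter is shown
+below; the register offsets are those of the HPET specification, while the
+base address is only an example -- in practice it comes from the ACPI HPET
+table and must be mapped (e.g. with ``ioremap()``) before use::
+
+    #include <stdint.h>
+
+    #define HPET_CAPS          0x000   /* bits 63:32 = tick period in femtoseconds */
+    #define HPET_MAIN_COUNTER  0x0f0   /* free-running main counter */
+
+    static inline uint64_t hpet_read(volatile void *base, unsigned long off)
+    {
+            return *(volatile uint64_t *)((volatile char *)base + off);
+    }
+
+    /* Femtoseconds per tick, as reported by the hardware itself. */
+    static uint32_t hpet_period_fs(volatile void *base)
+    {
+            return (uint32_t)(hpet_read(base, HPET_CAPS) >> 32);
+    }
+
+    static uint64_t hpet_now(volatile void *base)
+    {
+            return hpet_read(base, HPET_MAIN_COUNTER);
+    }
+
+    /* `base' would be the mapped address from the ACPI HPET table, e.g. the */
+    /* conventional 0xfed00000 on many PC platforms (assumed, not universal). */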
+ +In general, the HPET is recommended as a high precision (compared to PIT /RTC) +time source which is independent of local variation (as there is only one HPET +in any given system). The HPET is also memory-mapped, and its presence is +indicated through ACPI tables by the BIOS. + +Detailed specification of the HPET is beyond the current scope of this +document, as it is also very well documented elsewhere. + +2.5. Offboard Timers +-------------------- + +Several cards, both proprietary (watchdog boards) and commonplace (e1000) have +timing chips built into the cards which may have registers which are accessible +to kernel or user drivers. To the author's knowledge, using these to generate +a clocksource for a Linux or other kernel has not yet been attempted and is in +general frowned upon as not playing by the agreed rules of the game. Such a +timer device would require additional support to be virtualized properly and is +not considered important at this time as no known operating system does this. + +3. TSC Hardware +=============== + +The TSC or time stamp counter is relatively simple in theory; it counts +instruction cycles issued by the processor, which can be used as a measure of +time. In practice, due to a number of problems, it is the most complicated +timekeeping device to use. + +The TSC is represented internally as a 64-bit MSR which can be read with the +RDMSR, RDTSC, or RDTSCP (when available) instructions. In the past, hardware +limitations made it possible to write the TSC, but generally on old hardware it +was only possible to write the low 32-bits of the 64-bit counter, and the upper +32-bits of the counter were cleared. Now, however, on Intel processors family +0Fh, for models 3, 4 and 6, and family 06h, models e and f, this restriction +has been lifted and all 64-bits are writable. On AMD systems, the ability to +write the TSC MSR is not an architectural guarantee. + +The TSC is accessible from CPL-0 and conditionally, for CPL > 0 software by +means of the CR4.TSD bit, which when enabled, disables CPL > 0 TSC access. + +Some vendors have implemented an additional instruction, RDTSCP, which returns +atomically not just the TSC, but an indicator which corresponds to the +processor number. This can be used to index into an array of TSC variables to +determine offset information in SMP systems where TSCs are not synchronized. +The presence of this instruction must be determined by consulting CPUID feature +bits. + +Both VMX and SVM provide extension fields in the virtualization hardware which +allows the guest visible TSC to be offset by a constant. Newer implementations +promise to allow the TSC to additionally be scaled, but this hardware is not +yet widely available. + +3.1. TSC synchronization +------------------------ + +The TSC is a CPU-local clock in most implementations. This means, on SMP +platforms, the TSCs of different CPUs may start at different times depending +on when the CPUs are powered on. Generally, CPUs on the same die will share +the same clock, however, this is not always the case. + +The BIOS may attempt to resynchronize the TSCs during the poweron process and +the operating system or other system software may attempt to do this as well. +Several hardware limitations make the problem worse - if it is not possible to +write the full 64-bits of the TSC, it may be impossible to match the TSC in +newly arriving CPUs to that of the rest of the system, resulting in +unsynchronized TSCs. 
This may be done by BIOS or system software, but in +practice, getting a perfectly synchronized TSC will not be possible unless all +values are read from the same clock, which generally only is possible on single +socket systems or those with special hardware support. + +3.2. TSC and CPU hotplug +------------------------ + +As touched on already, CPUs which arrive later than the boot time of the system +may not have a TSC value that is synchronized with the rest of the system. +Either system software, BIOS, or SMM code may actually try to establish the TSC +to a value matching the rest of the system, but a perfect match is usually not +a guarantee. This can have the effect of bringing a system from a state where +TSC is synchronized back to a state where TSC synchronization flaws, however +small, may be exposed to the OS and any virtualization environment. + +3.3. TSC and multi-socket / NUMA +-------------------------------- + +Multi-socket systems, especially large multi-socket systems are likely to have +individual clocksources rather than a single, universally distributed clock. +Since these clocks are driven by different crystals, they will not have +perfectly matched frequency, and temperature and electrical variations will +cause the CPU clocks, and thus the TSCs to drift over time. Depending on the +exact clock and bus design, the drift may or may not be fixed in absolute +error, and may accumulate over time. + +In addition, very large systems may deliberately slew the clocks of individual +cores. This technique, known as spread-spectrum clocking, reduces EMI at the +clock frequency and harmonics of it, which may be required to pass FCC +standards for telecommunications and computer equipment. + +It is recommended not to trust the TSCs to remain synchronized on NUMA or +multiple socket systems for these reasons. + +3.4. TSC and C-states +--------------------- + +C-states, or idling states of the processor, especially C1E and deeper sleep +states may be problematic for TSC as well. The TSC may stop advancing in such +a state, resulting in a TSC which is behind that of other CPUs when execution +is resumed. Such CPUs must be detected and flagged by the operating system +based on CPU and chipset identifications. + +The TSC in such a case may be corrected by catching it up to a known external +clocksource. + +3.5. TSC frequency change / P-states +------------------------------------ + +To make things slightly more interesting, some CPUs may change frequency. They +may or may not run the TSC at the same rate, and because the frequency change +may be staggered or slewed, at some points in time, the TSC rate may not be +known other than falling within a range of values. In this case, the TSC will +not be a stable time source, and must be calibrated against a known, stable, +external clock to be a usable source of time. + +Whether the TSC runs at a constant rate or scales with the P-state is model +dependent and must be determined by inspecting CPUID, chipset or vendor +specific MSR fields. + +In addition, some vendors have known bugs where the P-state is actually +compensated for properly during normal operation, but when the processor is +inactive, the P-state may be raised temporarily to service cache misses from +other processors. In such cases, the TSC on halted CPUs could advance faster +than that of non-halted processors. AMD Turion processors are known to have +this problem. + +3.6. 
TSC and STPCLK / T-states
+------------------------------
+
+External signals given to the processor may also have the effect of stopping
+the TSC.  This is typically done for thermal emergency power control to
+prevent an overheating condition, and there is usually no way to detect that
+this condition has happened.
+
+3.7. TSC virtualization - VMX
+-----------------------------
+
+VMX provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
+instructions, which is enough for full virtualization of TSC in any manner.  In
+addition, VMX allows passing through the host TSC plus an additional TSC_OFFSET
+field specified in the VMCS.  Special instructions must be used to read and
+write the VMCS field.
+
+3.8. TSC virtualization - SVM
+-----------------------------
+
+SVM provides conditional trapping of RDTSC, RDMSR, WRMSR and RDTSCP
+instructions, which is enough for full virtualization of TSC in any manner.  In
+addition, SVM allows passing through the host TSC plus an additional offset
+field specified in the SVM control block.
+
+3.9. TSC feature bits in Linux
+------------------------------
+
+In summary, there is no way to guarantee the TSC remains in perfect
+synchronization unless it is explicitly guaranteed by the architecture.  Even
+if so, the TSCs in multi-socket or NUMA systems may still run independently
+despite being locally consistent.
+
+The following feature bits are used by Linux to signal various TSC attributes,
+but they can only be taken to be meaningful for UP or single node systems.
+
+========================= =======================================
+X86_FEATURE_TSC           The TSC is available in hardware
+X86_FEATURE_RDTSCP        The RDTSCP instruction is available
+X86_FEATURE_CONSTANT_TSC  The TSC rate is unchanged with P-states
+X86_FEATURE_NONSTOP_TSC   The TSC does not stop in C-states
+X86_FEATURE_TSC_RELIABLE  TSC sync checks are skipped (VMware)
+========================= =======================================
+
+4. Virtualization Problems
+==========================
+
+Timekeeping is especially problematic for virtualization because a number of
+challenges arise.  The most obvious problem is that time is now shared between
+the host and, potentially, a number of virtual machines.  Thus the virtual
+operating system does not run with 100% usage of the CPU, despite the fact that
+it may very well make that assumption.  It may expect it to remain true to very
+exacting bounds when interrupt sources are disabled, but in reality only its
+virtual interrupt sources are disabled, and the machine may still be preempted
+at any time.  This causes problems as the passage of real time, the injection
+of machine interrupts and the associated clock sources are no longer completely
+synchronized with real time.
+
+This same problem can occur on native hardware to a degree, as SMM mode may
+steal cycles from the operating system on X86 systems when SMM is used by
+the BIOS, but not in such an extreme fashion.  However, the fact that SMM mode
+may cause similar problems to virtualization makes it a good justification for
+solving many of these problems on bare metal.
+
+4.1. Interrupt clocking
+-----------------------
+
+One of the most immediate problems that occurs with legacy operating systems
+is that the system timekeeping routines are often designed to keep track of
+time by counting periodic interrupts.
These interrupts may come from the PIT +or the RTC, but the problem is the same: the host virtualization engine may not +be able to deliver the proper number of interrupts per second, and so guest +time may fall behind. This is especially problematic if a high interrupt rate +is selected, such as 1000 HZ, which is unfortunately the default for many Linux +guests. + +There are three approaches to solving this problem; first, it may be possible +to simply ignore it. Guests which have a separate time source for tracking +'wall clock' or 'real time' may not need any adjustment of their interrupts to +maintain proper time. If this is not sufficient, it may be necessary to inject +additional interrupts into the guest in order to increase the effective +interrupt rate. This approach leads to complications in extreme conditions, +where host load or guest lag is too much to compensate for, and thus another +solution to the problem has risen: the guest may need to become aware of lost +ticks and compensate for them internally. Although promising in theory, the +implementation of this policy in Linux has been extremely error prone, and a +number of buggy variants of lost tick compensation are distributed across +commonly used Linux systems. + +Windows uses periodic RTC clocking as a means of keeping time internally, and +thus requires interrupt slewing to keep proper time. It does use a low enough +rate (ed: is it 18.2 Hz?) however that it has not yet been a problem in +practice. + +4.2. TSC sampling and serialization +----------------------------------- + +As the highest precision time source available, the cycle counter of the CPU +has aroused much interest from developers. As explained above, this timer has +many problems unique to its nature as a local, potentially unstable and +potentially unsynchronized source. One issue which is not unique to the TSC, +but is highlighted because of its very precise nature is sampling delay. By +definition, the counter, once read is already old. However, it is also +possible for the counter to be read ahead of the actual use of the result. +This is a consequence of the superscalar execution of the instruction stream, +which may execute instructions out of order. Such execution is called +non-serialized. Forcing serialized execution is necessary for precise +measurement with the TSC, and requires a serializing instruction, such as CPUID +or an MSR read. + +Since CPUID may actually be virtualized by a trap and emulate mechanism, this +serialization can pose a performance issue for hardware virtualization. An +accurate time stamp counter reading may therefore not always be available, and +it may be necessary for an implementation to guard against "backwards" reads of +the TSC as seen from other CPUs, even in an otherwise perfectly synchronized +system. + +4.3. Timespec aliasing +---------------------- + +Additionally, this lack of serialization from the TSC poses another challenge +when using results of the TSC when measured against another time source. As +the TSC is much higher precision, many possible values of the TSC may be read +while another clock is still expressing the same value. + +That is, you may read (T,T+10) while external clock C maintains the same value. +Due to non-serialized reads, you may actually end up with a range which +fluctuates - from (T-1.. T+10). Thus, any time calculated from a TSC, but +calibrated against an external value may have a range of valid values. 
+Re-calibrating this computation may actually cause time, as computed after the +calibration, to go backwards, compared with time computed before the +calibration. + +This problem is particularly pronounced with an internal time source in Linux, +the kernel time, which is expressed in the theoretically high resolution +timespec - but which advances in much larger granularity intervals, sometimes +at the rate of jiffies, and possibly in catchup modes, at a much larger step. + +This aliasing requires care in the computation and recalibration of kvmclock +and any other values derived from TSC computation (such as TSC virtualization +itself). + +4.4. Migration +-------------- + +Migration of a virtual machine raises problems for timekeeping in two ways. +First, the migration itself may take time, during which interrupts cannot be +delivered, and after which, the guest time may need to be caught up. NTP may +be able to help to some degree here, as the clock correction required is +typically small enough to fall in the NTP-correctable window. + +An additional concern is that timers based off the TSC (or HPET, if the raw bus +clock is exposed) may now be running at different rates, requiring compensation +in some way in the hypervisor by virtualizing these timers. In addition, +migrating to a faster machine may preclude the use of a passthrough TSC, as a +faster clock cannot be made visible to a guest without the potential of time +advancing faster than usual. A slower clock is less of a problem, as it can +always be caught up to the original rate. KVM clock avoids these problems by +simply storing multipliers and offsets against the TSC for the guest to convert +back into nanosecond resolution values. + +4.5. Scheduling +--------------- + +Since scheduling may be based on precise timing and firing of interrupts, the +scheduling algorithms of an operating system may be adversely affected by +virtualization. In theory, the effect is random and should be universally +distributed, but in contrived as well as real scenarios (guest device access, +causes of virtualization exits, possible context switch), this may not always +be the case. The effect of this has not been well studied. + +In an attempt to work around this, several implementations have provided a +paravirtualized scheduler clock, which reveals the true amount of CPU time for +which a virtual machine has been running. + +4.6. Watchdogs +-------------- + +Watchdog timers, such as the lock detector in Linux may fire accidentally when +running under hardware virtualization due to timer interrupts being delayed or +misinterpretation of the passage of real time. Usually, these warnings are +spurious and can be ignored, but in some circumstances it may be necessary to +disable such detection. + +4.7. Delays and precision timing +-------------------------------- + +Precise timing and delays may not be possible in a virtualized system. This +can happen if the system is controlling physical hardware, or issues delays to +compensate for slower I/O to and from devices. The first issue is not solvable +in general for a virtualized system; hardware control software can't be +adequately virtualized without a full real-time operating system, which would +require an RT aware virtualization platform. + +The second issue may cause performance problems, but this is unlikely to be a +significant issue. In many cases these delays may be eliminated through +configuration or paravirtualization. + +4.8. 
Covert channels and leaks +------------------------------ + +In addition to the above problems, time information will inevitably leak to the +guest about the host in anything but a perfect implementation of virtualized +time. This may allow the guest to infer the presence of a hypervisor (as in a +red-pill type detection), and it may allow information to leak between guests +by using CPU utilization itself as a signalling channel. Preventing such +problems would require completely isolated virtual time which may not track +real time any longer. This may be useful in certain security or QA contexts, +but in general isn't recommended for real-world deployment scenarios. -- cgit v1.2.3