Diffstat (limited to 'Documentation/virt')
-rw-r--r-- | Documentation/virt/hyperv/hibernation.rst | 336
-rw-r--r-- | Documentation/virt/hyperv/index.rst | 1
-rw-r--r-- | Documentation/virt/hyperv/vmbus.rst | 28
-rw-r--r-- | Documentation/virt/kvm/api.rst | 1128
-rw-r--r-- | Documentation/virt/kvm/arm/fw-pseudo-registers.rst | 15
-rw-r--r-- | Documentation/virt/kvm/arm/hypercalls.rst | 59
-rw-r--r-- | Documentation/virt/kvm/devices/arm-vgic-its.rst | 5
-rw-r--r-- | Documentation/virt/kvm/devices/arm-vgic-v3.rst | 89
-rw-r--r-- | Documentation/virt/kvm/devices/s390_flic.rst | 4
-rw-r--r-- | Documentation/virt/kvm/devices/vcpu.rst | 38
-rw-r--r-- | Documentation/virt/kvm/locking.rst | 84
-rw-r--r-- | Documentation/virt/kvm/review-checklist.rst | 95
-rw-r--r-- | Documentation/virt/kvm/x86/errata.rst | 12
-rw-r--r-- | Documentation/virt/kvm/x86/index.rst | 1
-rw-r--r-- | Documentation/virt/kvm/x86/intel-tdx.rst | 268
-rw-r--r-- | Documentation/virt/uml/user_mode_linux_howto_v2.rst | 47
16 files changed, 1647 insertions, 563 deletions
diff --git a/Documentation/virt/hyperv/hibernation.rst b/Documentation/virt/hyperv/hibernation.rst new file mode 100644 index 000000000000..4ff27f4a317a --- /dev/null +++ b/Documentation/virt/hyperv/hibernation.rst @@ -0,0 +1,336 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Hibernating Guest VMs +===================== + +Background +---------- +Linux supports the ability to hibernate itself in order to save power. +Hibernation is sometimes called suspend-to-disk, as it writes a memory +image to disk and puts the hardware into the lowest possible power +state. Upon resume from hibernation, the hardware is restarted and the +memory image is restored from disk so that it can resume execution +where it left off. See the "Hibernation" section of +Documentation/admin-guide/pm/sleep-states.rst. + +Hibernation is usually done on devices with a single user, such as a +personal laptop. For example, the laptop goes into hibernation when +the cover is closed, and resumes when the cover is opened again. +Hibernation and resume happen on the same hardware, and Linux kernel +code orchestrating the hibernation steps assumes that the hardware +configuration is not changed while in the hibernated state. + +Hibernation can be initiated within Linux by writing "disk" to +/sys/power/state or by invoking the reboot system call with the +appropriate arguments. This functionality may be wrapped by user space +commands such "systemctl hibernate" that are run directly from a +command line or in response to events such as the laptop lid closing. + +Considerations for Guest VM Hibernation +--------------------------------------- +Linux guests on Hyper-V can also be hibernated, in which case the +hardware is the virtual hardware provided by Hyper-V to the guest VM. +Only the targeted guest VM is hibernated, while other guest VMs and +the underlying Hyper-V host continue to run normally. While the +underlying Windows Hyper-V and physical hardware on which it is +running might also be hibernated using hibernation functionality in +the Windows host, host hibernation and its impact on guest VMs is not +in scope for this documentation. + +Resuming a hibernated guest VM can be more challenging than with +physical hardware because VMs make it very easy to change the hardware +configuration between the hibernation and resume. Even when the resume +is done on the same VM that hibernated, the memory size might be +changed, or virtual NICs or SCSI controllers might be added or +removed. Virtual PCI devices assigned to the VM might be added or +removed. Most such changes cause the resume steps to fail, though +adding a new virtual NIC, SCSI controller, or vPCI device should work. + +Additional complexity can ensue because the disks of the hibernated VM +can be moved to another newly created VM that otherwise has the same +virtual hardware configuration. While it is desirable for resume from +hibernation to succeed after such a move, there are challenges. See +details on this scenario and its limitations in the "Resuming on a +Different VM" section below. + +Hyper-V also provides ways to move a VM from one Hyper-V host to +another. Hyper-V tries to ensure processor model and Hyper-V version +compatibility using VM Configuration Versions, and prevents moves to +a host that isn't compatible. Linux adapts to host and processor +differences by detecting them at boot time, but such detection is not +done when resuming execution in the hibernation image. 
If a VM is +hibernated on one host, then resumed on a host with a different processor +model or Hyper-V version, settings recorded in the hibernation image +may not match the new host. Because Linux does not detect such +mismatches when resuming the hibernation image, undefined behavior +and failures could result. + + +Enabling Guest VM Hibernation +----------------------------- +Hibernation of a Hyper-V guest VM is disabled by default because +hibernation is incompatible with memory hot-add, as provided by the +Hyper-V balloon driver. If hot-add is used and the VM hibernates, it +hibernates with more memory than it started with. But when the VM +resumes from hibernation, Hyper-V gives the VM only the originally +assigned memory, and the memory size mismatch causes resume to fail. + +To enable a Hyper-V VM for hibernation, the Hyper-V administrator must +enable the ACPI virtual S4 sleep state in the ACPI configuration that +Hyper-V provides to the guest VM. Such enablement is accomplished by +modifying a WMI property of the VM, the steps for which are outside +the scope of this documentation but are available on the web. +Enablement is treated as the indicator that the administrator +prioritizes Linux hibernation in the VM over hot-add, so the Hyper-V +balloon driver in Linux disables hot-add. Enablement is indicated if +the contents of /sys/power/disk contains "platform" as an option. The +enablement is also visible in /sys/bus/vmbus/hibernation. See function +hv_is_hibernation_supported(). + +Linux supports ACPI sleep states on x86, but not on arm64. So Linux +guest VM hibernation is not available on Hyper-V for arm64. + +Initiating Guest VM Hibernation +------------------------------- +Guest VMs can self-initiate hibernation using the standard Linux +methods of writing "disk" to /sys/power/state or the reboot system +call. As an additional layer, Linux guests on Hyper-V support the +"Shutdown" integration service, via which a Hyper-V administrator can +tell a Linux VM to hibernate using a command outside the VM. The +command generates a request to the Hyper-V shutdown driver in Linux, +which sends the uevent "EVENT=hibernate". See kernel functions +shutdown_onchannelcallback() and send_hibernate_uevent(). A udev rule +must be provided in the VM that handles this event and initiates +hibernation. + +Handling VMBus Devices During Hibernation & Resume +-------------------------------------------------- +The VMBus bus driver, and the individual VMBus device drivers, +implement suspend and resume functions that are called as part of the +Linux orchestration of hibernation and of resuming from hibernation. +The overall approach is to leave in place the data structures for the +primary VMBus channels and their associated Linux devices, such as +SCSI controllers and others, so that they are captured in the +hibernation image. This approach allows any state associated with the +device to be persisted across the hibernation/resume. When the VM +resumes, the devices are re-offered by Hyper-V and are connected to +the data structures that already exist in the resumed hibernation +image. + +VMBus devices are identified by class and instance GUID. (See section +"VMBus device creation/deletion" in +Documentation/virt/hyperv/vmbus.rst.) Upon resume from hibernation, +the resume functions expect that the devices offered by Hyper-V have +the same class/instance GUIDs as the devices present at the time of +hibernation. 
Having the same class/instance GUIDs allows the offered +devices to be matched to the primary VMBus channel data structures in +the memory of the now resumed hibernation image. If any devices are +offered that don't match primary VMBus channel data structures that +already exist, they are processed normally as newly added devices. If +primary VMBus channels that exist in the resumed hibernation image are +not matched with a device offered in the resumed VM, the resume +sequence waits for 10 seconds, then proceeds. But the unmatched device +is likely to cause errors in the resumed VM. + +When resuming existing primary VMBus channels, the newly offered +relids might be different because relids can change on each VM boot, +even if the VM configuration hasn't changed. The VMBus bus driver +resume function matches the class/instance GUIDs, and updates the +relids in case they have changed. + +VMBus sub-channels are not persisted in the hibernation image. Each +VMBus device driver's suspend function must close any sub-channels +prior to hibernation. Closing a sub-channel causes Hyper-V to send a +RESCIND_CHANNELOFFER message, which Linux processes by freeing the +channel data structures so that all vestiges of the sub-channel are +removed. By contrast, primary channels are marked closed and their +ring buffers are freed, but Hyper-V does not send a rescind message, +so the channel data structure continues to exist. Upon resume, the +device driver's resume function re-allocates the ring buffer and +re-opens the existing channel. It then communicates with Hyper-V to +re-open sub-channels from scratch. + +The Linux ends of Hyper-V sockets are forced closed at the time of +hibernation. The guest can't force closing the host end of the socket, +but any host-side actions on the host end will produce an error. + +VMBus devices use the same suspend function for the "freeze" and the +"poweroff" phases, and the same resume function for the "thaw" and +"restore" phases. See the "Entering Hibernation" section of +Documentation/driver-api/pm/devices.rst for the sequencing of the +phases. + +Detailed Hibernation Sequence +----------------------------- +1. The Linux power management (PM) subsystem prepares for + hibernation by freezing user space processes and allocating + memory to hold the hibernation image. +2. As part of the "freeze" phase, Linux PM calls the "suspend" + function for each VMBus device in turn. As described above, this + function removes sub-channels, and leaves the primary channel in + a closed state. +3. Linux PM calls the "suspend" function for the VMBus bus, which + closes any Hyper-V socket channels and unloads the top-level + VMBus connection with the Hyper-V host. +4. Linux PM disables non-boot CPUs, creates the hibernation image in + the previously allocated memory, then re-enables non-boot CPUs. + The hibernation image contains the memory data structures for the + closed primary channels, but no sub-channels. +5. As part of the "thaw" phase, Linux PM calls the "resume" function + for the VMBus bus, which re-establishes the top-level VMBus + connection and requests that Hyper-V re-offer the VMBus devices. + As offers are received for the primary channels, the relids are + updated as previously described. +6. Linux PM calls the "resume" function for each VMBus device. Each + device re-opens its primary channel, and communicates with Hyper-V + to re-establish sub-channels if appropriate. 
The sub-channels + are re-created as new channels since they were previously removed + entirely in Step 2. +7. With VMBus devices now working again, Linux PM writes the + hibernation image from memory to disk. +8. Linux PM repeats Steps 2 and 3 above as part of the "poweroff" + phase. VMBus channels are closed and the top-level VMBus + connection is unloaded. +9. Linux PM disables non-boot CPUs, and then enters ACPI sleep state + S4. Hibernation is now complete. + +Detailed Resume Sequence +------------------------ +1. The guest VM boots into a fresh Linux OS instance. During boot, + the top-level VMBus connection is established, and synthetic + devices are enabled. This happens via the normal paths that don't + involve hibernation. +2. Linux PM hibernation code reads swap space is to find and read + the hibernation image into memory. If there is no hibernation + image, then this boot becomes a normal boot. +3. If this is a resume from hibernation, the "freeze" phase is used + to shutdown VMBus devices and unload the top-level VMBus + connection in the running fresh OS instance, just like Steps 2 + and 3 in the hibernation sequence. +4. Linux PM disables non-boot CPUs, and transfers control to the + read-in hibernation image. In the now-running hibernation image, + non-boot CPUs are restarted. +5. As part of the "resume" phase, Linux PM repeats Steps 5 and 6 + from the hibernation sequence. The top-level VMBus connection is + re-established, and offers are received and matched to primary + channels in the image. Relids are updated. VMBus device resume + functions re-open primary channels and re-create sub-channels. +6. Linux PM exits the hibernation resume sequence and the VM is now + running normally from the hibernation image. + +Key-Value Pair (KVP) Pseudo-Device Anomalies +-------------------------------------------- +The VMBus KVP device behaves differently from other pseudo-devices +offered by Hyper-V. When the KVP primary channel is closed, Hyper-V +sends a rescind message, which causes all vestiges of the device to be +removed. But Hyper-V then re-offers the device, causing it to be newly +re-created. The removal and re-creation occurs during the "freeze" +phase of hibernation, so the hibernation image contains the re-created +KVP device. Similar behavior occurs during the "freeze" phase of the +resume sequence while still in the fresh OS instance. But in both +cases, the top-level VMBus connection is subsequently unloaded, which +causes the device to be discarded on the Hyper-V side. So no harm is +done and everything still works. + +Virtual PCI devices +------------------- +Virtual PCI devices are physical PCI devices that are mapped directly +into the VM's physical address space so the VM can interact directly +with the hardware. vPCI devices include those accessed via what Hyper-V +calls "Discrete Device Assignment" (DDA), as well as SR-IOV NIC +Virtual Functions (VF) devices. See Documentation/virt/hyperv/vpci.rst. + +Hyper-V DDA devices are offered to guest VMs after the top-level VMBus +connection is established, just like VMBus synthetic devices. They are +statically assigned to the VM, and their instance GUIDs don't change +unless the Hyper-V administrator makes changes to the configuration. +DDA devices are represented in Linux as virtual PCI devices that have +a VMBus identity as well as a PCI identity. Consequently, Linux guest +hibernation first handles DDA devices as VMBus devices in order to +manage the VMBus channel. 
But then they are also handled as PCI +devices using the hibernation functions implemented by their native +PCI driver. + +SR-IOV NIC VFs also have a VMBus identity as well as a PCI +identity, and overall are processed similarly to DDA devices. A +difference is that VFs are not offered to the VM during initial boot +of the VM. Instead, the VMBus synthetic NIC driver first starts +operating and communicates to Hyper-V that it is prepared to accept a +VF, and then the VF offer is made. However, the VMBus connection +might later be unloaded and then re-established without the VM being +rebooted, as happens in Steps 3 and 5 in the Detailed Hibernation +Sequence above and in the Detailed Resume Sequence. In such a case, +the VFs likely became part of the VM during initial boot, so when the +VMBus connection is re-established, the VFs are offered on the +re-established connection without intervention by the synthetic NIC driver. + +UIO Devices +----------- +A VMBus device can be exposed to user space using the Hyper-V UIO +driver (uio_hv_generic.c) so that a user space driver can control and +operate the device. However, the VMBus UIO driver does not support the +suspend and resume operations needed for hibernation. If a VMBus +device is configured to use the UIO driver, hibernating the VM fails +and Linux continues to run normally. The most common use of the Hyper-V +UIO driver is for DPDK networking, but there are other uses as well. + +Resuming on a Different VM +-------------------------- +This scenario occurs in the Azure public cloud in that a hibernated +customer VM only exists as saved configuration and disks -- the VM no +longer exists on any Hyper-V host. When the customer VM is resumed, a +new Hyper-V VM with identical configuration is created, likely on a +different Hyper-V host. That new Hyper-V VM becomes the resumed +customer VM, and the steps the Linux kernel takes to resume from the +hibernation image must work in that new VM. + +While the disks and their contents are preserved from the original VM, +the Hyper-V-provided VMBus instance GUIDs of the disk controllers and +other synthetic devices would typically be different. The difference +would cause the resume from hibernation to fail, so several things are +done to solve this problem: + +* For VMBus synthetic devices that support only a single instance, + Hyper-V always assigns the same instance GUIDs. For example, the + Hyper-V mouse, the shutdown pseudo-device, the time sync pseudo + device, etc., always have the same instance GUID, both for local + Hyper-V installs as well as in the Azure cloud. + +* VMBus synthetic SCSI controllers may have multiple instances in a + VM, and in the general case instance GUIDs vary from VM to VM. + However, Azure VMs always have exactly two synthetic SCSI + controllers, and Azure code overrides the normal Hyper-V behavior + so these controllers are always assigned the same two instance + GUIDs. Consequently, when a customer VM is resumed on a newly + created VM, the instance GUIDs match. But this guarantee does not + hold for local Hyper-V installs. + +* Similarly, VMBus synthetic NICs may have multiple instances in a + VM, and the instance GUIDs vary from VM to VM. Again, Azure code + overrides the normal Hyper-V behavior so that the instance GUID + of a synthetic NIC in a customer VM does not change, even if the + customer VM is deallocated or hibernated, and then re-constituted + on a newly created VM. As with SCSI controllers, this behavior + does not hold for local Hyper-V installs. 
+ +* vPCI devices do not have the same instance GUIDs when resuming + from hibernation on a newly created VM. Consequently, Azure does + not support hibernation for VMs that have DDA devices such as + NVMe controllers or GPUs. For SR-IOV NIC VFs, Azure removes the + VF from the VM before it hibernates so that the hibernation image + does not contain a VF device. When the VM is resumed it + instantiates a new VF, rather than trying to match against a VF + that is present in the hibernation image. Because Azure must + remove any VFs before initiating hibernation, Azure VM + hibernation must be initiated externally from the Azure Portal or + Azure CLI, which in turn uses the Shutdown integration service to + tell Linux to do the hibernation. If hibernation is self-initiated + within the Azure VM, VFs remain in the hibernation image, and are + not resumed properly. + +In summary, Azure takes special actions to remove VFs and to ensure +that VMBus device instance GUIDs match on a new/different VM, allowing +hibernation to work for most general-purpose Azure VMs sizes. While +similar special actions could be taken when resuming on a different VM +on a local Hyper-V install, orchestrating such actions is not provided +out-of-the-box by local Hyper-V and so requires custom scripting. diff --git a/Documentation/virt/hyperv/index.rst b/Documentation/virt/hyperv/index.rst index 79bc4080329e..c84c40fd61c9 100644 --- a/Documentation/virt/hyperv/index.rst +++ b/Documentation/virt/hyperv/index.rst @@ -11,4 +11,5 @@ Hyper-V Enlightenments vmbus clocks vpci + hibernation coco diff --git a/Documentation/virt/hyperv/vmbus.rst b/Documentation/virt/hyperv/vmbus.rst index 1dcef6a7fda3..654bb4849972 100644 --- a/Documentation/virt/hyperv/vmbus.rst +++ b/Documentation/virt/hyperv/vmbus.rst @@ -250,10 +250,18 @@ interrupts are not Linux IRQs, there are no entries in /proc/interrupts or /proc/irq corresponding to individual VMBus channel interrupts. An online CPU in a Linux guest may not be taken offline if it has -VMBus channel interrupts assigned to it. Any such channel -interrupts must first be manually reassigned to another CPU as -described above. When no channel interrupts are assigned to the -CPU, it can be taken offline. +VMBus channel interrupts assigned to it. Starting in kernel v6.15, +any such interrupts are automatically reassigned to some other CPU +at the time of offlining. The "other" CPU is chosen by the +implementation and is not load balanced or otherwise intelligently +determined. If the CPU is onlined again, channel interrupts previously +assigned to it are not moved back. As a result, after multiple CPUs +have been offlined, and perhaps onlined again, the interrupt-to-CPU +mapping may be scrambled and non-optimal. In such a case, optimal +assignments must be re-established manually. For kernels v6.14 and +earlier, any conflicting channel interrupts must first be manually +reassigned to another CPU as described above. Then when no channel +interrupts are assigned to the CPU, it can be taken offline. The VMBus channel interrupt handling code is designed to work correctly even if an interrupt is received on a CPU other than the @@ -324,3 +332,15 @@ rescinded, neither Hyper-V nor Linux retains any state about its previous existence. Such a device might be re-added later, in which case it is treated as an entirely new device. See vmbus_onoffer_rescind(). 
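For the manual channel-interrupt reassignment described in the vmbus.rst hunk above
(needed on kernels v6.14 and earlier before a CPU with assigned VMBus channel
interrupts can be taken offline), the per-channel "cpu" sysfs file is the control
point. The following is only an illustrative sketch in C; the sysfs path layout
follows the description earlier in vmbus.rst and should be treated as an assumption
here, and the device GUID and relid values are placeholders::

  /*
   * Illustrative sketch only: move one VMBus channel interrupt to another
   * CPU by writing the target CPU number to the channel's "cpu" sysfs file.
   * The path layout is an assumption based on vmbus.rst; dev_guid and relid
   * are placeholders supplied by the caller.
   */
  #include <stdio.h>

  static int set_channel_cpu(const char *dev_guid, int relid, int cpu)
  {
          char path[256];
          FILE *f;

          snprintf(path, sizeof(path),
                   "/sys/bus/vmbus/devices/%s/channels/%d/cpu", dev_guid, relid);
          f = fopen(path, "w");
          if (!f)
                  return -1;
          if (fprintf(f, "%d\n", cpu) < 0) {
                  fclose(f);
                  return -1;
          }
          return fclose(f);   /* 0 on success */
  }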
+ +For some devices, such as the KVP device, Hyper-V automatically +sends a rescind message when the primary channel is closed, +likely as a result of unbinding the device from its driver. +The rescind causes Linux to remove the device. But then Hyper-V +immediately reoffers the device to the guest, causing a new +instance of the device to be created in Linux. For other +devices, such as the synthetic SCSI and NIC devices, closing the +primary channel does *not* result in Hyper-V sending a rescind +message. The device continues to exist in Linux on the VMBus, +but with no driver bound to it. The same driver or a new driver +can subsequently be bound to the existing instance of the device. diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index edc070c6e19b..6aa40ee05a4a 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -7,8 +7,19 @@ The Definitive KVM (Kernel-based Virtual Machine) API Documentation 1. General description ====================== -The kvm API is a set of ioctls that are issued to control various aspects -of a virtual machine. The ioctls belong to the following classes: +The kvm API is centered around different kinds of file descriptors +and ioctls that can be issued to these file descriptors. An initial +open("/dev/kvm") obtains a handle to the kvm subsystem; this handle +can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this +handle will create a VM file descriptor which can be used to issue VM +ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will +create a virtual cpu or device and return a file descriptor pointing to +the new resource. + +In other words, the kvm API is a set of ioctls that are issued to +different kinds of file descriptor in order to control various aspects of +a virtual machine. Depending on the file descriptor that accepts them, +ioctls belong to the following classes: - System ioctls: These query and set global attributes which affect the whole kvm subsystem. In addition a system ioctl is used to create @@ -35,18 +46,19 @@ of a virtual machine. The ioctls belong to the following classes: device ioctls must be issued from the same process (address space) that was used to create the VM. -2. File descriptors -=================== +While most ioctls are specific to one kind of file descriptor, in some +cases the same ioctl can belong to more than one class. + +The KVM API grew over time. For this reason, KVM defines many constants +of the form ``KVM_CAP_*``, each corresponding to a set of functionality +provided by one or more ioctls. Availability of these "capabilities" can +be checked with :ref:`KVM_CHECK_EXTENSION <KVM_CHECK_EXTENSION>`. Some +capabilities also need to be enabled for VMs or VCPUs where their +functionality is desired (see :ref:`cap_enable` and :ref:`cap_enable_vm`). -The kvm API is centered around file descriptors. An initial -open("/dev/kvm") obtains a handle to the kvm subsystem; this handle -can be used to issue system ioctls. A KVM_CREATE_VM ioctl on this -handle will create a VM file descriptor which can be used to issue VM -ioctls. A KVM_CREATE_VCPU or KVM_CREATE_DEVICE ioctl on a VM fd will -create a virtual cpu or device and return a file descriptor pointing to -the new resource. Finally, ioctls on a vcpu or device fd can be used -to control the vcpu or device. For vcpus, this includes the important -task of actually running guest code. + +2. 
Restrictions +=============== In general file descriptors can be migrated among processes by means of fork() and the SCM_RIGHTS facility of unix domain socket. These @@ -96,12 +108,9 @@ description: Capability: which KVM extension provides this ioctl. Can be 'basic', which means that is will be provided by any kernel that supports - API version 12 (see section 4.1), a KVM_CAP_xyz constant, which - means availability needs to be checked with KVM_CHECK_EXTENSION - (see section 4.4), or 'none' which means that while not all kernels - support this ioctl, there's no capability bit to check its - availability: for kernels that don't support the ioctl, - the ioctl returns -ENOTTY. + API version 12 (see :ref:`KVM_GET_API_VERSION <KVM_GET_API_VERSION>`), + or a KVM_CAP_xyz constant that can be checked with + :ref:`KVM_CHECK_EXTENSION <KVM_CHECK_EXTENSION>`. Architectures: which instruction set architectures provide this ioctl. @@ -118,6 +127,8 @@ description: are not detailed, but errors with specific meanings are. +.. _KVM_GET_API_VERSION: + 4.1 KVM_GET_API_VERSION ----------------------- @@ -246,6 +257,8 @@ This list also varies by kvm version and host processor, but does not change otherwise. +.. _KVM_CHECK_EXTENSION: + 4.4 KVM_CHECK_EXTENSION ----------------------- @@ -288,7 +301,7 @@ the VCPU file descriptor can be mmap-ed, including: - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. For more information on - KVM_CAP_DIRTY_LOG_RING, see section 8.3. + KVM_CAP_DIRTY_LOG_RING, see :ref:`KVM_CAP_DIRTY_LOG_RING`. 4.7 KVM_CREATE_VCPU @@ -338,8 +351,8 @@ KVM_S390_SIE_PAGE_OFFSET in order to obtain a memory map of the virtual cpu's hardware control block. -4.8 KVM_GET_DIRTY_LOG (vm ioctl) --------------------------------- +4.8 KVM_GET_DIRTY_LOG +--------------------- :Capability: basic :Architectures: all @@ -987,6 +1000,10 @@ blobs in userspace. When the guest writes the MSR, kvm copies one page of a blob (32- or 64-bit, depending on the vcpu mode) to guest memory. +The MSR index must be in the range [0x40000000, 0x4fffffff], i.e. must reside +in the range that is unofficially reserved for use by hypervisors. The min/max +values are enumerated via KVM_XEN_MSR_MIN_INDEX and KVM_XEN_MSR_MAX_INDEX. + :: struct kvm_xen_hvm_config { @@ -1298,7 +1315,7 @@ See KVM_GET_VCPU_EVENTS for the data structure. :Capability: KVM_CAP_DEBUGREGS :Architectures: x86 -:Type: vm ioctl +:Type: vcpu ioctl :Parameters: struct kvm_debugregs (out) :Returns: 0 on success, -1 on error @@ -1320,7 +1337,7 @@ Reads debug registers from the vcpu. :Capability: KVM_CAP_DEBUGREGS :Architectures: x86 -:Type: vm ioctl +:Type: vcpu ioctl :Parameters: struct kvm_debugregs (in) :Returns: 0 on success, -1 on error @@ -1394,6 +1411,9 @@ the memory region are automatically reflected into the guest. For example, an mmap() that affects the region will be made visible immediately. Another example is madvise(MADV_DROP). +For TDX guest, deleting/moving memory region loses guest memory contents. +Read only region isn't supported. Only as-id 0 is supported. + Note: On arm64, a write generated by the page-table walker (to update the Access and Dirty flags, for example) never results in a KVM_EXIT_MMIO exit when the slot has the KVM_MEM_READONLY flag. This @@ -1406,7 +1426,7 @@ fetch) is injected in the guest. S390: ^^^^^ -Returns -EINVAL if the VM has the KVM_VM_S390_UCONTROL flag set. +Returns -EINVAL or -EEXIST if the VM has the KVM_VM_S390_UCONTROL flag set. 
Returns -EINVAL if called on a protected VM. 4.36 KVM_SET_TSS_ADDR @@ -1429,6 +1449,8 @@ because of a quirk in the virtualization implementation (see the internals documentation when it pops into existence). +.. _KVM_ENABLE_CAP: + 4.37 KVM_ENABLE_CAP ------------------- @@ -1810,15 +1832,18 @@ emulate them efficiently. The fields in each entry are defined as follows: the values returned by the cpuid instruction for this function/index combination -The TSC deadline timer feature (CPUID leaf 1, ecx[24]) is always returned -as false, since the feature depends on KVM_CREATE_IRQCHIP for local APIC -support. Instead it is reported via:: +x2APIC (CPUID leaf 1, ecx[21) and TSC deadline timer (CPUID leaf 1, ecx[24]) +may be returned as true, but they depend on KVM_CREATE_IRQCHIP for in-kernel +emulation of the local APIC. TSC deadline timer support is also reported via:: ioctl(KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER) if that returns true and you use KVM_CREATE_IRQCHIP, or if you emulate the feature in userspace, then you can enable the feature for KVM_SET_CPUID2. +Enabling x2APIC in KVM_SET_CPUID2 requires KVM_CREATE_IRQCHIP as KVM doesn't +support forwarding x2APIC MSR accesses to userspace, i.e. KVM does not support +emulating x2APIC in userspace. 4.47 KVM_PPC_GET_PVINFO ----------------------- @@ -1899,6 +1924,9 @@ No flags are specified so far, the corresponding field must be set to zero. #define KVM_IRQ_ROUTING_HV_SINT 4 #define KVM_IRQ_ROUTING_XEN_EVTCHN 5 +On s390, adding a KVM_IRQ_ROUTING_S390_ADAPTER is rejected on ucontrol VMs with +error -EINVAL. + flags: - KVM_MSI_VALID_DEVID: used along with KVM_IRQ_ROUTING_MSI routing entry @@ -1978,7 +2006,14 @@ frequency is KHz. If the KVM_CAP_VM_TSC_CONTROL capability is advertised, this can also be used as a vm ioctl to set the initial tsc frequency of subsequently -created vCPUs. +created vCPUs. Note, the vm ioctl is only allowed prior to creating vCPUs. + +For TSC protected Confidential Computing (CoCo) VMs where TSC frequency +is configured once at VM scope and remains unchanged during VM's +lifetime, the vm ioctl should be used to configure the TSC frequency +and the vcpu ioctl is not supported. + +Example of such CoCo VMs: TDX guests. 4.56 KVM_GET_TSC_KHZ -------------------- @@ -2116,8 +2151,8 @@ TLB, prior to calling KVM_RUN on the associated vcpu. The "bitmap" field is the userspace address of an array. This array consists of a number of bits, equal to the total number of TLB entries as -determined by the last successful call to KVM_CONFIG_TLB, rounded up to the -nearest multiple of 64. +determined by the last successful call to ``KVM_ENABLE_CAP(KVM_CAP_SW_TLB)``, +rounded up to the nearest multiple of 64. Each bit corresponds to one TLB entry, ordered the same as in the shared TLB array. @@ -2170,42 +2205,6 @@ userspace update the TCE table directly which is useful in some circumstances. -4.63 KVM_ALLOCATE_RMA ---------------------- - -:Capability: KVM_CAP_PPC_RMA -:Architectures: powerpc -:Type: vm ioctl -:Parameters: struct kvm_allocate_rma (out) -:Returns: file descriptor for mapping the allocated RMA - -This allocates a Real Mode Area (RMA) from the pool allocated at boot -time by the kernel. An RMA is a physically-contiguous, aligned region -of memory used on older POWER processors to provide the memory which -will be accessed by real-mode (MMU off) accesses in a KVM guest. -POWER processors support a set of sizes for the RMA that usually -includes 64MB, 128MB, 256MB and some larger powers of two. 
- -:: - - /* for KVM_ALLOCATE_RMA */ - struct kvm_allocate_rma { - __u64 rma_size; - }; - -The return value is a file descriptor which can be passed to mmap(2) -to map the allocated RMA into userspace. The mapped area can then be -passed to the KVM_SET_USER_MEMORY_REGION ioctl to establish it as the -RMA for a virtual machine. The size of the RMA in bytes (which is -fixed at host kernel boot time) is returned in the rma_size field of -the argument structure. - -The KVM_CAP_PPC_RMA capability is 1 or 2 if the KVM_ALLOCATE_RMA ioctl -is supported; 2 if the processor requires all virtual machines to have -an RMA, or 1 if the processor can use an RMA but doesn't require it, -because it supports the Virtual RMA (VRMA) facility. - - 4.64 KVM_NMI ------------ @@ -2602,7 +2601,7 @@ Specifically: ======================= ========= ===== ======================================= .. [1] These encodings are not accepted for SVE-enabled vcpus. See - KVM_ARM_VCPU_INIT. + :ref:`KVM_ARM_VCPU_INIT`. The equivalent register content can be accessed via bits [127:0] of the corresponding SVE Zn registers instead for vcpus that have SVE @@ -3471,7 +3470,8 @@ The initial values are defined as: - FPSIMD/NEON registers: set to 0 - SVE registers: set to 0 - System registers: Reset to their architecturally defined - values as for a warm reset to EL1 (resp. SVC) + values as for a warm reset to EL1 (resp. SVC) or EL2 (in the + case of EL2 being enabled). Note that because some registers reflect machine topology, all vcpus should be created before this ioctl is invoked. @@ -3538,6 +3538,17 @@ Possible features: - the KVM_REG_ARM64_SVE_VLS pseudo-register is immutable, and can no longer be written using KVM_SET_ONE_REG. + - KVM_ARM_VCPU_HAS_EL2: Enable Nested Virtualisation support, + booting the guest from EL2 instead of EL1. + Depends on KVM_CAP_ARM_EL2. + The VM is running with HCR_EL2.E2H being RES1 (VHE) unless + KVM_ARM_VCPU_HAS_EL2_E2H0 is also set. + + - KVM_ARM_VCPU_HAS_EL2_E2H0: Restrict Nested Virtualisation + support to HCR_EL2.E2H being RES0 (non-VHE). + Depends on KVM_CAP_ARM_EL2_E2H0. + KVM_ARM_VCPU_HAS_EL2 must also be set. + 4.83 KVM_ARM_PREFERRED_TARGET ----------------------------- @@ -3593,6 +3604,27 @@ Errors: This ioctl returns the guest registers that are supported for the KVM_GET_ONE_REG/KVM_SET_ONE_REG calls. +Note that s390 does not support KVM_GET_REG_LIST for historical reasons +(read: nobody cared). The set of registers in kernels 4.x and newer is: + +- KVM_REG_S390_TODPR + +- KVM_REG_S390_EPOCHDIFF + +- KVM_REG_S390_CPU_TIMER + +- KVM_REG_S390_CLOCK_COMP + +- KVM_REG_S390_PFTOKEN + +- KVM_REG_S390_PFCOMPARE + +- KVM_REG_S390_PFSELECT + +- KVM_REG_S390_PP + +- KVM_REG_S390_GBEA + 4.85 KVM_ARM_SET_DEVICE_ADDR (deprecated) ----------------------------------------- @@ -4758,7 +4790,7 @@ H_GET_CPU_CHARACTERISTICS hypercall. :Capability: basic :Architectures: x86 -:Type: vm +:Type: vm ioctl, vcpu ioctl :Parameters: an opaque platform specific structure (in/out) :Returns: 0 on success; -1 on error @@ -4766,9 +4798,11 @@ If the platform supports creating encrypted VMs then this ioctl can be used for issuing platform-specific memory encryption commands to manage those encrypted VMs. -Currently, this ioctl is used for issuing Secure Encrypted Virtualization -(SEV) commands on AMD Processors. The SEV commands are defined in -Documentation/virt/kvm/x86/amd-memory-encryption.rst. 
+Currently, this ioctl is used for issuing both Secure Encrypted Virtualization +(SEV) commands on AMD Processors and Trusted Domain Extensions (TDX) commands +on Intel Processors. The detailed commands are defined in +Documentation/virt/kvm/x86/amd-memory-encryption.rst and +Documentation/virt/kvm/x86/intel-tdx.rst. 4.111 KVM_MEMORY_ENCRYPT_REG_REGION ----------------------------------- @@ -4956,8 +4990,8 @@ Coalesced pio is based on coalesced mmio. There is little difference between coalesced mmio and pio except that coalesced pio records accesses to I/O ports. -4.117 KVM_CLEAR_DIRTY_LOG (vm ioctl) ------------------------------------- +4.117 KVM_CLEAR_DIRTY_LOG +------------------------- :Capability: KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 :Architectures: x86, arm64, mips @@ -5093,8 +5127,8 @@ Recognised values for feature: Finalizes the configuration of the specified vcpu feature. The vcpu must already have been initialised, enabling the affected feature, by -means of a successful KVM_ARM_VCPU_INIT call with the appropriate flag set in -features[]. +means of a successful :ref:`KVM_ARM_VCPU_INIT <KVM_ARM_VCPU_INIT>` call with the +appropriate flag set in features[]. For affected vcpu features, this is a mandatory step that must be performed before the vcpu is fully usable. @@ -5266,7 +5300,7 @@ the cpu reset definition in the POP (Principles Of Operation). 4.123 KVM_S390_INITIAL_RESET ---------------------------- -:Capability: none +:Capability: basic :Architectures: s390 :Type: vcpu ioctl :Parameters: none @@ -5574,7 +5608,7 @@ KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA in guest physical address space. This attribute should be used in preference to KVM_XEN_ATTR_TYPE_SHARED_INFO as it avoids unnecessary invalidation of an internal cache when the page is - re-mapped in guest physcial address space. + re-mapped in guest physical address space. Setting the hva to zero will disable the shared_info page. @@ -6205,7 +6239,7 @@ applied. .. _KVM_ARM_GET_REG_WRITABLE_MASKS: 4.139 KVM_ARM_GET_REG_WRITABLE_MASKS -------------------------------------------- +------------------------------------ :Capability: KVM_CAP_ARM_SUPPORTED_REG_MASK_RANGES :Architectures: arm64 @@ -6443,6 +6477,8 @@ the capability to be present. `flags` must currently be zero. +.. _kvm_run: + 5. The kvm_run structure ======================== @@ -6616,7 +6652,8 @@ to the byte array. .. note:: For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR, KVM_EXIT_XEN, - KVM_EXIT_EPR, KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding + KVM_EXIT_EPR, KVM_EXIT_HYPERCALL, KVM_EXIT_TDX, + KVM_EXIT_X86_RDMSR and KVM_EXIT_X86_WRMSR the corresponding operations are complete (and guest state is consistent) only after userspace has re-entered the kernel with KVM_RUN. The kernel side will first finish incomplete operations and then check for pending signals. @@ -6815,6 +6852,7 @@ should put the acknowledged interrupt vector into the 'epr' field. #define KVM_SYSTEM_EVENT_WAKEUP 4 #define KVM_SYSTEM_EVENT_SUSPEND 5 #define KVM_SYSTEM_EVENT_SEV_TERM 6 + #define KVM_SYSTEM_EVENT_TDX_FATAL 7 __u32 type; __u32 ndata; __u64 data[16]; @@ -6841,6 +6879,11 @@ Valid values for 'type' are: reset/shutdown of the VM. - KVM_SYSTEM_EVENT_SEV_TERM -- an AMD SEV guest requested termination. The guest physical address of the guest's GHCB is stored in `data[0]`. + - KVM_SYSTEM_EVENT_TDX_FATAL -- a TDX guest reported a fatal error state. 
+ KVM doesn't do any parsing or conversion, it just dumps 16 general-purpose + registers to userspace, in ascending order of the 4-bit indices for x86-64 + general-purpose registers in instruction encoding, as defined in the Intel + SDM. - KVM_SYSTEM_EVENT_WAKEUP -- the exiting vCPU is in a suspended state and KVM has recognized a wakeup event. Userspace may honor this event by marking the exiting vCPU as runnable, or deny it and call KVM_RUN again. @@ -6855,6 +6898,10 @@ the first `ndata` items (possibly zero) of the data array are valid. the guest issued a SYSTEM_RESET2 call according to v1.1 of the PSCI specification. + - for arm64, data[0] is set to KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2 + if the guest issued a SYSTEM_OFF2 call according to v1.3 of the PSCI + specification. + - for RISC-V, data[0] is set to the value of the second argument of the ``sbi_system_reset`` call. @@ -6888,6 +6935,12 @@ either: - Deny the guest request to suspend the VM. See ARM DEN0022D.b 5.19.2 "Caller responsibilities" for possible return values. +Hibernation using the PSCI SYSTEM_OFF2 call is enabled when PSCI v1.3 +is enabled. If a guest invokes the PSCI SYSTEM_OFF2 function, KVM will +exit to userspace with the KVM_SYSTEM_EVENT_SHUTDOWN event type and with +data[0] set to KVM_SYSTEM_EVENT_SHUTDOWN_FLAG_PSCI_OFF2. The only +supported hibernate type for the SYSTEM_OFF2 function is HIBERNATE_OFF. + :: /* KVM_EXIT_IOAPIC_EOI */ @@ -7131,6 +7184,69 @@ The valid value for 'flags' is: :: + /* KVM_EXIT_TDX */ + struct { + __u64 flags; + __u64 nr; + union { + struct { + u64 ret; + u64 data[5]; + } unknown; + struct { + u64 ret; + u64 gpa; + u64 size; + } get_quote; + struct { + u64 ret; + u64 leaf; + u64 r11, r12, r13, r14; + } get_tdvmcall_info; + struct { + u64 ret; + u64 vector; + } setup_event_notify; + }; + } tdx; + +Process a TDVMCALL from the guest. KVM forwards select TDVMCALL based +on the Guest-Hypervisor Communication Interface (GHCI) specification; +KVM bridges these requests to the userspace VMM with minimal changes, +placing the inputs in the union and copying them back to the guest +on re-entry. + +Flags are currently always zero, whereas ``nr`` contains the TDVMCALL +number from register R11. The remaining field of the union provide the +inputs and outputs of the TDVMCALL. Currently the following values of +``nr`` are defined: + + * ``TDVMCALL_GET_QUOTE``: the guest has requested to generate a TD-Quote + signed by a service hosting TD-Quoting Enclave operating on the host. + Parameters and return value are in the ``get_quote`` field of the union. + The ``gpa`` field and ``size`` specify the guest physical address + (without the shared bit set) and the size of a shared-memory buffer, in + which the TDX guest passes a TD Report. The ``ret`` field represents + the return value of the GetQuote request. When the request has been + queued successfully, the TDX guest can poll the status field in the + shared-memory area to check whether the Quote generation is completed or + not. When completed, the generated Quote is returned via the same buffer. + + * ``TDVMCALL_GET_TD_VM_CALL_INFO``: the guest has requested the support + status of TDVMCALLs. The output values for the given leaf should be + placed in fields from ``r11`` to ``r14`` of the ``get_tdvmcall_info`` + field of the union. + + * ``TDVMCALL_SETUP_EVENT_NOTIFY_INTERRUPT``: the guest has requested to + set up a notification interrupt for vector ``vector``. 
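As an illustration of how a VMM might service these exits, the sketch below
dispatches on ``nr`` and fills the union before re-entering the guest with
KVM_RUN. It is only a sketch: it assumes a UAPI with KVM TDX support, the
``TDVMCALL_*`` leaf constants are assumed to come from local headers or the GHCI
specification, the "success" status written to ``ret`` is an assumption, and
``queue_get_quote()`` is a hypothetical VMM helper::

  /* Minimal sketch: handle a KVM_EXIT_TDX exit before re-entering KVM_RUN. */
  #include <linux/kvm.h>

  /* Hypothetical VMM helper that services a GetQuote request. */
  extern __u64 queue_get_quote(__u64 gpa, __u64 size);

  static void handle_tdx_exit(struct kvm_run *run)
  {
          switch (run->tdx.nr) {
          case TDVMCALL_GET_TD_VM_CALL_INFO:
                  /* Report which optional TDVMCALLs this VMM handles. */
                  run->tdx.get_tdvmcall_info.r11 = 0;
                  run->tdx.get_tdvmcall_info.r12 = 0;
                  run->tdx.get_tdvmcall_info.r13 = 0;
                  run->tdx.get_tdvmcall_info.r14 = 0;
                  run->tdx.get_tdvmcall_info.ret = 0;   /* assumed success code */
                  break;
          case TDVMCALL_GET_QUOTE:
                  /* Queue quote generation against the shared buffer. */
                  run->tdx.get_quote.ret =
                          queue_get_quote(run->tdx.get_quote.gpa,
                                          run->tdx.get_quote.size);
                  break;
          default:
                  /* Leave unknown.ret as filled in by KVM. */
                  break;
          }
          /* The caller re-enters the guest with KVM_RUN to complete the call. */
  }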
+ +KVM may add support for more values in the future that may cause a userspace +exit, even without calls to ``KVM_ENABLE_CAP`` or similar. In this case, +it will enter with output fields already valid; in the common case, the +``unknown.ret`` field of the union will be ``TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED``. +Userspace need not do anything if it does not wish to support a TDVMCALL. +:: + /* Fix the size of the union. */ char padding[256]; }; @@ -7162,11 +7278,15 @@ primary storage for certain register types. Therefore, the kernel may use the values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set. +.. _cap_enable: + 6. Capabilities that can be enabled on vCPUs ============================================ There are certain capabilities that change the behavior of the virtual CPU or -the virtual machine when enabled. To enable them, please see section 4.37. +the virtual machine when enabled. To enable them, please see +:ref:`KVM_ENABLE_CAP`. + Below you can find a list of capabilities and what their effect on the vCPU or the virtual machine is when enabling them. @@ -7375,7 +7495,7 @@ KVM API and also from the guest. sets are supported (bitfields defined in arch/x86/include/uapi/asm/kvm.h). -As described above in the kvm_sync_regs struct info in section 5 (kvm_run): +As described above in the kvm_sync_regs struct info in section :ref:`kvm_run`, KVM_CAP_SYNC_REGS "allow[s] userspace to access certain guest registers without having to call SET/GET_*REGS". This reduces overhead by eliminating repeated ioctl calls for setting and/or getting register values. This is @@ -7421,13 +7541,84 @@ Unused bitfields in the bitarrays must be set to zero. This capability connects the vcpu to an in-kernel XIVE device. +6.76 KVM_CAP_HYPERV_SYNIC +------------------------- + +:Architectures: x86 +:Target: vcpu + +This capability, if KVM_CHECK_EXTENSION indicates that it is +available, means that the kernel has an implementation of the +Hyper-V Synthetic interrupt controller(SynIC). Hyper-V SynIC is +used to support Windows Hyper-V based guest paravirt drivers(VMBus). + +In order to use SynIC, it has to be activated by setting this +capability via KVM_ENABLE_CAP ioctl on the vcpu fd. Note that this +will disable the use of APIC hardware virtualization even if supported +by the CPU, as it's incompatible with SynIC auto-EOI behavior. + +6.77 KVM_CAP_HYPERV_SYNIC2 +-------------------------- + +:Architectures: x86 +:Target: vcpu + +This capability enables a newer version of Hyper-V Synthetic interrupt +controller (SynIC). The only difference with KVM_CAP_HYPERV_SYNIC is that KVM +doesn't clear SynIC message and event flags pages when they are enabled by +writing to the respective MSRs. + +6.78 KVM_CAP_HYPERV_DIRECT_TLBFLUSH +----------------------------------- + +:Architectures: x86 +:Target: vcpu + +This capability indicates that KVM running on top of Hyper-V hypervisor +enables Direct TLB flush for its guests meaning that TLB flush +hypercalls are handled by Level 0 hypervisor (Hyper-V) bypassing KVM. +Due to the different ABI for hypercall parameters between Hyper-V and +KVM, enabling this capability effectively disables all hypercall +handling by KVM (as some KVM hypercall may be mistakenly treated as TLB +flush hypercalls by Hyper-V) so userspace should disable KVM identification +in CPUID and only exposes Hyper-V identification. In this case, guest +thinks it's running on Hyper-V and only use Hyper-V hypercalls. 
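For reference, enabling one of these per-vCPU capabilities is an ordinary
KVM_ENABLE_CAP ioctl on the vCPU file descriptor. A minimal sketch using
KVM_CAP_HYPERV_SYNIC2 (availability should first be checked with
KVM_CHECK_EXTENSION; error handling elided)::

  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Minimal sketch: enable Hyper-V SynIC (v2) on an existing vCPU fd. */
  static int enable_synic2(int vcpu_fd)
  {
          struct kvm_enable_cap cap;

          memset(&cap, 0, sizeof(cap));
          cap.cap = KVM_CAP_HYPERV_SYNIC2;
          /* This capability takes no arguments; args[] must stay zero. */
          return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
  }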
+ +6.79 KVM_CAP_HYPERV_ENFORCE_CPUID +--------------------------------- + +:Architectures: x86 +:Target: vcpu + +When enabled, KVM will disable emulated Hyper-V features provided to the +guest according to the bits Hyper-V CPUID feature leaves. Otherwise, all +currently implemented Hyper-V features are provided unconditionally when +Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001) +leaf. + +6.80 KVM_CAP_ENFORCE_PV_FEATURE_CPUID +------------------------------------- + +:Architectures: x86 +:Target: vcpu + +When enabled, KVM will disable paravirtual features provided to the +guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf +(0x40000001). Otherwise, a guest may use the paravirtual features +regardless of what has actually been exposed through the CPUID leaf. + +.. _KVM_CAP_DIRTY_LOG_RING: + + +.. _cap_enable_vm: + 7. Capabilities that can be enabled on VMs ========================================== There are certain capabilities that change the behavior of the virtual -machine when enabled. To enable them, please see section 4.37. Below -you can find a list of capabilities and what their effect on the VM -is when enabling them. +machine when enabled. To enable them, please see section +:ref:`KVM_ENABLE_CAP`. Below you can find a list of capabilities and +what their effect on the VM is when enabling them. The following information is provided along with the description: @@ -7652,6 +7843,7 @@ branch to guests' 0x200 interrupt vector. :Architectures: x86 :Parameters: args[0] defines which exits are disabled :Returns: 0 on success, -EINVAL when args[0] contains invalid exits + or if any vCPUs have already been created Valid bits in args[0] are:: @@ -7659,6 +7851,7 @@ Valid bits in args[0] are:: #define KVM_X86_DISABLE_EXITS_HLT (1 << 1) #define KVM_X86_DISABLE_EXITS_PAUSE (1 << 2) #define KVM_X86_DISABLE_EXITS_CSTATE (1 << 3) + #define KVM_X86_DISABLE_EXITS_APERFMPERF (1 << 4) Enabling this capability on a VM provides userspace with a way to no longer intercept some instructions for improved latency in some @@ -7669,6 +7862,28 @@ all such vmexits. Do not enable KVM_FEATURE_PV_UNHALT if you disable HLT exits. +Virtualizing the ``IA32_APERF`` and ``IA32_MPERF`` MSRs requires more +than just disabling APERF/MPERF exits. While both Intel and AMD +document strict usage conditions for these MSRs--emphasizing that only +the ratio of their deltas over a time interval (T0 to T1) is +architecturally defined--simply passing through the MSRs can still +produce an incorrect ratio. + +This erroneous ratio can occur if, between T0 and T1: + +1. The vCPU thread migrates between logical processors. +2. Live migration or suspend/resume operations take place. +3. Another task shares the vCPU's logical processor. +4. C-states lower than C0 are emulated (e.g., via HLT interception). +5. The guest TSC frequency doesn't match the host TSC frequency. + +Due to these complexities, KVM does not automatically associate this +passthrough capability with the guest CPUID bit, +``CPUID.6:ECX.APERFMPERF[bit 0]``. Userspace VMMs that deem this +mechanism adequate for virtualizing the ``IA32_APERF`` and +``IA32_MPERF`` MSRs must set the guest CPUID bit explicitly. + + 7.14 KVM_CAP_S390_HPAGE_1M -------------------------- @@ -7880,6 +8095,11 @@ apply some other policy-based mitigation. When exiting to userspace, KVM sets KVM_RUN_X86_BUS_LOCK in vcpu-run->flags, and conditionally sets the exit_reason to KVM_EXIT_X86_BUS_LOCK. 
+Due to differences in the underlying hardware implementation, the vCPU's RIP at +the time of exit diverges between Intel and AMD. On Intel hosts, RIP points at +the next instruction, i.e. the exit is trap-like. On AMD hosts, RIP points at +the offending instruction, i.e. the exit is fault-like. + Note! Detected bus locks may be coincident with other exits to userspace, i.e. KVM_RUN_X86_BUS_LOCK should be checked regardless of the primary exit reason if userspace wants to take action on all detected bus locks. @@ -7898,10 +8118,10 @@ by POWER10 processor. 7.24 KVM_CAP_VM_COPY_ENC_CONTEXT_FROM ------------------------------------- -Architectures: x86 SEV enabled -Type: vm -Parameters: args[0] is the fd of the source vm -Returns: 0 on success; ENOTTY on error +:Architectures: x86 SEV enabled +:Type: vm +:Parameters: args[0] is the fd of the source vm +:Returns: 0 on success; ENOTTY on error This capability enables userspace to copy encryption context from the vm indicated by the fd to the vm this is called on. @@ -7934,24 +8154,6 @@ default. See Documentation/arch/x86/sgx.rst for more details. -7.26 KVM_CAP_PPC_RPT_INVALIDATE -------------------------------- - -:Capability: KVM_CAP_PPC_RPT_INVALIDATE -:Architectures: ppc -:Type: vm - -This capability indicates that the kernel is capable of handling -H_RPT_INVALIDATE hcall. - -In order to enable the use of H_RPT_INVALIDATE in the guest, -user space might have to advertise it for the guest. For example, -IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is -present in the "ibm,hypertas-functions" device-tree property. - -This capability is enabled for hypervisors on platforms like POWER9 -that support radix MMU. - 7.27 KVM_CAP_EXIT_ON_EMULATION_FAILURE -------------------------------------- @@ -8009,24 +8211,9 @@ indicated by the fd to the VM this is called on. This is intended to support intra-host migration of VMs between userspace VMMs, upgrading the VMM process without interrupting the guest. -7.30 KVM_CAP_PPC_AIL_MODE_3 -------------------------------- - -:Capability: KVM_CAP_PPC_AIL_MODE_3 -:Architectures: ppc -:Type: vm - -This capability indicates that the kernel supports the mode 3 setting for the -"Address Translation Mode on Interrupt" aka "Alternate Interrupt Location" -resource that is controlled with the H_SET_MODE hypercall. - -This capability allows a guest kernel to use a better-performance mode for -handling interrupts and system calls. - 7.31 KVM_CAP_DISABLE_QUIRKS2 ---------------------------- -:Capability: KVM_CAP_DISABLE_QUIRKS2 :Parameters: args[0] - set of KVM quirks to disable :Architectures: x86 :Type: vm @@ -8107,6 +8294,50 @@ KVM_X86_QUIRK_SLOT_ZAP_ALL By default, for KVM_X86_DEFAULT_VM VMs, KVM or moved memslot isn't reachable, i.e KVM _may_ invalidate only SPTEs related to the memslot. + +KVM_X86_QUIRK_STUFF_FEATURE_MSRS By default, at vCPU creation, KVM sets the + vCPU's MSR_IA32_PERF_CAPABILITIES (0x345), + MSR_IA32_ARCH_CAPABILITIES (0x10a), + MSR_PLATFORM_INFO (0xce), and all VMX MSRs + (0x480..0x492) to the maximal capabilities + supported by KVM. KVM also sets + MSR_IA32_UCODE_REV (0x8b) to an arbitrary + value (which is different for Intel vs. + AMD). Lastly, when guest CPUID is set (by + userspace), KVM modifies select VMX MSR + fields to force consistency between guest + CPUID and L2's effective ISA. When this + quirk is disabled, KVM zeroes the vCPU's MSR + values (with two exceptions, see below), + i.e. 
treats the feature MSRs like CPUID + leaves and gives userspace full control of + the vCPU model definition. This quirk does + not affect VMX MSRs CR0/CR4_FIXED1 (0x487 + and 0x489), as KVM does now allow them to + be set by userspace (KVM sets them based on + guest CPUID, for safety purposes). + +KVM_X86_QUIRK_IGNORE_GUEST_PAT By default, on Intel platforms, KVM ignores + guest PAT and forces the effective memory + type to WB in EPT. The quirk is not available + on Intel platforms which are incapable of + safely honoring guest PAT (i.e., without CPU + self-snoop, KVM always ignores guest PAT and + forces effective memory type to WB). It is + also ignored on AMD platforms or, on Intel, + when a VM has non-coherent DMA devices + assigned; KVM always honors guest PAT in + such case. The quirk is needed to avoid + slowdowns on certain Intel Xeon platforms + (e.g. ICX, SPR) where self-snoop feature is + supported but UC is slow enough to cause + issues with some older guests that use + UC instead of WC to map the video RAM. + Userspace can disable the quirk to honor + guest PAT if it knows that there is no such + guest software, for example if it does not + expose a bochs graphics device (which is + known to have had a buggy driver). =================================== ============================================ 7.32 KVM_CAP_MAX_VCPU_ID @@ -8159,27 +8390,6 @@ This capability is aimed to mitigate the threat that malicious VMs can cause CPU stuck (due to event windows don't open up) and make the CPU unavailable to host or other VMs. -7.34 KVM_CAP_MEMORY_FAULT_INFO ------------------------------- - -:Architectures: x86 -:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. - -The presence of this capability indicates that KVM_RUN will fill -kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if -there is a valid memslot but no backing VMA for the corresponding host virtual -address. - -The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns -an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set -to KVM_EXIT_MEMORY_FAULT. - -Note: Userspaces which attempt to resolve memory faults so that they can retry -KVM_RUN are encouraged to guard against repeatedly receiving the same -error/annotated fault. - -See KVM_EXIT_MEMORY_FAULT for more information. - 7.35 KVM_CAP_X86_APIC_BUS_CYCLES_NS ----------------------------------- @@ -8197,19 +8407,260 @@ by KVM_CHECK_EXTENSION. Note: Userspace is responsible for correctly configuring CPUID 0x15, a.k.a. the core crystal clock frequency, if a non-zero CPUID 0x15 is exposed to the guest. -7.36 KVM_CAP_X86_GUEST_MODE ------------------------------- +7.36 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL +---------------------------------------------------------- + +:Architectures: x86, arm64, riscv +:Type: vm +:Parameters: args[0] - size of the dirty log ring + +KVM is capable of tracking dirty memory using ring buffers that are +mmapped into userspace; there is one dirty ring per vcpu. + +The dirty ring is available to userspace as an array of +``struct kvm_dirty_gfn``. 
Each dirty entry is defined as:: + + struct kvm_dirty_gfn { + __u32 flags; + __u32 slot; /* as_id | slot_id */ + __u64 offset; + }; + +The following values are defined for the flags field to define the +current state of the entry:: + + #define KVM_DIRTY_GFN_F_DIRTY BIT(0) + #define KVM_DIRTY_GFN_F_RESET BIT(1) + #define KVM_DIRTY_GFN_F_MASK 0x3 + +Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM +ioctl to enable this capability for the new guest and set the size of +the rings. Enabling the capability is only allowed before creating any +vCPU, and the size of the ring must be a power of two. The larger the +ring buffer, the less likely the ring is full and the VM is forced to +exit to userspace. The optimal size depends on the workload, but it is +recommended that it be at least 64 KiB (4096 entries). + +Just like for dirty page bitmaps, the buffer tracks writes to +all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was +set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered +with the flag set, userspace can start harvesting dirty pages from the +ring buffer. + +An entry in the ring buffer can be unused (flag bits ``00``), +dirty (flag bits ``01``) or harvested (flag bits ``1X``). The +state machine for the entry is as follows:: + + dirtied harvested reset + 00 -----------> 01 -------------> 1X -------+ + ^ | + | | + +------------------------------------------+ + +To harvest the dirty pages, userspace accesses the mmapped ring buffer +to read the dirty GFNs. If the flags has the DIRTY bit set (at this stage +the RESET bit must be cleared), then it means this GFN is a dirty GFN. +The userspace should harvest this GFN and mark the flags from state +``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set +to show that this GFN is harvested and waiting for a reset), and move +on to the next GFN. The userspace should continue to do this until the +flags of a GFN have the DIRTY bit cleared, meaning that it has harvested +all the dirty GFNs that were available. + +Note that on weakly ordered architectures, userspace accesses to the +ring buffer (and more specifically the 'flags' field) must be ordered, +using load-acquire/store-release accessors when available, or any +other memory barrier that will ensure this ordering. + +It's not necessary for userspace to harvest the all dirty GFNs at once. +However it must collect the dirty GFNs in sequence, i.e., the userspace +program cannot skip one dirty GFN to collect the one next to it. + +After processing one or more entries in the ring buffer, userspace +calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about +it, so that the kernel will reprotect those collected GFNs. +Therefore, the ioctl must be called *before* reading the content of +the dirty pages. + +The dirty ring can get full. When it happens, the KVM_RUN of the +vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL. + +The dirty ring interface has a major difference comparing to the +KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from +userspace, it's still possible that the kernel has not yet flushed the +processor's dirty page buffers into the kernel buffer (with dirty bitmaps, the +flushing is done by the KVM_GET_DIRTY_LOG ioctl). To achieve that, one +needs to kick the vcpu out of KVM_RUN using a signal. The resulting +vmexit ensures that all dirty GFNs are flushed to the dirty rings. 
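To make the harvesting procedure above concrete, the sketch below walks one
vCPU's mmapped ring with acquire/release accessors and then calls
KVM_RESET_DIRTY_RINGS. It is only a sketch: it assumes the ring was mmapped from
the vCPU fd at KVM_DIRTY_LOG_PAGE_OFFSET * page_size when the capability was
enabled, that ``ring_entries`` is the number of ``struct kvm_dirty_gfn`` entries
in the configured ring (a power of two), and that ``mark_page_dirty_in_vmm()`` is
a hypothetical VMM helper::

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Hypothetical VMM helper that records a dirty page for migration. */
  extern void mark_page_dirty_in_vmm(__u32 slot, __u64 offset);

  /*
   * Minimal sketch: harvest dirty GFNs from one vCPU's mmapped dirty ring.
   * 'fetch' is a userspace-maintained cursor of the next entry to examine.
   */
  static void harvest_dirty_ring(int vm_fd, struct kvm_dirty_gfn *ring,
                                 unsigned int ring_entries, unsigned int *fetch)
  {
          for (;;) {
                  struct kvm_dirty_gfn *e = &ring[*fetch % ring_entries];
                  __u32 flags = __atomic_load_n(&e->flags, __ATOMIC_ACQUIRE);

                  if (!(flags & KVM_DIRTY_GFN_F_DIRTY))
                          break;          /* no more dirty entries to collect */

                  mark_page_dirty_in_vmm(e->slot, e->offset);

                  /* Mark the entry harvested: 01b -> 1Xb (set the RESET bit). */
                  __atomic_store_n(&e->flags, KVM_DIRTY_GFN_F_RESET,
                                   __ATOMIC_RELEASE);
                  (*fetch)++;
          }

          /* Have KVM reprotect the harvested GFNs before their data is read. */
          ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);
  }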
+ +NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that +should be exposed by weakly ordered architecture, in order to indicate +the additional memory ordering requirements imposed on userspace when +reading the state of an entry and mutating it from DIRTY to HARVESTED. +Architecture with TSO-like ordering (such as x86) are allowed to +expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL +to userspace. + +After enabling the dirty rings, the userspace needs to detect the +capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the +ring structures can be backed by per-slot bitmaps. With this capability +advertised, it means the architecture can dirty guest pages without +vcpu/ring context, so that some of the dirty information will still be +maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP +can't be enabled if the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL +hasn't been enabled, or any memslot has been existing. + +Note that the bitmap here is only a backup of the ring structure. The +use of the ring and bitmap combination is only beneficial if there is +only a very small amount of memory that is dirtied out of vcpu/ring +context. Otherwise, the stand-alone per-slot bitmap mechanism needs to +be considered. + +To collect dirty bits in the backup bitmap, userspace can use the same +KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all +the generation of the dirty bits is done in a single pass. Collecting +the dirty bitmap should be the very last thing that the VMM does before +considering the state as complete. VMM needs to ensure that the dirty +state is final and avoid missing dirty pages from another ioctl ordered +after the bitmap collection. + +NOTE: Multiple examples of using the backup bitmap: (1) save vgic/its +tables through command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} on +KVM device "kvm-arm-vgic-its". (2) restore vgic/its tables through +command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_RESTORE_TABLES} on KVM device +"kvm-arm-vgic-its". VGICv3 LPI pending status is restored. (3) save +vgic3 pending table through KVM_DEV_ARM_VGIC_{GRP_CTRL, SAVE_PENDING_TABLES} +command on KVM device "kvm-arm-vgic-v3". + +7.37 KVM_CAP_PMU_CAPABILITY +--------------------------- :Architectures: x86 -:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. +:Type: vm +:Parameters: arg[0] is bitmask of PMU virtualization capabilities. +:Returns: 0 on success, -EINVAL when arg[0] contains invalid bits -The presence of this capability indicates that KVM_RUN will update the -KVM_RUN_X86_GUEST_MODE bit in kvm_run.flags to indicate whether the -vCPU was executing nested guest code when it exited. +This capability alters PMU virtualization in KVM. -KVM exits with the register state of either the L1 or L2 guest -depending on which executed at the time of an exit. Userspace must -take care to differentiate between these cases. +Calling KVM_CHECK_EXTENSION for this capability returns a bitmask of +PMU virtualization capabilities that can be adjusted on a VM. + +The argument to KVM_ENABLE_CAP is also a bitmask and selects specific +PMU virtualization capabilities to be applied to the VM. This can +only be invoked on a VM prior to the creation of VCPUs. + +At this time, KVM_PMU_CAP_DISABLE is the only capability. Setting +this capability will disable PMU virtualization for that VM. Usermode +should adjust CPUID leaf 0xA to reflect that the PMU is disabled. 
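+
+As a sketch of how a VMM might use this capability (illustrative only; the
+helper name is made up, ``vm_fd`` is assumed to be an open VM file
+descriptor, and the call must happen before any vCPU is created)::
+
+  #include <linux/kvm.h>
+  #include <string.h>
+  #include <sys/ioctl.h>
+
+  static int disable_pmu_virtualization(int vm_fd)
+  {
+          struct kvm_enable_cap cap;
+          int supported;
+
+          /* Bitmask of PMU virtualization capabilities that can be adjusted. */
+          supported = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PMU_CAPABILITY);
+          if (supported <= 0 || !(supported & KVM_PMU_CAP_DISABLE))
+                  return -1;
+
+          memset(&cap, 0, sizeof(cap));
+          cap.cap = KVM_CAP_PMU_CAPABILITY;
+          cap.args[0] = KVM_PMU_CAP_DISABLE;
+
+          return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
+  }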
+ +7.38 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES +------------------------------------- + +:Architectures: x86 +:Type: vm +:Parameters: arg[0] must be 0. +:Returns: 0 on success, -EPERM if the userspace process does not + have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been + created. + +This capability disables the NX huge pages mitigation for iTLB MULTIHIT. + +The capability has no effect if the nx_huge_pages module parameter is not set. + +This capability may only be set before any vCPUs are created. + +7.39 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE +--------------------------------------- + +:Architectures: arm64 +:Type: vm +:Parameters: arg[0] is the new split chunk size. +:Returns: 0 on success, -EINVAL if any memslot was already created. + +This capability sets the chunk size used in Eager Page Splitting. + +Eager Page Splitting improves the performance of dirty-logging (used +in live migrations) when guest memory is backed by huge-pages. It +avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing +it eagerly when enabling dirty logging (with the +KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using +KVM_CLEAR_DIRTY_LOG. + +The chunk size specifies how many pages to break at a time, using a +single allocation for each chunk. Bigger the chunk size, more pages +need to be allocated ahead of time. + +The chunk size needs to be a valid block size. The list of acceptable +block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a +64-bit bitmap (each bit describing a block size). The default value is +0, to disable the eager page splitting. + +7.40 KVM_CAP_EXIT_HYPERCALL +--------------------------- + +:Architectures: x86 +:Type: vm + +This capability, if enabled, will cause KVM to exit to userspace +with KVM_EXIT_HYPERCALL exit reason to process some hypercalls. + +Calling KVM_CHECK_EXTENSION for this capability will return a bitmask +of hypercalls that can be configured to exit to userspace. +Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE. + +The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset +of the result of KVM_CHECK_EXTENSION. KVM will forward to userspace +the hypercalls whose corresponding bit is in the argument, and return +ENOSYS for the others. + +7.41 KVM_CAP_ARM_SYSTEM_SUSPEND +------------------------------- + +:Architectures: arm64 +:Type: vm + +When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of +type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request. + +7.42 KVM_CAP_ARM_WRITABLE_IMP_ID_REGS +------------------------------------- + +:Architectures: arm64 +:Target: VM +:Parameters: None +:Returns: 0 on success, -EINVAL if vCPUs have been created before enabling this + capability. + +This capability changes the behavior of the registers that identify a PE +implementation of the Arm architecture: MIDR_EL1, REVIDR_EL1, and AIDR_EL1. +By default, these registers are visible to userspace but treated as invariant. + +When this capability is enabled, KVM allows userspace to change the +aforementioned registers before the first KVM_RUN. These registers are VM +scoped, meaning that the same set of values are presented on all vCPUs in a +given VM. + +7.43 KVM_CAP_RISCV_MP_STATE_RESET +--------------------------------- + +:Architectures: riscv +:Type: VM +:Parameters: None +:Returns: 0 on success, -EINVAL if arg[0] is not zero + +When this capability is enabled, KVM resets the VCPU when setting +MP_STATE_INIT_RECEIVED through IOCTL. The original MP_STATE is preserved. 
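+
+A minimal sketch of how a RISC-V VMM might use this (assuming ``vm_fd`` and
+``vcpu_fd`` are already-open VM and vCPU file descriptors; the helper name
+is illustrative)::
+
+  #include <linux/kvm.h>
+  #include <string.h>
+  #include <sys/ioctl.h>
+
+  static int reset_vcpu_via_mp_state(int vm_fd, int vcpu_fd)
+  {
+          struct kvm_enable_cap cap;
+          struct kvm_mp_state mp_state;
+          int ret;
+
+          /* Enabling the capability once per VM is sufficient. */
+          memset(&cap, 0, sizeof(cap));
+          cap.cap = KVM_CAP_RISCV_MP_STATE_RESET;  /* args[0] must stay zero */
+
+          ret = ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
+          if (ret)
+                  return ret;
+
+          /* With the capability enabled, this also resets the vCPU state. */
+          memset(&mp_state, 0, sizeof(mp_state));
+          mp_state.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+          return ioctl(vcpu_fd, KVM_SET_MP_STATE, &mp_state);
+  }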
+ +7.43 KVM_CAP_ARM_CACHEABLE_PFNMAP_SUPPORTED +------------------------------------------- + +:Architectures: arm64 +:Target: VM +:Parameters: None + +This capability indicate to the userspace whether a PFNMAP memory region +can be safely mapped as cacheable. This relies on the presence of +force write back (FWB) feature support on the hardware. 8. Other capabilities. ====================== @@ -8228,21 +8679,6 @@ H_RANDOM hypercall backed by a hardware random-number generator. If present, the kernel H_RANDOM handler can be enabled for guest use with the KVM_CAP_PPC_ENABLE_HCALL capability. -8.2 KVM_CAP_HYPERV_SYNIC ------------------------- - -:Architectures: x86 - -This capability, if KVM_CHECK_EXTENSION indicates that it is -available, means that the kernel has an implementation of the -Hyper-V Synthetic interrupt controller(SynIC). Hyper-V SynIC is -used to support Windows Hyper-V based guest paravirt drivers(VMBus). - -In order to use SynIC, it has to be activated by setting this -capability via KVM_ENABLE_CAP ioctl on the vcpu fd. Note that this -will disable the use of APIC hardware virtualization even if supported -by the CPU, as it's incompatible with SynIC auto-EOI behavior. - 8.3 KVM_CAP_PPC_MMU_RADIX ------------------------- @@ -8293,20 +8729,6 @@ may be incompatible with the MIPS VZ ASE. virtualization, including standard guest virtual memory segments. == ========================================================================== -8.6 KVM_CAP_MIPS_TE -------------------- - -:Architectures: mips - -This capability, if KVM_CHECK_EXTENSION on the main kvm handle indicates that -it is available, means that the trap & emulate implementation is available to -run guest code in user mode, even if KVM_CAP_MIPS_VZ indicates that hardware -assisted virtualisation is also available. KVM_VM_MIPS_TE (0) must be passed -to KVM_CREATE_VM to create a VM which utilises it. - -If KVM_CHECK_EXTENSION on a kvm VM handle indicates that this capability is -available, it means that the VM is using trap & emulate. - 8.7 KVM_CAP_MIPS_64BIT ---------------------- @@ -8388,16 +8810,6 @@ virtual SMT modes that can be set using KVM_CAP_PPC_SMT. If bit N (counting from the right) is set, then a virtual SMT mode of 2^N is available. -8.11 KVM_CAP_HYPERV_SYNIC2 --------------------------- - -:Architectures: x86 - -This capability enables a newer version of Hyper-V Synthetic interrupt -controller (SynIC). The only difference with KVM_CAP_HYPERV_SYNIC is that KVM -doesn't clear SynIC message and event flags pages when they are enabled by -writing to the respective MSRs. - 8.12 KVM_CAP_HYPERV_VP_INDEX ---------------------------- @@ -8412,7 +8824,6 @@ capability is absent, userspace can still query this msr's value. ------------------------------- :Architectures: s390 -:Parameters: none This capability indicates if the flic device will be able to get/set the AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows @@ -8486,21 +8897,6 @@ This capability indicates that KVM supports paravirtualized Hyper-V IPI send hypercalls: HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx. -8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH ------------------------------------ - -:Architectures: x86 - -This capability indicates that KVM running on top of Hyper-V hypervisor -enables Direct TLB flush for its guests meaning that TLB flush -hypercalls are handled by Level 0 hypervisor (Hyper-V) bypassing KVM. 
-Due to the different ABI for hypercall parameters between Hyper-V and -KVM, enabling this capability effectively disables all hypercall -handling by KVM (as some KVM hypercall may be mistakenly treated as TLB -flush hypercalls by Hyper-V) so userspace should disable KVM identification -in CPUID and only exposes Hyper-V identification. In this case, guest -thinks it's running on Hyper-V and only use Hyper-V hypercalls. - 8.22 KVM_CAP_S390_VCPU_RESETS ----------------------------- @@ -8578,140 +8974,6 @@ In combination with KVM_CAP_X86_USER_SPACE_MSR, this allows user space to trap and emulate MSRs that are outside of the scope of KVM as well as limit the attack surface on KVM's MSR emulation code. -8.28 KVM_CAP_ENFORCE_PV_FEATURE_CPUID -------------------------------------- - -Architectures: x86 - -When enabled, KVM will disable paravirtual features provided to the -guest according to the bits in the KVM_CPUID_FEATURES CPUID leaf -(0x40000001). Otherwise, a guest may use the paravirtual features -regardless of what has actually been exposed through the CPUID leaf. - -8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL ----------------------------------------------------------- - -:Architectures: x86, arm64 -:Parameters: args[0] - size of the dirty log ring - -KVM is capable of tracking dirty memory using ring buffers that are -mmapped into userspace; there is one dirty ring per vcpu. - -The dirty ring is available to userspace as an array of -``struct kvm_dirty_gfn``. Each dirty entry is defined as:: - - struct kvm_dirty_gfn { - __u32 flags; - __u32 slot; /* as_id | slot_id */ - __u64 offset; - }; - -The following values are defined for the flags field to define the -current state of the entry:: - - #define KVM_DIRTY_GFN_F_DIRTY BIT(0) - #define KVM_DIRTY_GFN_F_RESET BIT(1) - #define KVM_DIRTY_GFN_F_MASK 0x3 - -Userspace should call KVM_ENABLE_CAP ioctl right after KVM_CREATE_VM -ioctl to enable this capability for the new guest and set the size of -the rings. Enabling the capability is only allowed before creating any -vCPU, and the size of the ring must be a power of two. The larger the -ring buffer, the less likely the ring is full and the VM is forced to -exit to userspace. The optimal size depends on the workload, but it is -recommended that it be at least 64 KiB (4096 entries). - -Just like for dirty page bitmaps, the buffer tracks writes to -all user memory regions for which the KVM_MEM_LOG_DIRTY_PAGES flag was -set in KVM_SET_USER_MEMORY_REGION. Once a memory region is registered -with the flag set, userspace can start harvesting dirty pages from the -ring buffer. - -An entry in the ring buffer can be unused (flag bits ``00``), -dirty (flag bits ``01``) or harvested (flag bits ``1X``). The -state machine for the entry is as follows:: - - dirtied harvested reset - 00 -----------> 01 -------------> 1X -------+ - ^ | - | | - +------------------------------------------+ - -To harvest the dirty pages, userspace accesses the mmapped ring buffer -to read the dirty GFNs. If the flags has the DIRTY bit set (at this stage -the RESET bit must be cleared), then it means this GFN is a dirty GFN. -The userspace should harvest this GFN and mark the flags from state -``01b`` to ``1Xb`` (bit 0 will be ignored by KVM, but bit 1 must be set -to show that this GFN is harvested and waiting for a reset), and move -on to the next GFN. The userspace should continue to do this until the -flags of a GFN have the DIRTY bit cleared, meaning that it has harvested -all the dirty GFNs that were available. 
- -Note that on weakly ordered architectures, userspace accesses to the -ring buffer (and more specifically the 'flags' field) must be ordered, -using load-acquire/store-release accessors when available, or any -other memory barrier that will ensure this ordering. - -It's not necessary for userspace to harvest the all dirty GFNs at once. -However it must collect the dirty GFNs in sequence, i.e., the userspace -program cannot skip one dirty GFN to collect the one next to it. - -After processing one or more entries in the ring buffer, userspace -calls the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about -it, so that the kernel will reprotect those collected GFNs. -Therefore, the ioctl must be called *before* reading the content of -the dirty pages. - -The dirty ring can get full. When it happens, the KVM_RUN of the -vcpu will return with exit reason KVM_EXIT_DIRTY_LOG_FULL. - -The dirty ring interface has a major difference comparing to the -KVM_GET_DIRTY_LOG interface in that, when reading the dirty ring from -userspace, it's still possible that the kernel has not yet flushed the -processor's dirty page buffers into the kernel buffer (with dirty bitmaps, the -flushing is done by the KVM_GET_DIRTY_LOG ioctl). To achieve that, one -needs to kick the vcpu out of KVM_RUN using a signal. The resulting -vmexit ensures that all dirty GFNs are flushed to the dirty rings. - -NOTE: KVM_CAP_DIRTY_LOG_RING_ACQ_REL is the only capability that -should be exposed by weakly ordered architecture, in order to indicate -the additional memory ordering requirements imposed on userspace when -reading the state of an entry and mutating it from DIRTY to HARVESTED. -Architecture with TSO-like ordering (such as x86) are allowed to -expose both KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL -to userspace. - -After enabling the dirty rings, the userspace needs to detect the -capability of KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP to see whether the -ring structures can be backed by per-slot bitmaps. With this capability -advertised, it means the architecture can dirty guest pages without -vcpu/ring context, so that some of the dirty information will still be -maintained in the bitmap structure. KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP -can't be enabled if the capability of KVM_CAP_DIRTY_LOG_RING_ACQ_REL -hasn't been enabled, or any memslot has been existing. - -Note that the bitmap here is only a backup of the ring structure. The -use of the ring and bitmap combination is only beneficial if there is -only a very small amount of memory that is dirtied out of vcpu/ring -context. Otherwise, the stand-alone per-slot bitmap mechanism needs to -be considered. - -To collect dirty bits in the backup bitmap, userspace can use the same -KVM_GET_DIRTY_LOG ioctl. KVM_CLEAR_DIRTY_LOG isn't needed as long as all -the generation of the dirty bits is done in a single pass. Collecting -the dirty bitmap should be the very last thing that the VMM does before -considering the state as complete. VMM needs to ensure that the dirty -state is final and avoid missing dirty pages from another ioctl ordered -after the bitmap collection. - -NOTE: Multiple examples of using the backup bitmap: (1) save vgic/its -tables through command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_SAVE_TABLES} on -KVM device "kvm-arm-vgic-its". (2) restore vgic/its tables through -command KVM_DEV_ARM_{VGIC_GRP_CTRL, ITS_RESTORE_TABLES} on KVM device -"kvm-arm-vgic-its". VGICv3 LPI pending status is restored. 
(3) save -vgic3 pending table through KVM_DEV_ARM_VGIC_{GRP_CTRL, SAVE_PENDING_TABLES} -command on KVM device "kvm-arm-vgic-v3". - 8.30 KVM_CAP_XEN_HVM -------------------- @@ -8776,10 +9038,9 @@ clearing the PVCLOCK_TSC_STABLE_BIT flag in Xen pvclock sources. This will be done when the KVM_CAP_XEN_HVM ioctl sets the KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE flag. -8.31 KVM_CAP_PPC_MULTITCE -------------------------- +8.31 KVM_CAP_SPAPR_MULTITCE +--------------------------- -:Capability: KVM_CAP_PPC_MULTITCE :Architectures: ppc :Type: vm @@ -8811,72 +9072,9 @@ This capability indicates that the KVM virtual PTP service is supported in the host. A VMM can check whether the service is available to the guest on migration. -8.33 KVM_CAP_HYPERV_ENFORCE_CPUID ---------------------------------- - -Architectures: x86 - -When enabled, KVM will disable emulated Hyper-V features provided to the -guest according to the bits Hyper-V CPUID feature leaves. Otherwise, all -currently implemented Hyper-V features are provided unconditionally when -Hyper-V identification is set in the HYPERV_CPUID_INTERFACE (0x40000001) -leaf. - -8.34 KVM_CAP_EXIT_HYPERCALL ---------------------------- - -:Capability: KVM_CAP_EXIT_HYPERCALL -:Architectures: x86 -:Type: vm - -This capability, if enabled, will cause KVM to exit to userspace -with KVM_EXIT_HYPERCALL exit reason to process some hypercalls. - -Calling KVM_CHECK_EXTENSION for this capability will return a bitmask -of hypercalls that can be configured to exit to userspace. -Right now, the only such hypercall is KVM_HC_MAP_GPA_RANGE. - -The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset -of the result of KVM_CHECK_EXTENSION. KVM will forward to userspace -the hypercalls whose corresponding bit is in the argument, and return -ENOSYS for the others. - -8.35 KVM_CAP_PMU_CAPABILITY ---------------------------- - -:Capability: KVM_CAP_PMU_CAPABILITY -:Architectures: x86 -:Type: vm -:Parameters: arg[0] is bitmask of PMU virtualization capabilities. -:Returns: 0 on success, -EINVAL when arg[0] contains invalid bits - -This capability alters PMU virtualization in KVM. - -Calling KVM_CHECK_EXTENSION for this capability returns a bitmask of -PMU virtualization capabilities that can be adjusted on a VM. - -The argument to KVM_ENABLE_CAP is also a bitmask and selects specific -PMU virtualization capabilities to be applied to the VM. This can -only be invoked on a VM prior to the creation of VCPUs. - -At this time, KVM_PMU_CAP_DISABLE is the only capability. Setting -this capability will disable PMU virtualization for that VM. Usermode -should adjust CPUID leaf 0xA to reflect that the PMU is disabled. - -8.36 KVM_CAP_ARM_SYSTEM_SUSPEND -------------------------------- - -:Capability: KVM_CAP_ARM_SYSTEM_SUSPEND -:Architectures: arm64 -:Type: vm - -When enabled, KVM will exit to userspace with KVM_EXIT_SYSTEM_EVENT of -type KVM_SYSTEM_EVENT_SUSPEND to process the guest suspend request. - 8.37 KVM_CAP_S390_PROTECTED_DUMP -------------------------------- -:Capability: KVM_CAP_S390_PROTECTED_DUMP :Architectures: s390 :Type: vm @@ -8886,27 +9084,9 @@ PV guests. The `KVM_PV_DUMP` command is available for the dump related UV data. Also the vcpu ioctl `KVM_S390_PV_CPU_COMMAND` is available and supports the `KVM_PV_DUMP_CPU` subcommand. -8.38 KVM_CAP_VM_DISABLE_NX_HUGE_PAGES -------------------------------------- - -:Capability: KVM_CAP_VM_DISABLE_NX_HUGE_PAGES -:Architectures: x86 -:Type: vm -:Parameters: arg[0] must be 0. 
-:Returns: 0 on success, -EPERM if the userspace process does not - have CAP_SYS_BOOT, -EINVAL if args[0] is not 0 or any vCPUs have been - created. - -This capability disables the NX huge pages mitigation for iTLB MULTIHIT. - -The capability has no effect if the nx_huge_pages module parameter is not set. - -This capability may only be set before any vCPUs are created. - 8.39 KVM_CAP_S390_CPU_TOPOLOGY ------------------------------ -:Capability: KVM_CAP_S390_CPU_TOPOLOGY :Architectures: s390 :Type: vm @@ -8928,37 +9108,9 @@ structure. When getting the Modified Change Topology Report value, the attr->addr must point to a byte where the value will be stored or retrieved from. -8.40 KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE ---------------------------------------- - -:Capability: KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE -:Architectures: arm64 -:Type: vm -:Parameters: arg[0] is the new split chunk size. -:Returns: 0 on success, -EINVAL if any memslot was already created. - -This capability sets the chunk size used in Eager Page Splitting. - -Eager Page Splitting improves the performance of dirty-logging (used -in live migrations) when guest memory is backed by huge-pages. It -avoids splitting huge-pages (into PAGE_SIZE pages) on fault, by doing -it eagerly when enabling dirty logging (with the -KVM_MEM_LOG_DIRTY_PAGES flag for a memory region), or when using -KVM_CLEAR_DIRTY_LOG. - -The chunk size specifies how many pages to break at a time, using a -single allocation for each chunk. Bigger the chunk size, more pages -need to be allocated ahead of time. - -The chunk size needs to be a valid block size. The list of acceptable -block sizes is exposed in KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES as a -64-bit bitmap (each bit describing a block size). The default value is -0, to disable the eager page splitting. - 8.41 KVM_CAP_VM_TYPES --------------------- -:Capability: KVM_CAP_MEMORY_ATTRIBUTES :Architectures: x86 :Type: system ioctl @@ -8975,6 +9127,67 @@ Do not use KVM_X86_SW_PROTECTED_VM for "real" VMs, and especially not in production. The behavior and effective ABI for software-protected VMs is unstable. +8.42 KVM_CAP_PPC_RPT_INVALIDATE +------------------------------- + +:Architectures: ppc + +This capability indicates that the kernel is capable of handling +H_RPT_INVALIDATE hcall. + +In order to enable the use of H_RPT_INVALIDATE in the guest, +user space might have to advertise it for the guest. For example, +IBM pSeries (sPAPR) guest starts using it if "hcall-rpt-invalidate" is +present in the "ibm,hypertas-functions" device-tree property. + +This capability is enabled for hypervisors on platforms like POWER9 +that support radix MMU. + +8.43 KVM_CAP_PPC_AIL_MODE_3 +--------------------------- + +:Architectures: ppc + +This capability indicates that the kernel supports the mode 3 setting for the +"Address Translation Mode on Interrupt" aka "Alternate Interrupt Location" +resource that is controlled with the H_SET_MODE hypercall. + +This capability allows a guest kernel to use a better-performance mode for +handling interrupts and system calls. + +8.44 KVM_CAP_MEMORY_FAULT_INFO +------------------------------ + +:Architectures: x86 + +The presence of this capability indicates that KVM_RUN will fill +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if +there is a valid memslot but no backing VMA for the corresponding host virtual +address. 
+ +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set +to KVM_EXIT_MEMORY_FAULT. + +Note: Userspaces which attempt to resolve memory faults so that they can retry +KVM_RUN are encouraged to guard against repeatedly receiving the same +error/annotated fault. + +See KVM_EXIT_MEMORY_FAULT for more information. + +8.45 KVM_CAP_X86_GUEST_MODE +--------------------------- + +:Architectures: x86 + +The presence of this capability indicates that KVM_RUN will update the +KVM_RUN_X86_GUEST_MODE bit in kvm_run.flags to indicate whether the +vCPU was executing nested guest code when it exited. + +KVM exits with the register state of either the L1 or L2 guest +depending on which executed at the time of an exit. Userspace must +take care to differentiate between these cases. + 9. Known KVM API problems ========================= @@ -9005,9 +9218,10 @@ the local APIC. The same is true for the ``KVM_FEATURE_PV_UNHALT`` paravirtualized feature. -CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``. -It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel -has enabled in-kernel emulation of the local APIC. +On older versions of Linux, CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by +``KVM_GET_SUPPORTED_CPUID``, but it can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` +is present and the kernel has enabled in-kernel emulation of the local APIC. +On newer versions, ``KVM_GET_SUPPORTED_CPUID`` does report the bit as available. CPU topology ~~~~~~~~~~~~ diff --git a/Documentation/virt/kvm/arm/fw-pseudo-registers.rst b/Documentation/virt/kvm/arm/fw-pseudo-registers.rst index b90fd0b0fa66..d78b53b05dfc 100644 --- a/Documentation/virt/kvm/arm/fw-pseudo-registers.rst +++ b/Documentation/virt/kvm/arm/fw-pseudo-registers.rst @@ -116,7 +116,7 @@ The pseudo-firmware bitmap register are as follows: ARM DEN0057A. * KVM_REG_ARM_VENDOR_HYP_BMAP: - Controls the bitmap of the Vendor specific Hypervisor Service Calls. + Controls the bitmap of the Vendor specific Hypervisor Service Calls[0-63]. The following bits are accepted: @@ -127,6 +127,19 @@ The pseudo-firmware bitmap register are as follows: Bit-1: KVM_REG_ARM_VENDOR_HYP_BIT_PTP: The bit represents the Precision Time Protocol KVM service. +* KVM_REG_ARM_VENDOR_HYP_BMAP_2: + Controls the bitmap of the Vendor specific Hypervisor Service Calls[64-127]. + + The following bits are accepted: + + Bit-0: KVM_REG_ARM_VENDOR_HYP_BIT_DISCOVER_IMPL_VER + This represents the ARM_SMCCC_VENDOR_HYP_KVM_DISCOVER_IMPL_VER_FUNC_ID + function-id. This is reset to 0. + + Bit-1: KVM_REG_ARM_VENDOR_HYP_BIT_DISCOVER_IMPL_CPUS + This represents the ARM_SMCCC_VENDOR_HYP_KVM_DISCOVER_IMPL_CPUS_FUNC_ID + function-id. This is reset to 0. 
+ Errors: ======= ============================================================= diff --git a/Documentation/virt/kvm/arm/hypercalls.rst b/Documentation/virt/kvm/arm/hypercalls.rst index af7bc2c2e0cb..7400a89aabf8 100644 --- a/Documentation/virt/kvm/arm/hypercalls.rst +++ b/Documentation/virt/kvm/arm/hypercalls.rst @@ -142,3 +142,62 @@ region is equal to the memory protection granule advertised by | | | +---------------------------------------------+ | | | | ``INVALID_PARAMETER (-3)`` | +---------------------+----------+----+---------------------------------------------+ + +``ARM_SMCCC_VENDOR_HYP_KVM_DISCOVER_IMPL_VER_FUNC_ID`` +------------------------------------------------------- +Request the target CPU implementation version information and the number of target +implementations for the Guest VM. + ++---------------------+-------------------------------------------------------------+ +| Presence: | Optional; KVM/ARM64 Guests only | ++---------------------+-------------------------------------------------------------+ +| Calling convention: | HVC64 | ++---------------------+----------+--------------------------------------------------+ +| Function ID: | (uint32) | 0xC6000040 | ++---------------------+----------+--------------------------------------------------+ +| Arguments: | None | ++---------------------+----------+----+---------------------------------------------+ +| Return Values: | (int64) | R0 | ``SUCCESS (0)`` | +| | | +---------------------------------------------+ +| | | | ``NOT_SUPPORTED (-1)`` | +| +----------+----+---------------------------------------------+ +| | (uint64) | R1 | Bits [63:32] Reserved/Must be zero | +| | | +---------------------------------------------+ +| | | | Bits [31:16] Major version | +| | | +---------------------------------------------+ +| | | | Bits [15:0] Minor version | +| +----------+----+---------------------------------------------+ +| | (uint64) | R2 | Number of target implementations | +| +----------+----+---------------------------------------------+ +| | (uint64) | R3 | Reserved / Must be zero | ++---------------------+----------+----+---------------------------------------------+ + +``ARM_SMCCC_VENDOR_HYP_KVM_DISCOVER_IMPL_CPUS_FUNC_ID`` +------------------------------------------------------- + +Request the target CPU implementation information for the Guest VM. The Guest kernel +will use this information to enable the associated errata. 
+ ++---------------------+-------------------------------------------------------------+ +| Presence: | Optional; KVM/ARM64 Guests only | ++---------------------+-------------------------------------------------------------+ +| Calling convention: | HVC64 | ++---------------------+----------+--------------------------------------------------+ +| Function ID: | (uint32) | 0xC6000041 | ++---------------------+----------+----+---------------------------------------------+ +| Arguments: | (uint64) | R1 | selected implementation index | +| +----------+----+---------------------------------------------+ +| | (uint64) | R2 | Reserved / Must be zero | +| +----------+----+---------------------------------------------+ +| | (uint64) | R3 | Reserved / Must be zero | ++---------------------+----------+----+---------------------------------------------+ +| Return Values: | (int64) | R0 | ``SUCCESS (0)`` | +| | | +---------------------------------------------+ +| | | | ``INVALID_PARAMETER (-3)`` | +| +----------+----+---------------------------------------------+ +| | (uint64) | R1 | MIDR_EL1 of the selected implementation | +| +----------+----+---------------------------------------------+ +| | (uint64) | R2 | REVIDR_EL1 of the selected implementation | +| +----------+----+---------------------------------------------+ +| | (uint64) | R3 | AIDR_EL1 of the selected implementation | ++---------------------+----------+----+---------------------------------------------+ diff --git a/Documentation/virt/kvm/devices/arm-vgic-its.rst b/Documentation/virt/kvm/devices/arm-vgic-its.rst index e053124f77c4..0f971f83205a 100644 --- a/Documentation/virt/kvm/devices/arm-vgic-its.rst +++ b/Documentation/virt/kvm/devices/arm-vgic-its.rst @@ -126,7 +126,8 @@ KVM_DEV_ARM_VGIC_GRP_ITS_REGS ITS Restore Sequence: --------------------- -The following ordering must be followed when restoring the GIC and the ITS: +The following ordering must be followed when restoring the GIC, ITS, and +KVM_IRQFD assignments: a) restore all guest memory and create vcpus b) restore all redistributors @@ -139,6 +140,8 @@ d) restore the ITS in the following order: 3. Load the ITS table data (KVM_DEV_ARM_ITS_RESTORE_TABLES) 4. Restore GITS_CTLR +e) restore KVM_IRQFD assignments for MSIs + Then vcpus can be started. ITS Table ABI REV0: diff --git a/Documentation/virt/kvm/devices/arm-vgic-v3.rst b/Documentation/virt/kvm/devices/arm-vgic-v3.rst index 5817edb4e046..ff02102f7141 100644 --- a/Documentation/virt/kvm/devices/arm-vgic-v3.rst +++ b/Documentation/virt/kvm/devices/arm-vgic-v3.rst @@ -78,6 +78,8 @@ Groups: -ENXIO The group or attribute is unknown/unsupported for this device or hardware support is missing. -EFAULT Invalid user pointer for attr->addr. + -EBUSY Attempt to write a register that is read-only after + initialization ======= ============================================================= @@ -120,6 +122,12 @@ Groups: Note that distributor fields are not banked, but return the same value regardless of the mpidr used to access the register. + Userspace is allowed to write the following register fields prior to + initialization of the VGIC: + + * GICD_IIDR.Revision + * GICD_TYPER2.nASSGIcap + GICD_IIDR.Revision is updated when the KVM implementation is changed in a way directly observable by the guest or userspace. Userspace should read GICD_IIDR from KVM and write back the read value to confirm its expected @@ -128,6 +136,12 @@ Groups: behavior. 
+ GICD_TYPER2.nASSGIcap allows userspace to control the support of SGIs + without an active state. At VGIC creation the field resets to the + maximum capability of the system. Userspace is expected to read the field + to determine the supported value(s) before writing to the field. + + The GICD_STATUSR and GICR_STATUSR registers are architecturally defined such that a write of a clear bit has no effect, whereas a write with a set bit clears that value. To allow userspace to freely set the values of these two @@ -202,16 +216,69 @@ Groups: KVM_DEV_ARM_VGIC_GRP_CPU_SYSREGS accesses the CPU interface registers for the CPU specified by the mpidr field. - CPU interface registers access is not implemented for AArch32 mode. - Error -ENXIO is returned when accessed in AArch32 mode. + The available registers are: + + =============== ==================================================== + ICC_PMR_EL1 + ICC_BPR0_EL1 + ICC_AP0R0_EL1 + ICC_AP0R1_EL1 when the host implements at least 6 bits of priority + ICC_AP0R2_EL1 when the host implements 7 bits of priority + ICC_AP0R3_EL1 when the host implements 7 bits of priority + ICC_AP1R0_EL1 + ICC_AP1R1_EL1 when the host implements at least 6 bits of priority + ICC_AP1R2_EL1 when the host implements 7 bits of priority + ICC_AP1R3_EL1 when the host implements 7 bits of priority + ICC_BPR1_EL1 + ICC_CTLR_EL1 + ICC_SRE_EL1 + ICC_IGRPEN0_EL1 + ICC_IGRPEN1_EL1 + =============== ==================================================== + + When EL2 is available for the guest, these registers are also available: + + ============= ==================================================== + ICH_AP0R0_EL2 + ICH_AP0R1_EL2 when the host implements at least 6 bits of priority + ICH_AP0R2_EL2 when the host implements 7 bits of priority + ICH_AP0R3_EL2 when the host implements 7 bits of priority + ICH_AP1R0_EL2 + ICH_AP1R1_EL2 when the host implements at least 6 bits of priority + ICH_AP1R2_EL2 when the host implements 7 bits of priority + ICH_AP1R3_EL2 when the host implements 7 bits of priority + ICH_HCR_EL2 + ICC_SRE_EL2 + ICH_VTR_EL2 + ICH_VMCR_EL2 + ICH_LR0_EL2 + ICH_LR1_EL2 + ICH_LR2_EL2 + ICH_LR3_EL2 + ICH_LR4_EL2 + ICH_LR5_EL2 + ICH_LR6_EL2 + ICH_LR7_EL2 + ICH_LR8_EL2 + ICH_LR9_EL2 + ICH_LR10_EL2 + ICH_LR11_EL2 + ICH_LR12_EL2 + ICH_LR13_EL2 + ICH_LR14_EL2 + ICH_LR15_EL2 + ============= ==================================================== + + CPU interface registers are only described using the AArch64 + encoding. Errors: - ======= ===================================================== - -ENXIO Getting or setting this register is not yet supported + ======= ================================================= + -ENXIO Getting or setting this register is not supported -EBUSY VCPU is running -EINVAL Invalid mpidr or register value supplied - ======= ===================================================== + ======= ================================================= KVM_DEV_ARM_VGIC_GRP_NR_IRQS @@ -291,8 +358,18 @@ Groups: | Aff3 | Aff2 | Aff1 | Aff0 | Errors: - ======= ============================================= -EINVAL vINTID is not multiple of 32 or info field is not VGIC_LEVEL_INFO_LINE_LEVEL ======= ============================================= + + KVM_DEV_ARM_VGIC_GRP_MAINT_IRQ + Attributes: + + The attr field of kvm_device_attr encodes the following values: + + bits: | 31 .... 5 | 4 .... 0 | + values: | RES0 | vINTID | + + The vINTID specifies which interrupt is generated when the vGIC + must generate a maintenance interrupt. This must be a PPI. 
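+
+  As a non-normative sketch following the attr encoding documented above,
+  and assuming ``vgic_fd`` is the device file descriptor returned by
+  KVM_CREATE_DEVICE for the "kvm-arm-vgic-v3" device while PPI 25 has been
+  chosen as the maintenance interrupt::
+
+    #include <linux/kvm.h>
+    #include <string.h>
+    #include <sys/ioctl.h>
+
+    static int set_vgic_maint_irq(int vgic_fd, unsigned int vintid)
+    {
+            struct kvm_device_attr attr;
+
+            memset(&attr, 0, sizeof(attr));
+            attr.group = KVM_DEV_ARM_VGIC_GRP_MAINT_IRQ;
+            attr.attr = vintid;     /* vINTID in bits [4:0], must be a PPI */
+
+            return ioctl(vgic_fd, KVM_SET_DEVICE_ATTR, &attr);
+    }
+
+    /* e.g. set_vgic_maint_irq(vgic_fd, 25) to use PPI 25 */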
diff --git a/Documentation/virt/kvm/devices/s390_flic.rst b/Documentation/virt/kvm/devices/s390_flic.rst index ea96559ba501..b784f8016748 100644 --- a/Documentation/virt/kvm/devices/s390_flic.rst +++ b/Documentation/virt/kvm/devices/s390_flic.rst @@ -58,11 +58,15 @@ Groups: Enables async page faults for the guest. So in case of a major page fault the host is allowed to handle this async and continues the guest. + -EINVAL is returned when called on the FLIC of a ucontrol VM. + KVM_DEV_FLIC_APF_DISABLE_WAIT Disables async page faults for the guest and waits until already pending async page faults are done. This is necessary to trigger a completion interrupt for every init interrupt before migrating the interrupt list. + -EINVAL is returned when called on the FLIC of a ucontrol VM. + KVM_DEV_FLIC_ADAPTER_REGISTER Register an I/O adapter interrupt source. Takes a kvm_s390_io_adapter describing the adapter to register:: diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst index 31f14ec4a65b..60bf205cb373 100644 --- a/Documentation/virt/kvm/devices/vcpu.rst +++ b/Documentation/virt/kvm/devices/vcpu.rst @@ -137,13 +137,37 @@ exit_reason = KVM_EXIT_FAIL_ENTRY and populate the fail_entry struct by setting hardare_entry_failure_reason field to KVM_EXIT_FAIL_ENTRY_CPU_UNSUPPORTED and the cpu field to the processor id. +1.5 ATTRIBUTE: KVM_ARM_VCPU_PMU_V3_SET_NR_COUNTERS +-------------------------------------------------- + +:Parameters: in kvm_device_attr.addr the address to an unsigned int + representing the maximum value taken by PMCR_EL0.N + +:Returns: + + ======= ==================================================== + -EBUSY PMUv3 already initialized, a VCPU has already run or + an event filter has already been set + -EFAULT Error accessing the value pointed to by addr + -ENODEV PMUv3 not supported or GIC not initialized + -EINVAL No PMUv3 explicitly selected, or value of N out of + range + ======= ==================================================== + +Set the number of implemented event counters in the virtual PMU. This +mandates that a PMU has explicitly been selected via +KVM_ARM_VCPU_PMU_V3_SET_PMU, and will fail when no PMU has been +explicitly selected, or the number of counters is out of range for the +selected PMU. Selecting a new PMU cancels the effect of setting this +attribute. + 2. GROUP: KVM_ARM_VCPU_TIMER_CTRL ================================= :Architectures: ARM64 -2.1. ATTRIBUTES: KVM_ARM_VCPU_TIMER_IRQ_VTIMER, KVM_ARM_VCPU_TIMER_IRQ_PTIMER ------------------------------------------------------------------------------ +2.1. ATTRIBUTES: KVM_ARM_VCPU_TIMER_IRQ_{VTIMER,PTIMER,HVTIMER,HPTIMER} +----------------------------------------------------------------------- :Parameters: in kvm_device_attr.addr the address for the timer interrupt is a pointer to an int @@ -159,10 +183,12 @@ A value describing the architected timer interrupt number when connected to an in-kernel virtual GIC. These must be a PPI (16 <= intid < 32). Setting the attribute overrides the default values (see below). 
-============================= ========================================== -KVM_ARM_VCPU_TIMER_IRQ_VTIMER The EL1 virtual timer intid (default: 27) -KVM_ARM_VCPU_TIMER_IRQ_PTIMER The EL1 physical timer intid (default: 30) -============================= ========================================== +============================== ========================================== +KVM_ARM_VCPU_TIMER_IRQ_VTIMER The EL1 virtual timer intid (default: 27) +KVM_ARM_VCPU_TIMER_IRQ_PTIMER The EL1 physical timer intid (default: 30) +KVM_ARM_VCPU_TIMER_IRQ_HVTIMER The EL2 virtual timer intid (default: 28) +KVM_ARM_VCPU_TIMER_IRQ_HPTIMER The EL2 physical timer intid (default: 26) +============================== ========================================== Setting the same PPI for different timers will prevent the VCPUs from running. Setting the interrupt number on a VCPU configures all VCPUs created at that diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst index 1bedd56e2fe3..ae8bce7fecbe 100644 --- a/Documentation/virt/kvm/locking.rst +++ b/Documentation/virt/kvm/locking.rst @@ -135,8 +135,8 @@ We dirty-log for gfn1, that means gfn2 is lost in dirty-bitmap. For direct sp, we can easily avoid it since the spte of direct sp is fixed to gfn. For indirect sp, we disabled fast page fault for simplicity. -A solution for indirect sp could be to pin the gfn, for example via -gfn_to_pfn_memslot_atomic, before the cmpxchg. After the pinning: +A solution for indirect sp could be to pin the gfn before the cmpxchg. After +the pinning: - We have held the refcount of pfn; that means the pfn can not be freed and be reused for another gfn. @@ -147,54 +147,56 @@ Then, we can ensure the dirty bitmaps is correctly set for a gfn. 2) Dirty bit tracking -In the origin code, the spte can be fast updated (non-atomically) if the +In the original code, the spte can be fast updated (non-atomically) if the spte is read-only and the Accessed bit has already been set since the Accessed bit and Dirty bit can not be lost. But it is not true after fast page fault since the spte can be marked writable between reading spte and updating spte. Like below case: -+------------------------------------------------------------------------+ -| At the beginning:: | -| | -| spte.W = 0 | -| spte.Accessed = 1 | -+------------------------------------+-----------------------------------+ -| CPU 0: | CPU 1: | -+------------------------------------+-----------------------------------+ -| In mmu_spte_clear_track_bits():: | | -| | | -| old_spte = *spte; | | -| | | -| | | -| /* 'if' condition is satisfied. */| | -| if (old_spte.Accessed == 1 && | | -| old_spte.W == 0) | | -| spte = 0ull; | | -+------------------------------------+-----------------------------------+ -| | on fast page fault path:: | -| | | -| | spte.W = 1 | -| | | -| | memory write on the spte:: | -| | | -| | spte.Dirty = 1 | -+------------------------------------+-----------------------------------+ -| :: | | -| | | -| else | | -| old_spte = xchg(spte, 0ull) | | -| if (old_spte.Accessed == 1) | | -| kvm_set_pfn_accessed(spte.pfn);| | -| if (old_spte.Dirty == 1) | | -| kvm_set_pfn_dirty(spte.pfn); | | -| OOPS!!! 
| | -+------------------------------------+-----------------------------------+ ++-------------------------------------------------------------------------+ +| At the beginning:: | +| | +| spte.W = 0 | +| spte.Accessed = 1 | ++-------------------------------------+-----------------------------------+ +| CPU 0: | CPU 1: | ++-------------------------------------+-----------------------------------+ +| In mmu_spte_update():: | | +| | | +| old_spte = *spte; | | +| | | +| | | +| /* 'if' condition is satisfied. */ | | +| if (old_spte.Accessed == 1 && | | +| old_spte.W == 0) | | +| spte = new_spte; | | ++-------------------------------------+-----------------------------------+ +| | on fast page fault path:: | +| | | +| | spte.W = 1 | +| | | +| | memory write on the spte:: | +| | | +| | spte.Dirty = 1 | ++-------------------------------------+-----------------------------------+ +| :: | | +| | | +| else | | +| old_spte = xchg(spte, new_spte);| | +| if (old_spte.Accessed && | | +| !new_spte.Accessed) | | +| flush = true; | | +| if (old_spte.Dirty && | | +| !new_spte.Dirty) | | +| flush = true; | | +| OOPS!!! | | ++-------------------------------------+-----------------------------------+ The Dirty bit is lost in this case. In order to avoid this kind of issue, we always treat the spte as "volatile" -if it can be updated out of mmu-lock [see spte_has_volatile_bits()]; it means +if it can be updated out of mmu-lock [see spte_needs_atomic_update()]; it means the spte is always atomically updated in this case. 3) flush tlbs due to spte updated @@ -210,7 +212,7 @@ function to update spte (present -> present). Since the spte is "volatile" if it can be updated out of mmu-lock, we always atomically update the spte and the race caused by fast page fault can be avoided. -See the comments in spte_has_volatile_bits() and mmu_spte_update(). +See the comments in spte_needs_atomic_update() and mmu_spte_update(). Lockless Access Tracking: diff --git a/Documentation/virt/kvm/review-checklist.rst b/Documentation/virt/kvm/review-checklist.rst index dc01aea4057b..debac54e14e7 100644 --- a/Documentation/virt/kvm/review-checklist.rst +++ b/Documentation/virt/kvm/review-checklist.rst @@ -7,7 +7,7 @@ Review checklist for kvm patches 1. The patch must follow Documentation/process/coding-style.rst and Documentation/process/submitting-patches.rst. -2. Patches should be against kvm.git master branch. +2. Patches should be against kvm.git master or next branches. 3. If the patch introduces or modifies a new userspace API: - the API must be documented in Documentation/virt/kvm/api.rst @@ -18,10 +18,10 @@ Review checklist for kvm patches 5. New features must default to off (userspace should explicitly request them). Performance improvements can and should default to on. -6. New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID2 +6. New cpu features should be exposed via KVM_GET_SUPPORTED_CPUID2, + or its equivalent for non-x86 architectures -7. Emulator changes should be accompanied by unit tests for qemu-kvm.git - kvm/test directory. +7. The feature should be testable (see below). 8. Changes should be vendor neutral when possible. Changes to common code are better than duplicating changes to vendor code. @@ -36,6 +36,87 @@ Review checklist for kvm patches 11. New guest visible features must either be documented in a hardware manual or be accompanied by documentation. -12. 
Features must be robust against reset and kexec - for example, shared - host/guest memory must be unshared to prevent the host from writing to - guest memory that the guest has not reserved for this purpose. +Testing of KVM code +------------------- + +All features contributed to KVM, and in many cases bugfixes too, should be +accompanied by some kind of tests and/or enablement in open source guests +and VMMs. KVM is covered by multiple test suites: + +*Selftests* + These are low level tests that allow granular testing of kernel APIs. + This includes API failure scenarios, invoking APIs after specific + guest instructions, and testing multiple calls to ``KVM_CREATE_VM`` + within a single test. They are included in the kernel tree at + ``tools/testing/selftests/kvm``. + +``kvm-unit-tests`` + A collection of small guests that test CPU and emulated device features + from a guest's perspective. They run under QEMU or ``kvmtool``, and + are generally not KVM-specific: they can be run with any accelerator + that QEMU support or even on bare metal, making it possible to compare + behavior across hypervisors and processor families. + +Functional test suites + Various sets of functional tests exist, such as QEMU's ``tests/functional`` + suite and `avocado-vt <https://avocado-vt.readthedocs.io/en/latest/>`__. + These typically involve running a full operating system in a virtual + machine. + +The best testing approach depends on the feature's complexity and +operation. Here are some examples and guidelines: + +New instructions (no new registers or APIs) + The corresponding CPU features (if applicable) should be made available + in QEMU. If the instructions require emulation support or other code in + KVM, it is worth adding coverage to ``kvm-unit-tests`` or selftests; + the latter can be a better choice if the instructions relate to an API + that already has good selftest coverage. + +New hardware features (new registers, no new APIs) + These should be tested via ``kvm-unit-tests``; this more or less implies + supporting them in QEMU and/or ``kvmtool``. In some cases selftests + can be used instead, similar to the previous case, or specifically to + test corner cases in guest state save/restore. + +Bug fixes and performance improvements + These usually do not introduce new APIs, but it's worth sharing + any benchmarks and tests that will validate your contribution, + ideally in the form of regression tests. Tests and benchmarks + can be included in either ``kvm-unit-tests`` or selftests, depending + on the specifics of your change. Selftests are especially useful for + regression tests because they are included directly in Linux's tree. + +Large scale internal changes + While it's difficult to provide a single policy, you should ensure that + the changed code is covered by either ``kvm-unit-tests`` or selftests. + In some cases the affected code is run for any guests and functional + tests suffice. Explain your testing process in the cover letter, + as that can help identify gaps in existing test suites. + +New APIs + It is important to demonstrate your use case. This can be as simple as + explaining that the feature is already in use on bare metal, or it can be + a proof-of-concept implementation in userspace. The latter need not be + open source, though that is of course preferrable for easier testing. + Selftests should test corner cases of the APIs, and should also cover + basic host and guest operation if no open source VMM uses the feature. 
+ +Bigger features, usually spanning host and guest + These should be supported by Linux guests, with limited exceptions for + Hyper-V features that are testable on Windows guests. It is strongly + suggested that the feature be usable with an open source host VMM, such + as at least one of QEMU or crosvm, and guest firmware. Selftests should + test at least API error cases. Guest operation can be covered by + either selftests of ``kvm-unit-tests`` (this is especially important for + paravirtualized and Windows-only features). Strong selftest coverage + can also be a replacement for implementation in an open source VMM, + but this is generally not recommended. + +Following the above suggestions for testing in selftests and +``kvm-unit-tests`` will make it easier for the maintainers to review +and accept your code. In fact, even before you contribute your changes +upstream it will make it easier for you to develop for KVM. + +Of course, the KVM maintainers reserve the right to require more tests, +though they may also waive the requirement from time to time. diff --git a/Documentation/virt/kvm/x86/errata.rst b/Documentation/virt/kvm/x86/errata.rst index 4116045a8744..37c79362a48f 100644 --- a/Documentation/virt/kvm/x86/errata.rst +++ b/Documentation/virt/kvm/x86/errata.rst @@ -33,6 +33,18 @@ Note however that any software (e.g ``WIN87EM.DLL``) expecting these features to be present likely predates these CPUID feature bits, and therefore doesn't know to check for them anyway. +``KVM_SET_VCPU_EVENTS`` issue +----------------------------- + +Invalid KVM_SET_VCPU_EVENTS input with respect to error codes *may* result in +failed VM-Entry on Intel CPUs. Pre-CET Intel CPUs require that exception +injection through the VMCS correctly set the "error code valid" flag, e.g. +require the flag be set when injecting a #GP, clear when injecting a #UD, +clear when injecting a soft exception, etc. Intel CPUs that enumerate +IA32_VMX_BASIC[56] as '1' relax VMX's consistency checks, and AMD CPUs have no +restrictions whatsoever. KVM_SET_VCPU_EVENTS doesn't sanity check the vector +versus "has_error_code", i.e. KVM's ABI follows AMD behavior. + Nested virtualization features ------------------------------ diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst index 9ece6b8dc817..851e99174762 100644 --- a/Documentation/virt/kvm/x86/index.rst +++ b/Documentation/virt/kvm/x86/index.rst @@ -11,6 +11,7 @@ KVM for x86 systems cpuid errata hypercalls + intel-tdx mmu msr nested-vmx diff --git a/Documentation/virt/kvm/x86/intel-tdx.rst b/Documentation/virt/kvm/x86/intel-tdx.rst new file mode 100644 index 000000000000..5efac62c92c7 --- /dev/null +++ b/Documentation/virt/kvm/x86/intel-tdx.rst @@ -0,0 +1,268 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Intel Trust Domain Extensions (TDX) +=================================== + +Overview +======== +Intel's Trust Domain Extensions (TDX) protect confidential guest VMs from the +host and physical attacks. A CPU-attested software module called 'the TDX +module' runs inside a new CPU isolated range to provide the functionalities to +manage and run protected VMs, a.k.a, TDX guests or TDs. + +Please refer to [1] for the whitepaper, specifications and other resources. + +This documentation describes TDX-specific KVM ABIs. The TDX module needs to be +initialized before it can be used by KVM to run any TDX guests. 
The host +core-kernel provides the support of initializing the TDX module, which is +described in the Documentation/arch/x86/tdx.rst. + +API description +=============== + +KVM_MEMORY_ENCRYPT_OP +--------------------- +:Type: vm ioctl, vcpu ioctl + +For TDX operations, KVM_MEMORY_ENCRYPT_OP is re-purposed to be generic +ioctl with TDX specific sub-ioctl() commands. + +:: + + /* Trust Domain Extensions sub-ioctl() commands. */ + enum kvm_tdx_cmd_id { + KVM_TDX_CAPABILITIES = 0, + KVM_TDX_INIT_VM, + KVM_TDX_INIT_VCPU, + KVM_TDX_INIT_MEM_REGION, + KVM_TDX_FINALIZE_VM, + KVM_TDX_GET_CPUID, + + KVM_TDX_CMD_NR_MAX, + }; + + struct kvm_tdx_cmd { + /* enum kvm_tdx_cmd_id */ + __u32 id; + /* flags for sub-command. If sub-command doesn't use this, set zero. */ + __u32 flags; + /* + * data for each sub-command. An immediate or a pointer to the actual + * data in process virtual address. If sub-command doesn't use it, + * set zero. + */ + __u64 data; + /* + * Auxiliary error code. The sub-command may return TDX SEAMCALL + * status code in addition to -Exxx. + */ + __u64 hw_error; + }; + +KVM_TDX_CAPABILITIES +-------------------- +:Type: vm ioctl +:Returns: 0 on success, <0 on error + +Return the TDX capabilities that current KVM supports with the specific TDX +module loaded in the system. It reports what features/capabilities are allowed +to be configured to the TDX guest. + +- id: KVM_TDX_CAPABILITIES +- flags: must be 0 +- data: pointer to struct kvm_tdx_capabilities +- hw_error: must be 0 + +:: + + struct kvm_tdx_capabilities { + __u64 supported_attrs; + __u64 supported_xfam; + + /* TDG.VP.VMCALL hypercalls executed in kernel and forwarded to + * userspace, respectively + */ + __u64 kernel_tdvmcallinfo_1_r11; + __u64 user_tdvmcallinfo_1_r11; + + /* TDG.VP.VMCALL instruction executions subfunctions executed in kernel + * and forwarded to userspace, respectively + */ + __u64 kernel_tdvmcallinfo_1_r12; + __u64 user_tdvmcallinfo_1_r12; + + __u64 reserved[250]; + + /* Configurable CPUID bits for userspace */ + struct kvm_cpuid2 cpuid; + }; + + +KVM_TDX_INIT_VM +--------------- +:Type: vm ioctl +:Returns: 0 on success, <0 on error + +Perform TDX specific VM initialization. This needs to be called after +KVM_CREATE_VM and before creating any VCPUs. + +- id: KVM_TDX_INIT_VM +- flags: must be 0 +- data: pointer to struct kvm_tdx_init_vm +- hw_error: must be 0 + +:: + + struct kvm_tdx_init_vm { + __u64 attributes; + __u64 xfam; + __u64 mrconfigid[6]; /* sha384 digest */ + __u64 mrowner[6]; /* sha384 digest */ + __u64 mrownerconfig[6]; /* sha384 digest */ + + /* The total space for TD_PARAMS before the CPUIDs is 256 bytes */ + __u64 reserved[12]; + + /* + * Call KVM_TDX_INIT_VM before vcpu creation, thus before + * KVM_SET_CPUID2. + * This configuration supersedes KVM_SET_CPUID2s for VCPUs because the + * TDX module directly virtualizes those CPUIDs without VMM. The user + * space VMM, e.g. qemu, should make KVM_SET_CPUID2 consistent with + * those values. If it doesn't, KVM may have wrong idea of vCPUIDs of + * the guest, and KVM may wrongly emulate CPUIDs or MSRs that the TDX + * module doesn't virtualize. + */ + struct kvm_cpuid2 cpuid; + }; + + +KVM_TDX_INIT_VCPU +----------------- +:Type: vcpu ioctl +:Returns: 0 on success, <0 on error + +Perform TDX specific VCPU initialization. 
+ +- id: KVM_TDX_INIT_VCPU +- flags: must be 0 +- data: initial value of the guest TD VCPU RCX +- hw_error: must be 0 + +KVM_TDX_INIT_MEM_REGION +----------------------- +:Type: vcpu ioctl +:Returns: 0 on success, <0 on error + +Initialize @nr_pages TDX guest private memory starting from @gpa with userspace +provided data from @source_addr. + +Note, before calling this sub command, memory attribute of the range +[gpa, gpa + nr_pages] needs to be private. Userspace can use +KVM_SET_MEMORY_ATTRIBUTES to set the attribute. + +If KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends measurement. + +- id: KVM_TDX_INIT_MEM_REGION +- flags: currently only KVM_TDX_MEASURE_MEMORY_REGION is defined +- data: pointer to struct kvm_tdx_init_mem_region +- hw_error: must be 0 + +:: + + #define KVM_TDX_MEASURE_MEMORY_REGION (1UL << 0) + + struct kvm_tdx_init_mem_region { + __u64 source_addr; + __u64 gpa; + __u64 nr_pages; + }; + + +KVM_TDX_FINALIZE_VM +------------------- +:Type: vm ioctl +:Returns: 0 on success, <0 on error + +Complete measurement of the initial TD contents and mark it ready to run. + +- id: KVM_TDX_FINALIZE_VM +- flags: must be 0 +- data: must be 0 +- hw_error: must be 0 + + +KVM_TDX_GET_CPUID +----------------- +:Type: vcpu ioctl +:Returns: 0 on success, <0 on error + +Get the CPUID values that the TDX module virtualizes for the TD guest. +When it returns -E2BIG, the user space should allocate a larger buffer and +retry. The minimum buffer size is updated in the nent field of the +struct kvm_cpuid2. + +- id: KVM_TDX_GET_CPUID +- flags: must be 0 +- data: pointer to struct kvm_cpuid2 (in/out) +- hw_error: must be 0 (out) + +:: + + struct kvm_cpuid2 { + __u32 nent; + __u32 padding; + struct kvm_cpuid_entry2 entries[0]; + }; + + struct kvm_cpuid_entry2 { + __u32 function; + __u32 index; + __u32 flags; + __u32 eax; + __u32 ebx; + __u32 ecx; + __u32 edx; + __u32 padding[3]; + }; + +KVM TDX creation flow +===================== +In addition to the standard KVM flow, new TDX ioctls need to be called. The +control flow is as follows: + +#. Check system wide capability + + * KVM_CAP_VM_TYPES: Check if VM type is supported and if KVM_X86_TDX_VM + is supported. + +#. Create VM + + * KVM_CREATE_VM + * KVM_TDX_CAPABILITIES: Query TDX capabilities for creating TDX guests. + * KVM_CHECK_EXTENSION(KVM_CAP_MAX_VCPUS): Query maximum VCPUs the TD can + support at VM level (TDX has its own limitation on this). + * KVM_SET_TSC_KHZ: Configure TD's TSC frequency if a different TSC frequency + than host is desired. This is Optional. + * KVM_TDX_INIT_VM: Pass TDX specific VM parameters. + +#. Create VCPU + + * KVM_CREATE_VCPU + * KVM_TDX_INIT_VCPU: Pass TDX specific VCPU parameters. + * KVM_SET_CPUID2: Configure TD's CPUIDs. + * KVM_SET_MSRS: Configure TD's MSRs. + +#. Initialize initial guest memory + + * Prepare content of initial guest memory. + * KVM_TDX_INIT_MEM_REGION: Add initial guest memory. + * KVM_TDX_FINALIZE_VM: Finalize the measurement of the TDX guest. + +#. Run VCPU + +References +========== + +https://www.intel.com/content/www/us/en/developer/tools/trust-domain-extensions/documentation.html diff --git a/Documentation/virt/uml/user_mode_linux_howto_v2.rst b/Documentation/virt/uml/user_mode_linux_howto_v2.rst index 584000b743f3..c37e8e594d12 100644 --- a/Documentation/virt/uml/user_mode_linux_howto_v2.rst +++ b/Documentation/virt/uml/user_mode_linux_howto_v2.rst @@ -147,18 +147,12 @@ The image hostname will be set to the same as the host on which you are creating its image. 
It is a good idea to change that to avoid "Oh, bummer, I rebooted the wrong machine". -UML supports two classes of network devices - the older uml_net ones -which are scheduled for obsoletion. These are called ethX. It also -supports the newer vector IO devices which are significantly faster -and have support for some standard virtual network encapsulations like -Ethernet over GRE and Ethernet over L2TPv3. These are called vec0. +UML supports vector I/O high performance network devices which have +support for some standard virtual network encapsulations like +Ethernet over GRE and Ethernet over L2TPv3. These are called vecX. -Depending on which one is in use, ``/etc/network/interfaces`` will -need entries like:: - - # legacy UML network devices - auto eth0 - iface eth0 inet dhcp +When vector network devices are in use, ``/etc/network/interfaces`` +will need entries like:: # vector UML network devices auto vec0 @@ -219,16 +213,6 @@ remote UML and other VM instances. +-----------+--------+------------------------------------+------------+ | vde | vector | dep. on VDE VPN: Virt.Net Locator | varies | +-----------+--------+------------------------------------+------------+ -| tuntap | legacy | none | ~ 500Mbit | -+-----------+--------+------------------------------------+------------+ -| daemon | legacy | none | ~ 450Mbit | -+-----------+--------+------------------------------------+------------+ -| socket | legacy | none | ~ 450Mbit | -+-----------+--------+------------------------------------+------------+ -| ethertap | legacy | obsolete | ~ 500Mbit | -+-----------+--------+------------------------------------+------------+ -| vde | legacy | obsolete | ~ 500Mbit | -+-----------+--------+------------------------------------+------------+ * All transports which have tso and checksum offloads can deliver speeds approaching 10G on TCP streams. @@ -236,27 +220,16 @@ remote UML and other VM instances. * All transports which have multi-packet rx and/or tx can deliver pps rates of up to 1Mps or more. -* All legacy transports are generally limited to ~600-700MBit and 0.05Mps. - * GRE and L2TPv3 allow connections to all of: local machine, remote machines, remote network devices and remote UML instances. -* Socket allows connections only between UML instances. - -* Daemon and bess require running a local switch. This switch may be - connected to the host as well. - Network configuration privileges ================================ The majority of the supported networking modes need ``root`` privileges. -For example, in the legacy tuntap networking mode, users were required -to be part of the group associated with the tunnel device. - -For newer network drivers like the vector transports, ``root`` privilege -is required to fire an ioctl to setup the tun interface and/or use -raw sockets where needed. +For example, for vector transports, ``root`` privilege is required to fire +an ioctl to setup the tun interface and/or use raw sockets where needed. This can be achieved by granting the user a particular capability instead of running UML as root. In case of vector transport, a user can add the @@ -610,12 +583,6 @@ connect to a local area cloud (all the UML nodes using the same multicast address running on hosts in the same multicast domain (LAN) will be automagically connected together to a virtual LAN. -Configuring Legacy transports -============================= - -Legacy transports are now considered obsolete. Please use the vector -versions. - *********** Running UML *********** |