diff options
author | Jason Gunthorpe <jgg@nvidia.com> | 2022-02-24 17:20:18 +0300 |
---|---|---|
committer | Leon Romanovsky <leonro@nvidia.com> | 2022-03-03 13:57:39 +0300 |
commit | 115dcec65f61d53e25e1bed5e380468b30f98b14 (patch) | |
tree | 521de2561eb5f663afa742e033713658abb1a32d /drivers/vfio | |
parent | 445ad495f0ff71553693e6a2a9d7a3bc6917ca36 (diff) | |
download | linux-115dcec65f61d53e25e1bed5e380468b30f98b14.tar.xz |
vfio: Define device migration protocol v2
Replace the existing region based migration protocol with an ioctl based
protocol. The two protocols have the same general semantic behaviors, but
the way the data is transported is changed.
This is the STOP_COPY portion of the new protocol, it defines the 5 states
for basic stop and copy migration and the protocol to move the migration
data in/out of the kernel.
Compared to the clarification of the v1 protocol Alex proposed:
https://lore.kernel.org/r/163909282574.728533.7460416142511440919.stgit@omen
This has a few deliberate functional differences:
- ERROR arcs allow the device function to remain unchanged.
- The protocol is not required to return to the original state on
transition failure. Instead userspace can execute an unwind back to
the original state, reset, or do something else without needing kernel
support. This simplifies the kernel design and should userspace choose
a policy like always reset, avoids doing useless work in the kernel
on error handling paths.
- PRE_COPY is made optional, userspace must discover it before using it.
This reflects the fact that the majority of drivers we are aware of
right now will not implement PRE_COPY.
- segmentation is not part of the data stream protocol, the receiver
does not have to reproduce the framing boundaries.
The hybrid FSM for the device_state is described as a Mealy machine by
documenting each of the arcs the driver is required to implement. Defining
the remaining set of old/new device_state transitions as 'combination
transitions' which are naturally defined as taking multiple FSM arcs along
the shortest path within the FSM's digraph allows a complete matrix of
transitions.
A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE is
defined to replace writing to the device_state field in the region. This
allows returning a brand new FD whenever the requested transition opens
a data transfer session.
The VFIO core code implements the new feature and provides a helper
function to the driver. Using the helper the driver only has to
implement 6 of the FSM arcs and the other combination transitions are
elaborated consistently from those arcs.
A new VFIO_DEVICE_FEATURE of VFIO_DEVICE_FEATURE_MIGRATION is defined to
report the capability for migration and indicate which set of states and
arcs are supported by the device. The FSM provides a lot of flexibility to
make backwards compatible extensions but the VFIO_DEVICE_FEATURE also
allows for future breaking extensions for scenarios that cannot support
even the basic STOP_COPY requirements.
The VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE with the GET option (i.e.
VFIO_DEVICE_FEATURE_GET) can be used to read the current migration state
of the VFIO device.
Data transfer sessions are now carried over a file descriptor, instead of
the region. The FD functions for the lifetime of the data transfer
session. read() and write() transfer the data with normal Linux stream FD
semantics. This design allows future expansion to support poll(),
io_uring, and other performance optimizations.
The complicated mmap mode for data transfer is discarded as current qemu
doesn't take meaningful advantage of it, and the new qemu implementation
avoids substantially all the performance penalty of using a read() on the
region.
Link: https://lore.kernel.org/all/20220224142024.147653-10-yishaih@nvidia.com
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Diffstat (limited to 'drivers/vfio')
-rw-r--r-- | drivers/vfio/vfio.c | 199 |
1 files changed, 199 insertions, 0 deletions
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index 71763e2ac561..b37ab27b511f 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -1557,6 +1557,197 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep) return 0; } +/* + * vfio_mig_get_next_state - Compute the next step in the FSM + * @cur_fsm - The current state the device is in + * @new_fsm - The target state to reach + * @next_fsm - Pointer to the next step to get to new_fsm + * + * Return 0 upon success, otherwise -errno + * Upon success the next step in the state progression between cur_fsm and + * new_fsm will be set in next_fsm. + * + * This breaks down requests for combination transitions into smaller steps and + * returns the next step to get to new_fsm. The function may need to be called + * multiple times before reaching new_fsm. + * + */ +int vfio_mig_get_next_state(struct vfio_device *device, + enum vfio_device_mig_state cur_fsm, + enum vfio_device_mig_state new_fsm, + enum vfio_device_mig_state *next_fsm) +{ + enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RESUMING + 1 }; + /* + * The coding in this table requires the driver to implement 6 + * FSM arcs: + * RESUMING -> STOP + * RUNNING -> STOP + * STOP -> RESUMING + * STOP -> RUNNING + * STOP -> STOP_COPY + * STOP_COPY -> STOP + * + * The coding will step through multiple states for these combination + * transitions: + * RESUMING -> STOP -> RUNNING + * RESUMING -> STOP -> STOP_COPY + * RUNNING -> STOP -> RESUMING + * RUNNING -> STOP -> STOP_COPY + * STOP_COPY -> STOP -> RESUMING + * STOP_COPY -> STOP -> RUNNING + */ + static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = { + [VFIO_DEVICE_STATE_STOP] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_RUNNING] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_STOP_COPY] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_RESUMING] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_ERROR] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + }; + + if (WARN_ON(cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table))) + return -EINVAL; + + if (new_fsm >= ARRAY_SIZE(vfio_from_fsm_table)) + return -EINVAL; + + *next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm]; + return (*next_fsm != VFIO_DEVICE_STATE_ERROR) ? 0 : -EINVAL; +} +EXPORT_SYMBOL_GPL(vfio_mig_get_next_state); + +/* + * Convert the drivers's struct file into a FD number and return it to userspace + */ +static int vfio_ioct_mig_return_fd(struct file *filp, void __user *arg, + struct vfio_device_feature_mig_state *mig) +{ + int ret; + int fd; + + fd = get_unused_fd_flags(O_CLOEXEC); + if (fd < 0) { + ret = fd; + goto out_fput; + } + + mig->data_fd = fd; + if (copy_to_user(arg, mig, sizeof(*mig))) { + ret = -EFAULT; + goto out_put_unused; + } + fd_install(fd, filp); + return 0; + +out_put_unused: + put_unused_fd(fd); +out_fput: + fput(filp); + return ret; +} + +static int +vfio_ioctl_device_feature_mig_device_state(struct vfio_device *device, + u32 flags, void __user *arg, + size_t argsz) +{ + size_t minsz = + offsetofend(struct vfio_device_feature_mig_state, data_fd); + struct vfio_device_feature_mig_state mig; + struct file *filp = NULL; + int ret; + + if (!device->ops->migration_set_state || + !device->ops->migration_get_state) + return -ENOTTY; + + ret = vfio_check_feature(flags, argsz, + VFIO_DEVICE_FEATURE_SET | + VFIO_DEVICE_FEATURE_GET, + sizeof(mig)); + if (ret != 1) + return ret; + + if (copy_from_user(&mig, arg, minsz)) + return -EFAULT; + + if (flags & VFIO_DEVICE_FEATURE_GET) { + enum vfio_device_mig_state curr_state; + + ret = device->ops->migration_get_state(device, &curr_state); + if (ret) + return ret; + mig.device_state = curr_state; + goto out_copy; + } + + /* Handle the VFIO_DEVICE_FEATURE_SET */ + filp = device->ops->migration_set_state(device, mig.device_state); + if (IS_ERR(filp) || !filp) + goto out_copy; + + return vfio_ioct_mig_return_fd(filp, arg, &mig); +out_copy: + mig.data_fd = -1; + if (copy_to_user(arg, &mig, sizeof(mig))) + return -EFAULT; + if (IS_ERR(filp)) + return PTR_ERR(filp); + return 0; +} + +static int vfio_ioctl_device_feature_migration(struct vfio_device *device, + u32 flags, void __user *arg, + size_t argsz) +{ + struct vfio_device_feature_migration mig = { + .flags = VFIO_MIGRATION_STOP_COPY, + }; + int ret; + + if (!device->ops->migration_set_state || + !device->ops->migration_get_state) + return -ENOTTY; + + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET, + sizeof(mig)); + if (ret != 1) + return ret; + if (copy_to_user(arg, &mig, sizeof(mig))) + return -EFAULT; + return 0; +} + static int vfio_ioctl_device_feature(struct vfio_device *device, struct vfio_device_feature __user *arg) { @@ -1582,6 +1773,14 @@ static int vfio_ioctl_device_feature(struct vfio_device *device, return -EINVAL; switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) { + case VFIO_DEVICE_FEATURE_MIGRATION: + return vfio_ioctl_device_feature_migration( + device, feature.flags, arg->data, + feature.argsz - minsz); + case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE: + return vfio_ioctl_device_feature_mig_device_state( + device, feature.flags, arg->data, + feature.argsz - minsz); default: if (unlikely(!device->ops->device_feature)) return -EINVAL; |