diff options
40 files changed, 3558 insertions, 358 deletions
diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index c57c609ad2eb..a7b0223e2886 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -103,6 +103,7 @@ available subsections can be seen below. sync_file vfio-mediated-device vfio + vfio-pci-device-specific-driver-acceptance xilinx/index xillybus zorro diff --git a/Documentation/driver-api/vfio-pci-device-specific-driver-acceptance.rst b/Documentation/driver-api/vfio-pci-device-specific-driver-acceptance.rst new file mode 100644 index 000000000000..b7b99b876b50 --- /dev/null +++ b/Documentation/driver-api/vfio-pci-device-specific-driver-acceptance.rst @@ -0,0 +1,35 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Acceptance criteria for vfio-pci device specific driver variants +================================================================ + +Overview +-------- +The vfio-pci driver exists as a device agnostic driver using the +system IOMMU and relying on the robustness of platform fault +handling to provide isolated device access to userspace. While the +vfio-pci driver does include some device specific support, further +extensions for yet more advanced device specific features are not +sustainable. The vfio-pci driver has therefore split out +vfio-pci-core as a library that may be reused to implement features +requiring device specific knowledge, ex. saving and loading device +state for the purposes of supporting migration. + +In support of such features, it's expected that some device specific +variants may interact with parent devices (ex. SR-IOV PF in support of +a user assigned VF) or other extensions that may not be otherwise +accessible via the vfio-pci base driver. Authors of such drivers +should be diligent not to create exploitable interfaces via these +interactions or allow unchecked userspace data to have an effect +beyond the scope of the assigned device. + +New driver submissions are therefore requested to have approval via +sign-off/ack/review/etc for any interactions with parent drivers. +Additionally, drivers should make an attempt to provide sufficient +documentation for reviewers to understand the device specific +extensions, for example in the case of migration data, how is the +device state composed and consumed, which portions are not otherwise +available to the user via vfio-pci, what safeguards exist to validate +the data, etc. To that extent, authors should additionally expect to +require reviews from at least one of the listed reviewers, in addition +to the overall vfio maintainer. diff --git a/Documentation/maintainer/maintainer-entry-profile.rst b/Documentation/maintainer/maintainer-entry-profile.rst index 5d5cc3acdf85..93b2ae6c34a9 100644 --- a/Documentation/maintainer/maintainer-entry-profile.rst +++ b/Documentation/maintainer/maintainer-entry-profile.rst @@ -103,3 +103,4 @@ to do something different in the near future. ../nvdimm/maintainer-entry-profile ../riscv/patch-acceptance ../driver-api/media/maintainer-entry-profile + ../driver-api/vfio-pci-device-specific-driver-acceptance diff --git a/MAINTAINERS b/MAINTAINERS index 40fa38a11209..3230ad5db30c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8722,9 +8722,9 @@ L: linux-crypto@vger.kernel.org S: Maintained F: Documentation/ABI/testing/debugfs-hisi-zip F: drivers/crypto/hisilicon/qm.c -F: drivers/crypto/hisilicon/qm.h F: drivers/crypto/hisilicon/sgl.c F: drivers/crypto/hisilicon/zip/ +F: include/linux/hisi_acc_qm.h HISILICON ROCE DRIVER M: Wenpeng Liang <liangwenpeng@huawei.com> @@ -20399,6 +20399,13 @@ L: kvm@vger.kernel.org S: Maintained F: drivers/vfio/fsl-mc/ +VFIO HISILICON PCI DRIVER +M: Longfang Liu <liulongfang@huawei.com> +M: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> +L: kvm@vger.kernel.org +S: Maintained +F: drivers/vfio/pci/hisilicon/ + VFIO MEDIATED DEVICE DRIVERS M: Kirti Wankhede <kwankhede@nvidia.com> L: kvm@vger.kernel.org @@ -20408,12 +20415,28 @@ F: drivers/vfio/mdev/ F: include/linux/mdev.h F: samples/vfio-mdev/ +VFIO PCI DEVICE SPECIFIC DRIVERS +R: Jason Gunthorpe <jgg@nvidia.com> +R: Yishai Hadas <yishaih@nvidia.com> +R: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> +R: Kevin Tian <kevin.tian@intel.com> +L: kvm@vger.kernel.org +S: Maintained +P: Documentation/driver-api/vfio-pci-device-specific-driver-acceptance.rst +F: drivers/vfio/pci/*/ + VFIO PLATFORM DRIVER M: Eric Auger <eric.auger@redhat.com> L: kvm@vger.kernel.org S: Maintained F: drivers/vfio/platform/ +VFIO MLX5 PCI DRIVER +M: Yishai Hadas <yishaih@nvidia.com> +L: kvm@vger.kernel.org +S: Maintained +F: drivers/vfio/pci/mlx5/ + VGA_SWITCHEROO R: Lukas Wunner <lukas@wunner.de> S: Maintained diff --git a/drivers/crypto/hisilicon/hpre/hpre.h b/drivers/crypto/hisilicon/hpre/hpre.h index e0b4a1982ee9..9a0558ed82f9 100644 --- a/drivers/crypto/hisilicon/hpre/hpre.h +++ b/drivers/crypto/hisilicon/hpre/hpre.h @@ -4,7 +4,7 @@ #define __HISI_HPRE_H #include <linux/list.h> -#include "../qm.h" +#include <linux/hisi_acc_qm.h> #define HPRE_SQE_SIZE sizeof(struct hpre_sqe) #define HPRE_PF_DEF_Q_NUM 64 diff --git a/drivers/crypto/hisilicon/hpre/hpre_main.c b/drivers/crypto/hisilicon/hpre/hpre_main.c index ebfab3e14499..36ab30e9e654 100644 --- a/drivers/crypto/hisilicon/hpre/hpre_main.c +++ b/drivers/crypto/hisilicon/hpre/hpre_main.c @@ -68,8 +68,7 @@ #define HPRE_REG_RD_INTVRL_US 10 #define HPRE_REG_RD_TMOUT_US 1000 #define HPRE_DBGFS_VAL_MAX_LEN 20 -#define HPRE_PCI_DEVICE_ID 0xa258 -#define HPRE_PCI_VF_DEVICE_ID 0xa259 +#define PCI_DEVICE_ID_HUAWEI_HPRE_PF 0xa258 #define HPRE_QM_USR_CFG_MASK GENMASK(31, 1) #define HPRE_QM_AXI_CFG_MASK GENMASK(15, 0) #define HPRE_QM_VFG_AX_MASK GENMASK(7, 0) @@ -111,8 +110,8 @@ static const char hpre_name[] = "hisi_hpre"; static struct dentry *hpre_debugfs_root; static const struct pci_device_id hpre_dev_ids[] = { - { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, HPRE_PCI_DEVICE_ID) }, - { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, HPRE_PCI_VF_DEVICE_ID) }, + { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HUAWEI_HPRE_PF) }, + { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HUAWEI_HPRE_VF) }, { 0, } }; @@ -242,7 +241,7 @@ MODULE_PARM_DESC(uacce_mode, UACCE_MODE_DESC); static int pf_q_num_set(const char *val, const struct kernel_param *kp) { - return q_num_set(val, kp, HPRE_PCI_DEVICE_ID); + return q_num_set(val, kp, PCI_DEVICE_ID_HUAWEI_HPRE_PF); } static const struct kernel_param_ops hpre_pf_q_num_ops = { @@ -921,7 +920,7 @@ static int hpre_debugfs_init(struct hisi_qm *qm) qm->debug.sqe_mask_len = HPRE_SQE_MASK_LEN; hisi_qm_debug_init(qm); - if (qm->pdev->device == HPRE_PCI_DEVICE_ID) { + if (qm->pdev->device == PCI_DEVICE_ID_HUAWEI_HPRE_PF) { ret = hpre_ctrl_debug_init(qm); if (ret) goto failed_to_create; @@ -958,7 +957,7 @@ static int hpre_qm_init(struct hisi_qm *qm, struct pci_dev *pdev) qm->sqe_size = HPRE_SQE_SIZE; qm->dev_name = hpre_name; - qm->fun_type = (pdev->device == HPRE_PCI_DEVICE_ID) ? + qm->fun_type = (pdev->device == PCI_DEVICE_ID_HUAWEI_HPRE_PF) ? QM_HW_PF : QM_HW_VF; if (qm->fun_type == QM_HW_PF) { qm->qp_base = HPRE_PF_DEF_Q_BASE; @@ -1191,6 +1190,12 @@ static struct pci_driver hpre_pci_driver = { .driver.pm = &hpre_pm_ops, }; +struct pci_driver *hisi_hpre_get_pf_driver(void) +{ + return &hpre_pci_driver; +} +EXPORT_SYMBOL_GPL(hisi_hpre_get_pf_driver); + static void hpre_register_debugfs(void) { if (!debugfs_initialized()) diff --git a/drivers/crypto/hisilicon/qm.c b/drivers/crypto/hisilicon/qm.c index 453390044181..009132333d2b 100644 --- a/drivers/crypto/hisilicon/qm.c +++ b/drivers/crypto/hisilicon/qm.c @@ -15,7 +15,7 @@ #include <linux/uacce.h> #include <linux/uaccess.h> #include <uapi/misc/uacce/hisi_qm.h> -#include "qm.h" +#include <linux/hisi_acc_qm.h> /* eq/aeq irq enable */ #define QM_VF_AEQ_INT_SOURCE 0x0 @@ -33,23 +33,6 @@ #define QM_ABNORMAL_EVENT_IRQ_VECTOR 3 /* mailbox */ -#define QM_MB_CMD_SQC 0x0 -#define QM_MB_CMD_CQC 0x1 -#define QM_MB_CMD_EQC 0x2 -#define QM_MB_CMD_AEQC 0x3 -#define QM_MB_CMD_SQC_BT 0x4 -#define QM_MB_CMD_CQC_BT 0x5 -#define QM_MB_CMD_SQC_VFT_V2 0x6 -#define QM_MB_CMD_STOP_QP 0x8 -#define QM_MB_CMD_SRC 0xc -#define QM_MB_CMD_DST 0xd - -#define QM_MB_CMD_SEND_BASE 0x300 -#define QM_MB_EVENT_SHIFT 8 -#define QM_MB_BUSY_SHIFT 13 -#define QM_MB_OP_SHIFT 14 -#define QM_MB_CMD_DATA_ADDR_L 0x304 -#define QM_MB_CMD_DATA_ADDR_H 0x308 #define QM_MB_PING_ALL_VFS 0xffff #define QM_MB_CMD_DATA_SHIFT 32 #define QM_MB_CMD_DATA_MASK GENMASK(31, 0) @@ -103,19 +86,12 @@ #define QM_DB_CMD_SHIFT_V1 16 #define QM_DB_INDEX_SHIFT_V1 32 #define QM_DB_PRIORITY_SHIFT_V1 48 -#define QM_DOORBELL_SQ_CQ_BASE_V2 0x1000 -#define QM_DOORBELL_EQ_AEQ_BASE_V2 0x2000 #define QM_QUE_ISO_CFG_V 0x0030 #define QM_PAGE_SIZE 0x0034 #define QM_QUE_ISO_EN 0x100154 #define QM_CAPBILITY 0x100158 #define QM_QP_NUN_MASK GENMASK(10, 0) #define QM_QP_DB_INTERVAL 0x10000 -#define QM_QP_MAX_NUM_SHIFT 11 -#define QM_DB_CMD_SHIFT_V2 12 -#define QM_DB_RAND_SHIFT_V2 16 -#define QM_DB_INDEX_SHIFT_V2 32 -#define QM_DB_PRIORITY_SHIFT_V2 48 #define QM_MEM_START_INIT 0x100040 #define QM_MEM_INIT_DONE 0x100044 @@ -693,7 +669,7 @@ static void qm_mb_pre_init(struct qm_mailbox *mailbox, u8 cmd, } /* return 0 mailbox ready, -ETIMEDOUT hardware timeout */ -static int qm_wait_mb_ready(struct hisi_qm *qm) +int hisi_qm_wait_mb_ready(struct hisi_qm *qm) { u32 val; @@ -701,6 +677,7 @@ static int qm_wait_mb_ready(struct hisi_qm *qm) val, !((val >> QM_MB_BUSY_SHIFT) & 0x1), POLL_PERIOD, POLL_TIMEOUT); } +EXPORT_SYMBOL_GPL(hisi_qm_wait_mb_ready); /* 128 bit should be written to hardware at one time to trigger a mailbox */ static void qm_mb_write(struct hisi_qm *qm, const void *src) @@ -726,14 +703,14 @@ static void qm_mb_write(struct hisi_qm *qm, const void *src) static int qm_mb_nolock(struct hisi_qm *qm, struct qm_mailbox *mailbox) { - if (unlikely(qm_wait_mb_ready(qm))) { + if (unlikely(hisi_qm_wait_mb_ready(qm))) { dev_err(&qm->pdev->dev, "QM mailbox is busy to start!\n"); goto mb_busy; } qm_mb_write(qm, mailbox); - if (unlikely(qm_wait_mb_ready(qm))) { + if (unlikely(hisi_qm_wait_mb_ready(qm))) { dev_err(&qm->pdev->dev, "QM mailbox operation timeout!\n"); goto mb_busy; } @@ -745,8 +722,8 @@ mb_busy: return -EBUSY; } -static int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue, - bool op) +int hisi_qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue, + bool op) { struct qm_mailbox mailbox; int ret; @@ -762,6 +739,7 @@ static int qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue, return ret; } +EXPORT_SYMBOL_GPL(hisi_qm_mb); static void qm_db_v1(struct hisi_qm *qm, u16 qn, u8 cmd, u16 index, u8 priority) { @@ -1351,7 +1329,7 @@ static int qm_get_vft_v2(struct hisi_qm *qm, u32 *base, u32 *number) u64 sqc_vft; int ret; - ret = qm_mb(qm, QM_MB_CMD_SQC_VFT_V2, 0, 0, 1); + ret = hisi_qm_mb(qm, QM_MB_CMD_SQC_VFT_V2, 0, 0, 1); if (ret) return ret; @@ -1725,12 +1703,12 @@ static int dump_show(struct hisi_qm *qm, void *info, static int qm_dump_sqc_raw(struct hisi_qm *qm, dma_addr_t dma_addr, u16 qp_id) { - return qm_mb(qm, QM_MB_CMD_SQC, dma_addr, qp_id, 1); + return hisi_qm_mb(qm, QM_MB_CMD_SQC, dma_addr, qp_id, 1); } static int qm_dump_cqc_raw(struct hisi_qm *qm, dma_addr_t dma_addr, u16 qp_id) { - return qm_mb(qm, QM_MB_CMD_CQC, dma_addr, qp_id, 1); + return hisi_qm_mb(qm, QM_MB_CMD_CQC, dma_addr, qp_id, 1); } static int qm_sqc_dump(struct hisi_qm *qm, const char *s) @@ -1842,7 +1820,7 @@ static int qm_eqc_aeqc_dump(struct hisi_qm *qm, char *s, size_t size, if (IS_ERR(xeqc)) return PTR_ERR(xeqc); - ret = qm_mb(qm, cmd, xeqc_dma, 0, 1); + ret = hisi_qm_mb(qm, cmd, xeqc_dma, 0, 1); if (ret) goto err_free_ctx; @@ -2495,7 +2473,7 @@ unlock: static int qm_stop_qp(struct hisi_qp *qp) { - return qm_mb(qp->qm, QM_MB_CMD_STOP_QP, 0, qp->qp_id, 0); + return hisi_qm_mb(qp->qm, QM_MB_CMD_STOP_QP, 0, qp->qp_id, 0); } static int qm_set_msi(struct hisi_qm *qm, bool set) @@ -2763,7 +2741,7 @@ static int qm_sq_ctx_cfg(struct hisi_qp *qp, int qp_id, u32 pasid) return -ENOMEM; } - ret = qm_mb(qm, QM_MB_CMD_SQC, sqc_dma, qp_id, 0); + ret = hisi_qm_mb(qm, QM_MB_CMD_SQC, sqc_dma, qp_id, 0); dma_unmap_single(dev, sqc_dma, sizeof(struct qm_sqc), DMA_TO_DEVICE); kfree(sqc); @@ -2804,7 +2782,7 @@ static int qm_cq_ctx_cfg(struct hisi_qp *qp, int qp_id, u32 pasid) return -ENOMEM; } - ret = qm_mb(qm, QM_MB_CMD_CQC, cqc_dma, qp_id, 0); + ret = hisi_qm_mb(qm, QM_MB_CMD_CQC, cqc_dma, qp_id, 0); dma_unmap_single(dev, cqc_dma, sizeof(struct qm_cqc), DMA_TO_DEVICE); kfree(cqc); @@ -3514,6 +3492,12 @@ static void hisi_qm_pci_uninit(struct hisi_qm *qm) pci_disable_device(pdev); } +static void hisi_qm_set_state(struct hisi_qm *qm, u8 state) +{ + if (qm->ver > QM_HW_V2 && qm->fun_type == QM_HW_VF) + writel(state, qm->io_base + QM_VF_STATE); +} + /** * hisi_qm_uninit() - Uninitialize qm. * @qm: The qm needed uninit. @@ -3542,6 +3526,7 @@ void hisi_qm_uninit(struct hisi_qm *qm) dma_free_coherent(dev, qm->qdma.size, qm->qdma.va, qm->qdma.dma); } + hisi_qm_set_state(qm, QM_NOT_READY); up_write(&qm->qps_lock); qm_irq_unregister(qm); @@ -3655,7 +3640,7 @@ static int qm_eq_ctx_cfg(struct hisi_qm *qm) return -ENOMEM; } - ret = qm_mb(qm, QM_MB_CMD_EQC, eqc_dma, 0, 0); + ret = hisi_qm_mb(qm, QM_MB_CMD_EQC, eqc_dma, 0, 0); dma_unmap_single(dev, eqc_dma, sizeof(struct qm_eqc), DMA_TO_DEVICE); kfree(eqc); @@ -3684,7 +3669,7 @@ static int qm_aeq_ctx_cfg(struct hisi_qm *qm) return -ENOMEM; } - ret = qm_mb(qm, QM_MB_CMD_AEQC, aeqc_dma, 0, 0); + ret = hisi_qm_mb(qm, QM_MB_CMD_AEQC, aeqc_dma, 0, 0); dma_unmap_single(dev, aeqc_dma, sizeof(struct qm_aeqc), DMA_TO_DEVICE); kfree(aeqc); @@ -3723,11 +3708,11 @@ static int __hisi_qm_start(struct hisi_qm *qm) if (ret) return ret; - ret = qm_mb(qm, QM_MB_CMD_SQC_BT, qm->sqc_dma, 0, 0); + ret = hisi_qm_mb(qm, QM_MB_CMD_SQC_BT, qm->sqc_dma, 0, 0); if (ret) return ret; - ret = qm_mb(qm, QM_MB_CMD_CQC_BT, qm->cqc_dma, 0, 0); + ret = hisi_qm_mb(qm, QM_MB_CMD_CQC_BT, qm->cqc_dma, 0, 0); if (ret) return ret; @@ -3767,6 +3752,7 @@ int hisi_qm_start(struct hisi_qm *qm) if (!ret) atomic_set(&qm->status.flags, QM_START); + hisi_qm_set_state(qm, QM_READY); err_unlock: up_write(&qm->qps_lock); return ret; diff --git a/drivers/crypto/hisilicon/sec2/sec.h b/drivers/crypto/hisilicon/sec2/sec.h index d97cf02b1df7..c2e9b01187a7 100644 --- a/drivers/crypto/hisilicon/sec2/sec.h +++ b/drivers/crypto/hisilicon/sec2/sec.h @@ -4,7 +4,7 @@ #ifndef __HISI_SEC_V2_H #define __HISI_SEC_V2_H -#include "../qm.h" +#include <linux/hisi_acc_qm.h> #include "sec_crypto.h" /* Algorithm resource per hardware SEC queue */ diff --git a/drivers/crypto/hisilicon/sec2/sec_main.c b/drivers/crypto/hisilicon/sec2/sec_main.c index 0b9906ff69e3..92fae706bdb2 100644 --- a/drivers/crypto/hisilicon/sec2/sec_main.c +++ b/drivers/crypto/hisilicon/sec2/sec_main.c @@ -20,8 +20,7 @@ #define SEC_VF_NUM 63 #define SEC_QUEUE_NUM_V1 4096 -#define SEC_PF_PCI_DEVICE_ID 0xa255 -#define SEC_VF_PCI_DEVICE_ID 0xa256 +#define PCI_DEVICE_ID_HUAWEI_SEC_PF 0xa255 #define SEC_BD_ERR_CHK_EN0 0xEFFFFFFF #define SEC_BD_ERR_CHK_EN1 0x7ffff7fd @@ -229,7 +228,7 @@ static const struct debugfs_reg32 sec_dfx_regs[] = { static int sec_pf_q_num_set(const char *val, const struct kernel_param *kp) { - return q_num_set(val, kp, SEC_PF_PCI_DEVICE_ID); + return q_num_set(val, kp, PCI_DEVICE_ID_HUAWEI_SEC_PF); } static const struct kernel_param_ops sec_pf_q_num_ops = { @@ -317,8 +316,8 @@ module_param_cb(uacce_mode, &sec_uacce_mode_ops, &uacce_mode, 0444); MODULE_PARM_DESC(uacce_mode, UACCE_MODE_DESC); static const struct pci_device_id sec_dev_ids[] = { - { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, SEC_PF_PCI_DEVICE_ID) }, - { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, SEC_VF_PCI_DEVICE_ID) }, + { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HUAWEI_SEC_PF) }, + { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HUAWEI_SEC_VF) }, { 0, } }; MODULE_DEVICE_TABLE(pci, sec_dev_ids); @@ -748,7 +747,7 @@ static int sec_core_debug_init(struct hisi_qm *qm) regset->base = qm->io_base; regset->dev = dev; - if (qm->pdev->device == SEC_PF_PCI_DEVICE_ID) + if (qm->pdev->device == PCI_DEVICE_ID_HUAWEI_SEC_PF) debugfs_create_file("regs", 0444, tmp_d, regset, &sec_regs_fops); for (i = 0; i < ARRAY_SIZE(sec_dfx_labels); i++) { @@ -766,7 +765,7 @@ static int sec_debug_init(struct hisi_qm *qm) struct sec_dev *sec = container_of(qm, struct sec_dev, qm); int i; - if (qm->pdev->device == SEC_PF_PCI_DEVICE_ID) { + if (qm->pdev->device == PCI_DEVICE_ID_HUAWEI_SEC_PF) { for (i = SEC_CLEAR_ENABLE; i < SEC_DEBUG_FILE_NUM; i++) { spin_lock_init(&sec->debug.files[i].lock); sec->debug.files[i].index = i; @@ -908,7 +907,7 @@ static int sec_qm_init(struct hisi_qm *qm, struct pci_dev *pdev) qm->sqe_size = SEC_SQE_SIZE; qm->dev_name = sec_name; - qm->fun_type = (pdev->device == SEC_PF_PCI_DEVICE_ID) ? + qm->fun_type = (pdev->device == PCI_DEVICE_ID_HUAWEI_SEC_PF) ? QM_HW_PF : QM_HW_VF; if (qm->fun_type == QM_HW_PF) { qm->qp_base = SEC_PF_DEF_Q_BASE; @@ -1120,6 +1119,12 @@ static struct pci_driver sec_pci_driver = { .driver.pm = &sec_pm_ops, }; +struct pci_driver *hisi_sec_get_pf_driver(void) +{ + return &sec_pci_driver; +} +EXPORT_SYMBOL_GPL(hisi_sec_get_pf_driver); + static void sec_register_debugfs(void) { if (!debugfs_initialized()) diff --git a/drivers/crypto/hisilicon/sgl.c b/drivers/crypto/hisilicon/sgl.c index 057273769f26..f7efc02b065f 100644 --- a/drivers/crypto/hisilicon/sgl.c +++ b/drivers/crypto/hisilicon/sgl.c @@ -1,9 +1,9 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2019 HiSilicon Limited. */ #include <linux/dma-mapping.h> +#include <linux/hisi_acc_qm.h> #include <linux/module.h> #include <linux/slab.h> -#include "qm.h" #define HISI_ACC_SGL_SGE_NR_MIN 1 #define HISI_ACC_SGL_NR_MAX 256 diff --git a/drivers/crypto/hisilicon/zip/zip.h b/drivers/crypto/hisilicon/zip/zip.h index 517fdbdff3ea..3dfd3bac5a33 100644 --- a/drivers/crypto/hisilicon/zip/zip.h +++ b/drivers/crypto/hisilicon/zip/zip.h @@ -7,7 +7,7 @@ #define pr_fmt(fmt) "hisi_zip: " fmt #include <linux/list.h> -#include "../qm.h" +#include <linux/hisi_acc_qm.h> enum hisi_zip_error_type { /* negative compression */ diff --git a/drivers/crypto/hisilicon/zip/zip_main.c b/drivers/crypto/hisilicon/zip/zip_main.c index 678f8b58ec42..4534e1e107d1 100644 --- a/drivers/crypto/hisilicon/zip/zip_main.c +++ b/drivers/crypto/hisilicon/zip/zip_main.c @@ -15,8 +15,7 @@ #include <linux/uacce.h> #include "zip.h" -#define PCI_DEVICE_ID_ZIP_PF 0xa250 -#define PCI_DEVICE_ID_ZIP_VF 0xa251 +#define PCI_DEVICE_ID_HUAWEI_ZIP_PF 0xa250 #define HZIP_QUEUE_NUM_V1 4096 @@ -246,7 +245,7 @@ MODULE_PARM_DESC(uacce_mode, UACCE_MODE_DESC); static int pf_q_num_set(const char *val, const struct kernel_param *kp) { - return q_num_set(val, kp, PCI_DEVICE_ID_ZIP_PF); + return q_num_set(val, kp, PCI_DEVICE_ID_HUAWEI_ZIP_PF); } static const struct kernel_param_ops pf_q_num_ops = { @@ -268,8 +267,8 @@ module_param_cb(vfs_num, &vfs_num_ops, &vfs_num, 0444); MODULE_PARM_DESC(vfs_num, "Number of VFs to enable(1-63), 0(default)"); static const struct pci_device_id hisi_zip_dev_ids[] = { - { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_ZIP_PF) }, - { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_ZIP_VF) }, + { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HUAWEI_ZIP_PF) }, + { PCI_DEVICE(PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HUAWEI_ZIP_VF) }, { 0, } }; MODULE_DEVICE_TABLE(pci, hisi_zip_dev_ids); @@ -838,7 +837,7 @@ static int hisi_zip_qm_init(struct hisi_qm *qm, struct pci_dev *pdev) qm->sqe_size = HZIP_SQE_SIZE; qm->dev_name = hisi_zip_name; - qm->fun_type = (pdev->device == PCI_DEVICE_ID_ZIP_PF) ? + qm->fun_type = (pdev->device == PCI_DEVICE_ID_HUAWEI_ZIP_PF) ? QM_HW_PF : QM_HW_VF; if (qm->fun_type == QM_HW_PF) { qm->qp_base = HZIP_PF_DEF_Q_BASE; @@ -1013,6 +1012,12 @@ static struct pci_driver hisi_zip_pci_driver = { .driver.pm = &hisi_zip_pm_ops, }; +struct pci_driver *hisi_zip_get_pf_driver(void) +{ + return &hisi_zip_pci_driver; +} +EXPORT_SYMBOL_GPL(hisi_zip_get_pf_driver); + static void hisi_zip_register_debugfs(void) { if (!debugfs_initialized()) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c index 3eacd8739929..fc19a0762af2 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c @@ -478,6 +478,11 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op, case MLX5_CMD_OP_QUERY_VHCA_STATE: case MLX5_CMD_OP_MODIFY_VHCA_STATE: case MLX5_CMD_OP_ALLOC_SF: + case MLX5_CMD_OP_SUSPEND_VHCA: + case MLX5_CMD_OP_RESUME_VHCA: + case MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE: + case MLX5_CMD_OP_SAVE_VHCA_STATE: + case MLX5_CMD_OP_LOAD_VHCA_STATE: *status = MLX5_DRIVER_STATUS_ABORTED; *synd = MLX5_DRIVER_SYND; return -EIO; @@ -675,6 +680,11 @@ const char *mlx5_command_str(int command) MLX5_COMMAND_STR_CASE(MODIFY_VHCA_STATE); MLX5_COMMAND_STR_CASE(ALLOC_SF); MLX5_COMMAND_STR_CASE(DEALLOC_SF); + MLX5_COMMAND_STR_CASE(SUSPEND_VHCA); + MLX5_COMMAND_STR_CASE(RESUME_VHCA); + MLX5_COMMAND_STR_CASE(QUERY_VHCA_MIGRATION_STATE); + MLX5_COMMAND_STR_CASE(SAVE_VHCA_STATE); + MLX5_COMMAND_STR_CASE(LOAD_VHCA_STATE); default: return "unknown command opcode"; } } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index bba72b220cc3..976578d1904c 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c @@ -1620,6 +1620,7 @@ static void remove_one(struct pci_dev *pdev) struct devlink *devlink = priv_to_devlink(dev); devlink_unregister(devlink); + mlx5_sriov_disable(pdev); mlx5_crdump_disable(dev); mlx5_drain_health_wq(dev); mlx5_uninit_one(dev); @@ -1882,6 +1883,50 @@ static struct pci_driver mlx5_core_driver = { .sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count, }; +/** + * mlx5_vf_get_core_dev - Get the mlx5 core device from a given VF PCI device if + * mlx5_core is its driver. + * @pdev: The associated PCI device. + * + * Upon return the interface state lock stay held to let caller uses it safely. + * Caller must ensure to use the returned mlx5 device for a narrow window + * and put it back with mlx5_vf_put_core_dev() immediately once usage was over. + * + * Return: Pointer to the associated mlx5_core_dev or NULL. + */ +struct mlx5_core_dev *mlx5_vf_get_core_dev(struct pci_dev *pdev) + __acquires(&mdev->intf_state_mutex) +{ + struct mlx5_core_dev *mdev; + + mdev = pci_iov_get_pf_drvdata(pdev, &mlx5_core_driver); + if (IS_ERR(mdev)) + return NULL; + + mutex_lock(&mdev->intf_state_mutex); + if (!test_bit(MLX5_INTERFACE_STATE_UP, &mdev->intf_state)) { + mutex_unlock(&mdev->intf_state_mutex); + return NULL; + } + + return mdev; +} +EXPORT_SYMBOL(mlx5_vf_get_core_dev); + +/** + * mlx5_vf_put_core_dev - Put the mlx5 core device back. + * @mdev: The mlx5 core device. + * + * Upon return the interface state lock is unlocked and caller should not + * access the mdev any more. + */ +void mlx5_vf_put_core_dev(struct mlx5_core_dev *mdev) + __releases(&mdev->intf_state_mutex) +{ + mutex_unlock(&mdev->intf_state_mutex); +} +EXPORT_SYMBOL(mlx5_vf_put_core_dev); + static void mlx5_core_verify_params(void) { if (prof_sel >= ARRAY_SIZE(profile)) { diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h index 6f8baa0f2a73..37b2805b3bf3 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h @@ -164,6 +164,7 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev); int mlx5_sriov_attach(struct mlx5_core_dev *dev); void mlx5_sriov_detach(struct mlx5_core_dev *dev); int mlx5_core_sriov_configure(struct pci_dev *dev, int num_vfs); +void mlx5_sriov_disable(struct pci_dev *pdev); int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count); int mlx5_core_enable_hca(struct mlx5_core_dev *dev, u16 func_id); int mlx5_core_disable_hca(struct mlx5_core_dev *dev, u16 func_id); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c index e8185b69ac6c..887ee0f729d1 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c @@ -161,7 +161,7 @@ static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs) return err; } -static void mlx5_sriov_disable(struct pci_dev *pdev) +void mlx5_sriov_disable(struct pci_dev *pdev) { struct mlx5_core_dev *dev = pci_get_drvdata(pdev); int num_vfs = pci_num_vf(dev->pdev); @@ -205,19 +205,8 @@ int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count) mlx5_get_default_msix_vec_count(dev, pci_num_vf(pf)); sriov = &dev->priv.sriov; - - /* Reversed translation of PCI VF function number to the internal - * function_id, which exists in the name of virtfn symlink. - */ - for (id = 0; id < pci_num_vf(pf); id++) { - if (!sriov->vfs_ctx[id].enabled) - continue; - - if (vf->devfn == pci_iov_virtfn_devfn(pf, id)) - break; - } - - if (id == pci_num_vf(pf) || !sriov->vfs_ctx[id].enabled) + id = pci_iov_vf_id(vf); + if (id < 0 || !sriov->vfs_ctx[id].enabled) return -EINVAL; return mlx5_set_msix_vec_count(dev, id + 1, msix_vec_count); diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 0267977c9f17..952217572113 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -33,6 +33,49 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id) } EXPORT_SYMBOL_GPL(pci_iov_virtfn_devfn); +int pci_iov_vf_id(struct pci_dev *dev) +{ + struct pci_dev *pf; + + if (!dev->is_virtfn) + return -EINVAL; + + pf = pci_physfn(dev); + return (((dev->bus->number << 8) + dev->devfn) - + ((pf->bus->number << 8) + pf->devfn + pf->sriov->offset)) / + pf->sriov->stride; +} +EXPORT_SYMBOL_GPL(pci_iov_vf_id); + +/** + * pci_iov_get_pf_drvdata - Return the drvdata of a PF + * @dev: VF pci_dev + * @pf_driver: Device driver required to own the PF + * + * This must be called from a context that ensures that a VF driver is attached. + * The value returned is invalid once the VF driver completes its remove() + * callback. + * + * Locking is achieved by the driver core. A VF driver cannot be probed until + * pci_enable_sriov() is called and pci_disable_sriov() does not return until + * all VF drivers have completed their remove(). + * + * The PF driver must call pci_disable_sriov() before it begins to destroy the + * drvdata. + */ +void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver) +{ + struct pci_dev *pf_dev; + + if (!dev->is_virtfn) + return ERR_PTR(-EINVAL); + pf_dev = dev->physfn; + if (pf_dev->driver != pf_driver) + return ERR_PTR(-EINVAL); + return pci_get_drvdata(pf_dev); +} +EXPORT_SYMBOL_GPL(pci_iov_get_pf_drvdata); + /* * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may * change when NumVFs changes. diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig index 860424ccda1b..4da1914425e1 100644 --- a/drivers/vfio/pci/Kconfig +++ b/drivers/vfio/pci/Kconfig @@ -43,4 +43,9 @@ config VFIO_PCI_IGD To enable Intel IGD assignment through vfio-pci, say Y. endif + +source "drivers/vfio/pci/mlx5/Kconfig" + +source "drivers/vfio/pci/hisilicon/Kconfig" + endif diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile index 349d68d242b4..7052ebd893e0 100644 --- a/drivers/vfio/pci/Makefile +++ b/drivers/vfio/pci/Makefile @@ -7,3 +7,7 @@ obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o vfio-pci-y := vfio_pci.o vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o obj-$(CONFIG_VFIO_PCI) += vfio-pci.o + +obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/ + +obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/ diff --git a/drivers/vfio/pci/hisilicon/Kconfig b/drivers/vfio/pci/hisilicon/Kconfig new file mode 100644 index 000000000000..5daa0f45d2f9 --- /dev/null +++ b/drivers/vfio/pci/hisilicon/Kconfig @@ -0,0 +1,15 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_ACC_VFIO_PCI + tristate "VFIO PCI support for HiSilicon ACC devices" + depends on ARM64 || (COMPILE_TEST && 64BIT) + depends on VFIO_PCI_CORE + depends on PCI_MSI + depends on CRYPTO_DEV_HISI_QM + depends on CRYPTO_DEV_HISI_HPRE + depends on CRYPTO_DEV_HISI_SEC2 + depends on CRYPTO_DEV_HISI_ZIP + help + This provides generic PCI support for HiSilicon ACC devices + using the VFIO framework. + + If you don't know what to do here, say N. diff --git a/drivers/vfio/pci/hisilicon/Makefile b/drivers/vfio/pci/hisilicon/Makefile new file mode 100644 index 000000000000..c66b3783f2f9 --- /dev/null +++ b/drivers/vfio/pci/hisilicon/Makefile @@ -0,0 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisi-acc-vfio-pci.o +hisi-acc-vfio-pci-y := hisi_acc_vfio_pci.o + diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c new file mode 100644 index 000000000000..767b5d47631a --- /dev/null +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c @@ -0,0 +1,1326 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (c) 2021, HiSilicon Ltd. + */ + +#include <linux/device.h> +#include <linux/eventfd.h> +#include <linux/file.h> +#include <linux/hisi_acc_qm.h> +#include <linux/interrupt.h> +#include <linux/module.h> +#include <linux/pci.h> +#include <linux/vfio.h> +#include <linux/vfio_pci_core.h> +#include <linux/anon_inodes.h> + +#include "hisi_acc_vfio_pci.h" + +/* return 0 on VM acc device ready, -ETIMEDOUT hardware timeout */ +static int qm_wait_dev_not_ready(struct hisi_qm *qm) +{ + u32 val; + + return readl_relaxed_poll_timeout(qm->io_base + QM_VF_STATE, + val, !(val & 0x1), MB_POLL_PERIOD_US, + MB_POLL_TIMEOUT_US); +} + +/* + * Each state Reg is checked 100 times, + * with a delay of 100 microseconds after each check + */ +static u32 qm_check_reg_state(struct hisi_qm *qm, u32 regs) +{ + int check_times = 0; + u32 state; + + state = readl(qm->io_base + regs); + while (state && check_times < ERROR_CHECK_TIMEOUT) { + udelay(CHECK_DELAY_TIME); + state = readl(qm->io_base + regs); + check_times++; + } + + return state; +} + +static int qm_read_regs(struct hisi_qm *qm, u32 reg_addr, + u32 *data, u8 nums) +{ + int i; + + if (nums < 1 || nums > QM_REGS_MAX_LEN) + return -EINVAL; + + for (i = 0; i < nums; i++) { + data[i] = readl(qm->io_base + reg_addr); + reg_addr += QM_REG_ADDR_OFFSET; + } + + return 0; +} + +static int qm_write_regs(struct hisi_qm *qm, u32 reg, + u32 *data, u8 nums) +{ + int i; + + if (nums < 1 || nums > QM_REGS_MAX_LEN) + return -EINVAL; + + for (i = 0; i < nums; i++) + writel(data[i], qm->io_base + reg + i * QM_REG_ADDR_OFFSET); + + return 0; +} + +static int qm_get_vft(struct hisi_qm *qm, u32 *base) +{ + u64 sqc_vft; + u32 qp_num; + int ret; + + ret = hisi_qm_mb(qm, QM_MB_CMD_SQC_VFT_V2, 0, 0, 1); + if (ret) + return ret; + + sqc_vft = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) | + ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) << + QM_XQC_ADDR_OFFSET); + *base = QM_SQC_VFT_BASE_MASK_V2 & (sqc_vft >> QM_SQC_VFT_BASE_SHIFT_V2); + qp_num = (QM_SQC_VFT_NUM_MASK_V2 & + (sqc_vft >> QM_SQC_VFT_NUM_SHIFT_V2)) + 1; + + return qp_num; +} + +static int qm_get_sqc(struct hisi_qm *qm, u64 *addr) +{ + int ret; + + ret = hisi_qm_mb(qm, QM_MB_CMD_SQC_BT, 0, 0, 1); + if (ret) + return ret; + + *addr = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) | + ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) << + QM_XQC_ADDR_OFFSET); + + return 0; +} + +static int qm_get_cqc(struct hisi_qm *qm, u64 *addr) +{ + int ret; + + ret = hisi_qm_mb(qm, QM_MB_CMD_CQC_BT, 0, 0, 1); + if (ret) + return ret; + + *addr = readl(qm->io_base + QM_MB_CMD_DATA_ADDR_L) | + ((u64)readl(qm->io_base + QM_MB_CMD_DATA_ADDR_H) << + QM_XQC_ADDR_OFFSET); + + return 0; +} + +static int qm_get_regs(struct hisi_qm *qm, struct acc_vf_data *vf_data) +{ + struct device *dev = &qm->pdev->dev; + int ret; + + ret = qm_read_regs(qm, QM_VF_AEQ_INT_MASK, &vf_data->aeq_int_mask, 1); + if (ret) { + dev_err(dev, "failed to read QM_VF_AEQ_INT_MASK\n"); + return ret; + } + + ret = qm_read_regs(qm, QM_VF_EQ_INT_MASK, &vf_data->eq_int_mask, 1); + if (ret) { + dev_err(dev, "failed to read QM_VF_EQ_INT_MASK\n"); + return ret; + } + + ret = qm_read_regs(qm, QM_IFC_INT_SOURCE_V, + &vf_data->ifc_int_source, 1); + if (ret) { + dev_err(dev, "failed to read QM_IFC_INT_SOURCE_V\n"); + return ret; + } + + ret = qm_read_regs(qm, QM_IFC_INT_MASK, &vf_data->ifc_int_mask, 1); + if (ret) { + dev_err(dev, "failed to read QM_IFC_INT_MASK\n"); + return ret; + } + + ret = qm_read_regs(qm, QM_IFC_INT_SET_V, &vf_data->ifc_int_set, 1); + if (ret) { + dev_err(dev, "failed to read QM_IFC_INT_SET_V\n"); + return ret; + } + + ret = qm_read_regs(qm, QM_PAGE_SIZE, &vf_data->page_size, 1); + if (ret) { + dev_err(dev, "failed to read QM_PAGE_SIZE\n"); + return ret; + } + + /* QM_EQC_DW has 7 regs */ + ret = qm_read_regs(qm, QM_EQC_DW0, vf_data->qm_eqc_dw, 7); + if (ret) { + dev_err(dev, "failed to read QM_EQC_DW\n"); + return ret; + } + + /* QM_AEQC_DW has 7 regs */ + ret = qm_read_regs(qm, QM_AEQC_DW0, vf_data->qm_aeqc_dw, 7); + if (ret) { + dev_err(dev, "failed to read QM_AEQC_DW\n"); + return ret; + } + + return 0; +} + +static int qm_set_regs(struct hisi_qm *qm, struct acc_vf_data *vf_data) +{ + struct device *dev = &qm->pdev->dev; + int ret; + + /* check VF state */ + if (unlikely(hisi_qm_wait_mb_ready(qm))) { + dev_err(&qm->pdev->dev, "QM device is not ready to write\n"); + return -EBUSY; + } + + ret = qm_write_regs(qm, QM_VF_AEQ_INT_MASK, &vf_data->aeq_int_mask, 1); + if (ret) { + dev_err(dev, "failed to write QM_VF_AEQ_INT_MASK\n"); + return ret; + } + + ret = qm_write_regs(qm, QM_VF_EQ_INT_MASK, &vf_data->eq_int_mask, 1); + if (ret) { + dev_err(dev, "failed to write QM_VF_EQ_INT_MASK\n"); + return ret; + } + + ret = qm_write_regs(qm, QM_IFC_INT_SOURCE_V, + &vf_data->ifc_int_source, 1); + if (ret) { + dev_err(dev, "failed to write QM_IFC_INT_SOURCE_V\n"); + return ret; + } + + ret = qm_write_regs(qm, QM_IFC_INT_MASK, &vf_data->ifc_int_mask, 1); + if (ret) { + dev_err(dev, "failed to write QM_IFC_INT_MASK\n"); + return ret; + } + + ret = qm_write_regs(qm, QM_IFC_INT_SET_V, &vf_data->ifc_int_set, 1); + if (ret) { + dev_err(dev, "failed to write QM_IFC_INT_SET_V\n"); + return ret; + } + + ret = qm_write_regs(qm, QM_QUE_ISO_CFG_V, &vf_data->que_iso_cfg, 1); + if (ret) { + dev_err(dev, "failed to write QM_QUE_ISO_CFG_V\n"); + return ret; + } + + ret = qm_write_regs(qm, QM_PAGE_SIZE, &vf_data->page_size, 1); + if (ret) { + dev_err(dev, "failed to write QM_PAGE_SIZE\n"); + return ret; + } + + /* QM_EQC_DW has 7 regs */ + ret = qm_write_regs(qm, QM_EQC_DW0, vf_data->qm_eqc_dw, 7); + if (ret) { + dev_err(dev, "failed to write QM_EQC_DW\n"); + return ret; + } + + /* QM_AEQC_DW has 7 regs */ + ret = qm_write_regs(qm, QM_AEQC_DW0, vf_data->qm_aeqc_dw, 7); + if (ret) { + dev_err(dev, "failed to write QM_AEQC_DW\n"); + return ret; + } + + return 0; +} + +static void qm_db(struct hisi_qm *qm, u16 qn, u8 cmd, + u16 index, u8 priority) +{ + u64 doorbell; + u64 dbase; + u16 randata = 0; + + if (cmd == QM_DOORBELL_CMD_SQ || cmd == QM_DOORBELL_CMD_CQ) + dbase = QM_DOORBELL_SQ_CQ_BASE_V2; + else + dbase = QM_DOORBELL_EQ_AEQ_BASE_V2; + + doorbell = qn | ((u64)cmd << QM_DB_CMD_SHIFT_V2) | + ((u64)randata << QM_DB_RAND_SHIFT_V2) | + ((u64)index << QM_DB_INDEX_SHIFT_V2) | + ((u64)priority << QM_DB_PRIORITY_SHIFT_V2); + + writeq(doorbell, qm->io_base + dbase); +} + +static int pf_qm_get_qp_num(struct hisi_qm *qm, int vf_id, u32 *rbase) +{ + unsigned int val; + u64 sqc_vft; + u32 qp_num; + int ret; + + ret = readl_relaxed_poll_timeout(qm->io_base + QM_VFT_CFG_RDY, val, + val & BIT(0), MB_POLL_PERIOD_US, + MB_POLL_TIMEOUT_US); + if (ret) + return ret; + + writel(0x1, qm->io_base + QM_VFT_CFG_OP_WR); + /* 0 mean SQC VFT */ + writel(0x0, qm->io_base + QM_VFT_CFG_TYPE); + writel(vf_id, qm->io_base + QM_VFT_CFG); + + writel(0x0, qm->io_base + QM_VFT_CFG_RDY); + writel(0x1, qm->io_base + QM_VFT_CFG_OP_ENABLE); + + ret = readl_relaxed_poll_timeout(qm->io_base + QM_VFT_CFG_RDY, val, + val & BIT(0), MB_POLL_PERIOD_US, + MB_POLL_TIMEOUT_US); + if (ret) + return ret; + + sqc_vft = readl(qm->io_base + QM_VFT_CFG_DATA_L) | + ((u64)readl(qm->io_base + QM_VFT_CFG_DATA_H) << + QM_XQC_ADDR_OFFSET); + *rbase = QM_SQC_VFT_BASE_MASK_V2 & + (sqc_vft >> QM_SQC_VFT_BASE_SHIFT_V2); + qp_num = (QM_SQC_VFT_NUM_MASK_V2 & + (sqc_vft >> QM_SQC_VFT_NUM_SHIFT_V2)) + 1; + + return qp_num; +} + +static void qm_dev_cmd_init(struct hisi_qm *qm) +{ + /* Clear VF communication status registers. */ + writel(0x1, qm->io_base + QM_IFC_INT_SOURCE_V); + + /* Enable pf and vf communication. */ + writel(0x0, qm->io_base + QM_IFC_INT_MASK); +} + +static int vf_qm_cache_wb(struct hisi_qm *qm) +{ + unsigned int val; + + writel(0x1, qm->io_base + QM_CACHE_WB_START); + if (readl_relaxed_poll_timeout(qm->io_base + QM_CACHE_WB_DONE, + val, val & BIT(0), MB_POLL_PERIOD_US, + MB_POLL_TIMEOUT_US)) { + dev_err(&qm->pdev->dev, "vf QM writeback sqc cache fail\n"); + return -EINVAL; + } + + return 0; +} + +static void vf_qm_fun_reset(struct hisi_acc_vf_core_device *hisi_acc_vdev, + struct hisi_qm *qm) +{ + int i; + + for (i = 0; i < qm->qp_num; i++) + qm_db(qm, i, QM_DOORBELL_CMD_SQ, 0, 1); +} + +static int vf_qm_func_stop(struct hisi_qm *qm) +{ + return hisi_qm_mb(qm, QM_MB_CMD_PAUSE_QM, 0, 0, 0); +} + +static int vf_qm_check_match(struct hisi_acc_vf_core_device *hisi_acc_vdev, + struct hisi_acc_vf_migration_file *migf) +{ + struct acc_vf_data *vf_data = &migf->vf_data; + struct hisi_qm *vf_qm = &hisi_acc_vdev->vf_qm; + struct hisi_qm *pf_qm = hisi_acc_vdev->pf_qm; + struct device *dev = &vf_qm->pdev->dev; + u32 que_iso_state; + int ret; + + if (migf->total_length < QM_MATCH_SIZE) + return -EINVAL; + + if (vf_data->acc_magic != ACC_DEV_MAGIC) { + dev_err(dev, "failed to match ACC_DEV_MAGIC\n"); + return -EINVAL; + } + + if (vf_data->dev_id != hisi_acc_vdev->vf_dev->device) { + dev_err(dev, "failed to match VF devices\n"); + return -EINVAL; + } + + /* vf qp num check */ + ret = qm_get_vft(vf_qm, &vf_qm->qp_base); + if (ret <= 0) { + dev_err(dev, "failed to get vft qp nums\n"); + return -EINVAL; + } + + if (ret != vf_data->qp_num) { + dev_err(dev, "failed to match VF qp num\n"); + return -EINVAL; + } + + vf_qm->qp_num = ret; + + /* vf isolation state check */ + ret = qm_read_regs(pf_qm, QM_QUE_ISO_CFG_V, &que_iso_state, 1); + if (ret) { + dev_err(dev, "failed to read QM_QUE_ISO_CFG_V\n"); + return ret; + } + + if (vf_data->que_iso_cfg != que_iso_state) { + dev_err(dev, "failed to match isolation state\n"); + return ret; + } + + ret = qm_write_regs(vf_qm, QM_VF_STATE, &vf_data->vf_qm_state, 1); + if (ret) { + dev_err(dev, "failed to write QM_VF_STATE\n"); + return ret; + } + + hisi_acc_vdev->vf_qm_state = vf_data->vf_qm_state; + return 0; +} + +static int vf_qm_get_match_data(struct hisi_acc_vf_core_device *hisi_acc_vdev, + struct acc_vf_data *vf_data) +{ + struct hisi_qm *pf_qm = hisi_acc_vdev->pf_qm; + struct device *dev = &pf_qm->pdev->dev; + int vf_id = hisi_acc_vdev->vf_id; + int ret; + + vf_data->acc_magic = ACC_DEV_MAGIC; + /* save device id */ + vf_data->dev_id = hisi_acc_vdev->vf_dev->device; + + /* vf qp num save from PF */ + ret = pf_qm_get_qp_num(pf_qm, vf_id, &vf_data->qp_base); + if (ret <= 0) { + dev_err(dev, "failed to get vft qp nums!\n"); + return -EINVAL; + } + + vf_data->qp_num = ret; + + /* VF isolation state save from PF */ + ret = qm_read_regs(pf_qm, QM_QUE_ISO_CFG_V, &vf_data->que_iso_cfg, 1); + if (ret) { + dev_err(dev, "failed to read QM_QUE_ISO_CFG_V!\n"); + return ret; + } + + return 0; +} + +static int vf_qm_load_data(struct hisi_acc_vf_core_device *hisi_acc_vdev, + struct hisi_acc_vf_migration_file *migf) +{ + struct hisi_qm *qm = &hisi_acc_vdev->vf_qm; + struct device *dev = &qm->pdev->dev; + struct acc_vf_data *vf_data = &migf->vf_data; + int ret; + + /* Return if only match data was transferred */ + if (migf->total_length == QM_MATCH_SIZE) + return 0; + + if (migf->total_length < sizeof(struct acc_vf_data)) + return -EINVAL; + + qm->eqe_dma = vf_data->eqe_dma; + qm->aeqe_dma = vf_data->aeqe_dma; + qm->sqc_dma = vf_data->sqc_dma; + qm->cqc_dma = vf_data->cqc_dma; + + qm->qp_base = vf_data->qp_base; + qm->qp_num = vf_data->qp_num; + + ret = qm_set_regs(qm, vf_data); + if (ret) { + dev_err(dev, "Set VF regs failed\n"); + return ret; + } + + ret = hisi_qm_mb(qm, QM_MB_CMD_SQC_BT, qm->sqc_dma, 0, 0); + if (ret) { + dev_err(dev, "Set sqc failed\n"); + return ret; + } + + ret = hisi_qm_mb(qm, QM_MB_CMD_CQC_BT, qm->cqc_dma, 0, 0); + if (ret) { + dev_err(dev, "Set cqc failed\n"); + return ret; + } + + qm_dev_cmd_init(qm); + return 0; +} + +static int vf_qm_state_save(struct hisi_acc_vf_core_device *hisi_acc_vdev, + struct hisi_acc_vf_migration_file *migf) +{ + struct acc_vf_data *vf_data = &migf->vf_data; + struct hisi_qm *vf_qm = &hisi_acc_vdev->vf_qm; + struct device *dev = &vf_qm->pdev->dev; + int ret; + + ret = vf_qm_get_match_data(hisi_acc_vdev, vf_data); + if (ret) + return ret; + + if (unlikely(qm_wait_dev_not_ready(vf_qm))) { + /* Update state and return with match data */ + vf_data->vf_qm_state = QM_NOT_READY; + hisi_acc_vdev->vf_qm_state = vf_data->vf_qm_state; + migf->total_length = QM_MATCH_SIZE; + return 0; + } + + vf_data->vf_qm_state = QM_READY; + hisi_acc_vdev->vf_qm_state = vf_data->vf_qm_state; + + ret = vf_qm_cache_wb(vf_qm); + if (ret) { + dev_err(dev, "failed to writeback QM Cache!\n"); + return ret; + } + + ret = qm_get_regs(vf_qm, vf_data); + if (ret) + return -EINVAL; + + /* Every reg is 32 bit, the dma address is 64 bit. */ + vf_data->eqe_dma = vf_data->qm_eqc_dw[2]; + vf_data->eqe_dma <<= QM_XQC_ADDR_OFFSET; + vf_data->eqe_dma |= vf_data->qm_eqc_dw[1]; + vf_data->aeqe_dma = vf_data->qm_aeqc_dw[2]; + vf_data->aeqe_dma <<= QM_XQC_ADDR_OFFSET; + vf_data->aeqe_dma |= vf_data->qm_aeqc_dw[1]; + + /* Through SQC_BT/CQC_BT to get sqc and cqc address */ + ret = qm_get_sqc(vf_qm, &vf_data->sqc_dma); + if (ret) { + dev_err(dev, "failed to read SQC addr!\n"); + return -EINVAL; + } + + ret = qm_get_cqc(vf_qm, &vf_data->cqc_dma); + if (ret) { + dev_err(dev, "failed to read CQC addr!\n"); + return -EINVAL; + } + + migf->total_length = sizeof(struct acc_vf_data); + return 0; +} + +/* Check the PF's RAS state and Function INT state */ +static int +hisi_acc_check_int_state(struct hisi_acc_vf_core_device *hisi_acc_vdev) +{ + struct hisi_qm *vfqm = &hisi_acc_vdev->vf_qm; + struct hisi_qm *qm = hisi_acc_vdev->pf_qm; + struct pci_dev *vf_pdev = hisi_acc_vdev->vf_dev; + struct device *dev = &qm->pdev->dev; + u32 state; + + /* Check RAS state */ + state = qm_check_reg_state(qm, QM_ABNORMAL_INT_STATUS); + if (state) { + dev_err(dev, "failed to check QM RAS state!\n"); + return -EBUSY; + } + + /* Check Function Communication state between PF and VF */ + state = qm_check_reg_state(vfqm, QM_IFC_INT_STATUS); + if (state) { + dev_err(dev, "failed to check QM IFC INT state!\n"); + return -EBUSY; + } + state = qm_check_reg_state(vfqm, QM_IFC_INT_SET_V); + if (state) { + dev_err(dev, "failed to check QM IFC INT SET state!\n"); + return -EBUSY; + } + + /* Check submodule task state */ + switch (vf_pdev->device) { + case PCI_DEVICE_ID_HUAWEI_SEC_VF: + state = qm_check_reg_state(qm, SEC_CORE_INT_STATUS); + if (state) { + dev_err(dev, "failed to check QM SEC Core INT state!\n"); + return -EBUSY; + } + return 0; + case PCI_DEVICE_ID_HUAWEI_HPRE_VF: + state = qm_check_reg_state(qm, HPRE_HAC_INT_STATUS); + if (state) { + dev_err(dev, "failed to check QM HPRE HAC INT state!\n"); + return -EBUSY; + } + return 0; + case PCI_DEVICE_ID_HUAWEI_ZIP_VF: + state = qm_check_reg_state(qm, HZIP_CORE_INT_STATUS); + if (state) { + dev_err(dev, "failed to check QM ZIP Core INT state!\n"); + return -EBUSY; + } + return 0; + default: + dev_err(dev, "failed to detect acc module type!\n"); + return -EINVAL; + } +} + +static void hisi_acc_vf_disable_fd(struct hisi_acc_vf_migration_file *migf) +{ + mutex_lock(&migf->lock); + migf->disabled = true; + migf->total_length = 0; + migf->filp->f_pos = 0; + mutex_unlock(&migf->lock); +} + +static void hisi_acc_vf_disable_fds(struct hisi_acc_vf_core_device *hisi_acc_vdev) +{ + if (hisi_acc_vdev->resuming_migf) { + hisi_acc_vf_disable_fd(hisi_acc_vdev->resuming_migf); + fput(hisi_acc_vdev->resuming_migf->filp); + hisi_acc_vdev->resuming_migf = NULL; + } + + if (hisi_acc_vdev->saving_migf) { + hisi_acc_vf_disable_fd(hisi_acc_vdev->saving_migf); + fput(hisi_acc_vdev->saving_migf->filp); + hisi_acc_vdev->saving_migf = NULL; + } +} + +/* + * This function is called in all state_mutex unlock cases to + * handle a 'deferred_reset' if exists. + */ +static void +hisi_acc_vf_state_mutex_unlock(struct hisi_acc_vf_core_device *hisi_acc_vdev) +{ +again: + spin_lock(&hisi_acc_vdev->reset_lock); + if (hisi_acc_vdev->deferred_reset) { + hisi_acc_vdev->deferred_reset = false; + spin_unlock(&hisi_acc_vdev->reset_lock); + hisi_acc_vdev->vf_qm_state = QM_NOT_READY; + hisi_acc_vdev->mig_state = VFIO_DEVICE_STATE_RUNNING; + hisi_acc_vf_disable_fds(hisi_acc_vdev); + goto again; + } + mutex_unlock(&hisi_acc_vdev->state_mutex); + spin_unlock(&hisi_acc_vdev->reset_lock); +} + +static void hisi_acc_vf_start_device(struct hisi_acc_vf_core_device *hisi_acc_vdev) +{ + struct hisi_qm *vf_qm = &hisi_acc_vdev->vf_qm; + + if (hisi_acc_vdev->vf_qm_state != QM_READY) + return; + + vf_qm_fun_reset(hisi_acc_vdev, vf_qm); +} + +static int hisi_acc_vf_load_state(struct hisi_acc_vf_core_device *hisi_acc_vdev) +{ + struct device *dev = &hisi_acc_vdev->vf_dev->dev; + struct hisi_acc_vf_migration_file *migf = hisi_acc_vdev->resuming_migf; + int ret; + + /* Check dev compatibility */ + ret = vf_qm_check_match(hisi_acc_vdev, migf); + if (ret) { + dev_err(dev, "failed to match the VF!\n"); + return ret; + } + /* Recover data to VF */ + ret = vf_qm_load_data(hisi_acc_vdev, migf); + if (ret) { + dev_err(dev, "failed to recover the VF!\n"); + return ret; + } + + return 0; +} + +static int hisi_acc_vf_release_file(struct inode *inode, struct file *filp) +{ + struct hisi_acc_vf_migration_file *migf = filp->private_data; + + hisi_acc_vf_disable_fd(migf); + mutex_destroy(&migf->lock); + kfree(migf); + return 0; +} + +static ssize_t hisi_acc_vf_resume_write(struct file *filp, const char __user *buf, + size_t len, loff_t *pos) +{ + struct hisi_acc_vf_migration_file *migf = filp->private_data; + loff_t requested_length; + ssize_t done = 0; + int ret; + + if (pos) + return -ESPIPE; + pos = &filp->f_pos; + + if (*pos < 0 || + check_add_overflow((loff_t)len, *pos, &requested_length)) + return -EINVAL; + + if (requested_length > sizeof(struct acc_vf_data)) + return -ENOMEM; + + mutex_lock(&migf->lock); + if (migf->disabled) { + done = -ENODEV; + goto out_unlock; + } + + ret = copy_from_user(&migf->vf_data, buf, len); + if (ret) { + done = -EFAULT; + goto out_unlock; + } + *pos += len; + done = len; + migf->total_length += len; +out_unlock: + mutex_unlock(&migf->lock); + return done; +} + +static const struct file_operations hisi_acc_vf_resume_fops = { + .owner = THIS_MODULE, + .write = hisi_acc_vf_resume_write, + .release = hisi_acc_vf_release_file, + .llseek = no_llseek, +}; + +static struct hisi_acc_vf_migration_file * +hisi_acc_vf_pci_resume(struct hisi_acc_vf_core_device *hisi_acc_vdev) +{ + struct hisi_acc_vf_migration_file *migf; + + migf = kzalloc(sizeof(*migf), GFP_KERNEL); + if (!migf) + return ERR_PTR(-ENOMEM); + + migf->filp = anon_inode_getfile("hisi_acc_vf_mig", &hisi_acc_vf_resume_fops, migf, + O_WRONLY); + if (IS_ERR(migf->filp)) { + int err = PTR_ERR(migf->filp); + + kfree(migf); + return ERR_PTR(err); + } + + stream_open(migf->filp->f_inode, migf->filp); + mutex_init(&migf->lock); + return migf; +} + +static ssize_t hisi_acc_vf_save_read(struct file *filp, char __user *buf, size_t len, + loff_t *pos) +{ + struct hisi_acc_vf_migration_file *migf = filp->private_data; + ssize_t done = 0; + int ret; + + if (pos) + return -ESPIPE; + pos = &filp->f_pos; + + mutex_lock(&migf->lock); + if (*pos > migf->total_length) { + done = -EINVAL; + goto out_unlock; + } + + if (migf->disabled) { + done = -ENODEV; + goto out_unlock; + } + + len = min_t(size_t, migf->total_length - *pos, len); + if (len) { + ret = copy_to_user(buf, &migf->vf_data, len); + if (ret) { + done = -EFAULT; + goto out_unlock; + } + *pos += len; + done = len; + } +out_unlock: + mutex_unlock(&migf->lock); + return done; +} + +static const struct file_operations hisi_acc_vf_save_fops = { + .owner = THIS_MODULE, + .read = hisi_acc_vf_save_read, + .release = hisi_acc_vf_release_file, + .llseek = no_llseek, +}; + +static struct hisi_acc_vf_migration_file * +hisi_acc_vf_stop_copy(struct hisi_acc_vf_core_device *hisi_acc_vdev) +{ + struct hisi_acc_vf_migration_file *migf; + int ret; + + migf = kzalloc(sizeof(*migf), GFP_KERNEL); + if (!migf) + return ERR_PTR(-ENOMEM); + + migf->filp = anon_inode_getfile("hisi_acc_vf_mig", &hisi_acc_vf_save_fops, migf, + O_RDONLY); + if (IS_ERR(migf->filp)) { + int err = PTR_ERR(migf->filp); + + kfree(migf); + return ERR_PTR(err); + } + + stream_open(migf->filp->f_inode, migf->filp); + mutex_init(&migf->lock); + + ret = vf_qm_state_save(hisi_acc_vdev, migf); + if (ret) { + fput(migf->filp); + return ERR_PTR(ret); + } + + return migf; +} + +static int hisi_acc_vf_stop_device(struct hisi_acc_vf_core_device *hisi_acc_vdev) +{ + struct device *dev = &hisi_acc_vdev->vf_dev->dev; + struct hisi_qm *vf_qm = &hisi_acc_vdev->vf_qm; + int ret; + + ret = vf_qm_func_stop(vf_qm); + if (ret) { + dev_err(dev, "failed to stop QM VF function!\n"); + return ret; + } + + ret = hisi_acc_check_int_state(hisi_acc_vdev); + if (ret) { + dev_err(dev, "failed to check QM INT state!\n"); + return ret; + } + return 0; +} + +static struct file * +hisi_acc_vf_set_device_state(struct hisi_acc_vf_core_device *hisi_acc_vdev, + u32 new) +{ + u32 cur = hisi_acc_vdev->mig_state; + int ret; + + if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_STOP) { + ret = hisi_acc_vf_stop_device(hisi_acc_vdev); + if (ret) + return ERR_PTR(ret); + return NULL; + } + + if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) { + struct hisi_acc_vf_migration_file *migf; + + migf = hisi_acc_vf_stop_copy(hisi_acc_vdev); + if (IS_ERR(migf)) + return ERR_CAST(migf); + get_file(migf->filp); + hisi_acc_vdev->saving_migf = migf; + return migf->filp; + } + + if ((cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP)) { + hisi_acc_vf_disable_fds(hisi_acc_vdev); + return NULL; + } + + if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RESUMING) { + struct hisi_acc_vf_migration_file *migf; + + migf = hisi_acc_vf_pci_resume(hisi_acc_vdev); + if (IS_ERR(migf)) + return ERR_CAST(migf); + get_file(migf->filp); + hisi_acc_vdev->resuming_migf = migf; + return migf->filp; + } + + if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) { + ret = hisi_acc_vf_load_state(hisi_acc_vdev); + if (ret) + return ERR_PTR(ret); + hisi_acc_vf_disable_fds(hisi_acc_vdev); + return NULL; + } + + if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING) { + hisi_acc_vf_start_device(hisi_acc_vdev); + return NULL; + } + + /* + * vfio_mig_get_next_state() does not use arcs other than the above + */ + WARN_ON(true); + return ERR_PTR(-EINVAL); +} + +static struct file * +hisi_acc_vfio_pci_set_device_state(struct vfio_device *vdev, + enum vfio_device_mig_state new_state) +{ + struct hisi_acc_vf_core_device *hisi_acc_vdev = container_of(vdev, + struct hisi_acc_vf_core_device, core_device.vdev); + enum vfio_device_mig_state next_state; + struct file *res = NULL; + int ret; + + mutex_lock(&hisi_acc_vdev->state_mutex); + while (new_state != hisi_acc_vdev->mig_state) { + ret = vfio_mig_get_next_state(vdev, + hisi_acc_vdev->mig_state, + new_state, &next_state); + if (ret) { + res = ERR_PTR(-EINVAL); + break; + } + + res = hisi_acc_vf_set_device_state(hisi_acc_vdev, next_state); + if (IS_ERR(res)) + break; + hisi_acc_vdev->mig_state = next_state; + if (WARN_ON(res && new_state != hisi_acc_vdev->mig_state)) { + fput(res); + res = ERR_PTR(-EINVAL); + break; + } + } + hisi_acc_vf_state_mutex_unlock(hisi_acc_vdev); + return res; +} + +static int +hisi_acc_vfio_pci_get_device_state(struct vfio_device *vdev, + enum vfio_device_mig_state *curr_state) +{ + struct hisi_acc_vf_core_device *hisi_acc_vdev = container_of(vdev, + struct hisi_acc_vf_core_device, core_device.vdev); + + mutex_lock(&hisi_acc_vdev->state_mutex); + *curr_state = hisi_acc_vdev->mig_state; + hisi_acc_vf_state_mutex_unlock(hisi_acc_vdev); + return 0; +} + +static void hisi_acc_vf_pci_aer_reset_done(struct pci_dev *pdev) +{ + struct hisi_acc_vf_core_device *hisi_acc_vdev = dev_get_drvdata(&pdev->dev); + + if (hisi_acc_vdev->core_device.vdev.migration_flags != + VFIO_MIGRATION_STOP_COPY) + return; + + /* + * As the higher VFIO layers are holding locks across reset and using + * those same locks with the mm_lock we need to prevent ABBA deadlock + * with the state_mutex and mm_lock. + * In case the state_mutex was taken already we defer the cleanup work + * to the unlock flow of the other running context. + */ + spin_lock(&hisi_acc_vdev->reset_lock); + hisi_acc_vdev->deferred_reset = true; + if (!mutex_trylock(&hisi_acc_vdev->state_mutex)) { + spin_unlock(&hisi_acc_vdev->reset_lock); + return; + } + spin_unlock(&hisi_acc_vdev->reset_lock); + hisi_acc_vf_state_mutex_unlock(hisi_acc_vdev); +} + +static int hisi_acc_vf_qm_init(struct hisi_acc_vf_core_device *hisi_acc_vdev) +{ + struct vfio_pci_core_device *vdev = &hisi_acc_vdev->core_device; + struct hisi_qm *vf_qm = &hisi_acc_vdev->vf_qm; + struct pci_dev *vf_dev = vdev->pdev; + + /* + * ACC VF dev BAR2 region consists of both functional register space + * and migration control register space. For migration to work, we + * need access to both. Hence, we map the entire BAR2 region here. + * But unnecessarily exposing the migration BAR region to the Guest + * has the potential to prevent/corrupt the Guest migration. Hence, + * we restrict access to the migration control space from + * Guest(Please see mmap/ioctl/read/write override functions). + * + * Please note that it is OK to expose the entire VF BAR if migration + * is not supported or required as this cannot affect the ACC PF + * configurations. + * + * Also the HiSilicon ACC VF devices supported by this driver on + * HiSilicon hardware platforms are integrated end point devices + * and the platform lacks the capability to perform any PCIe P2P + * between these devices. + */ + + vf_qm->io_base = + ioremap(pci_resource_start(vf_dev, VFIO_PCI_BAR2_REGION_INDEX), + pci_resource_len(vf_dev, VFIO_PCI_BAR2_REGION_INDEX)); + if (!vf_qm->io_base) + return -EIO; + + vf_qm->fun_type = QM_HW_VF; + vf_qm->pdev = vf_dev; + mutex_init(&vf_qm->mailbox_lock); + + return 0; +} + +static struct hisi_qm *hisi_acc_get_pf_qm(struct pci_dev *pdev) +{ + struct hisi_qm *pf_qm; + struct pci_driver *pf_driver; + + if (!pdev->is_virtfn) + return NULL; + + switch (pdev->device) { + case PCI_DEVICE_ID_HUAWEI_SEC_VF: + pf_driver = hisi_sec_get_pf_driver(); + break; + case PCI_DEVICE_ID_HUAWEI_HPRE_VF: + pf_driver = hisi_hpre_get_pf_driver(); + break; + case PCI_DEVICE_ID_HUAWEI_ZIP_VF: + pf_driver = hisi_zip_get_pf_driver(); + break; + default: + return NULL; + } + + if (!pf_driver) + return NULL; + + pf_qm = pci_iov_get_pf_drvdata(pdev, pf_driver); + + return !IS_ERR(pf_qm) ? pf_qm : NULL; +} + +static int hisi_acc_pci_rw_access_check(struct vfio_device *core_vdev, + size_t count, loff_t *ppos, + size_t *new_count) +{ + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos); + struct vfio_pci_core_device *vdev = + container_of(core_vdev, struct vfio_pci_core_device, vdev); + + if (index == VFIO_PCI_BAR2_REGION_INDEX) { + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK; + resource_size_t end = pci_resource_len(vdev->pdev, index) / 2; + + /* Check if access is for migration control region */ + if (pos >= end) + return -EINVAL; + + *new_count = min(count, (size_t)(end - pos)); + } + + return 0; +} + +static int hisi_acc_vfio_pci_mmap(struct vfio_device *core_vdev, + struct vm_area_struct *vma) +{ + struct vfio_pci_core_device *vdev = + container_of(core_vdev, struct vfio_pci_core_device, vdev); + unsigned int index; + + index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT); + if (index == VFIO_PCI_BAR2_REGION_INDEX) { + u64 req_len, pgoff, req_start; + resource_size_t end = pci_resource_len(vdev->pdev, index) / 2; + + req_len = vma->vm_end - vma->vm_start; + pgoff = vma->vm_pgoff & + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1); + req_start = pgoff << PAGE_SHIFT; + + if (req_start + req_len > end) + return -EINVAL; + } + + return vfio_pci_core_mmap(core_vdev, vma); +} + +static ssize_t hisi_acc_vfio_pci_write(struct vfio_device *core_vdev, + const char __user *buf, size_t count, + loff_t *ppos) +{ + size_t new_count = count; + int ret; + + ret = hisi_acc_pci_rw_access_check(core_vdev, count, ppos, &new_count); + if (ret) + return ret; + + return vfio_pci_core_write(core_vdev, buf, new_count, ppos); +} + +static ssize_t hisi_acc_vfio_pci_read(struct vfio_device *core_vdev, + char __user *buf, size_t count, + loff_t *ppos) +{ + size_t new_count = count; + int ret; + + ret = hisi_acc_pci_rw_access_check(core_vdev, count, ppos, &new_count); + if (ret) + return ret; + + return vfio_pci_core_read(core_vdev, buf, new_count, ppos); +} + +static long hisi_acc_vfio_pci_ioctl(struct vfio_device *core_vdev, unsigned int cmd, + unsigned long arg) +{ + if (cmd == VFIO_DEVICE_GET_REGION_INFO) { + struct vfio_pci_core_device *vdev = + container_of(core_vdev, struct vfio_pci_core_device, vdev); + struct pci_dev *pdev = vdev->pdev; + struct vfio_region_info info; + unsigned long minsz; + + minsz = offsetofend(struct vfio_region_info, offset); + + if (copy_from_user(&info, (void __user *)arg, minsz)) + return -EFAULT; + + if (info.argsz < minsz) + return -EINVAL; + + if (info.index == VFIO_PCI_BAR2_REGION_INDEX) { + info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index); + + /* + * ACC VF dev BAR2 region consists of both functional + * register space and migration control register space. + * Report only the functional region to Guest. + */ + info.size = pci_resource_len(pdev, info.index) / 2; + + info.flags = VFIO_REGION_INFO_FLAG_READ | + VFIO_REGION_INFO_FLAG_WRITE | + VFIO_REGION_INFO_FLAG_MMAP; + + return copy_to_user((void __user *)arg, &info, minsz) ? + -EFAULT : 0; + } + } + return vfio_pci_core_ioctl(core_vdev, cmd, arg); +} + +static int hisi_acc_vfio_pci_open_device(struct vfio_device *core_vdev) +{ + struct hisi_acc_vf_core_device *hisi_acc_vdev = container_of(core_vdev, + struct hisi_acc_vf_core_device, core_device.vdev); + struct vfio_pci_core_device *vdev = &hisi_acc_vdev->core_device; + int ret; + + ret = vfio_pci_core_enable(vdev); + if (ret) + return ret; + + if (core_vdev->ops->migration_set_state) { + ret = hisi_acc_vf_qm_init(hisi_acc_vdev); + if (ret) { + vfio_pci_core_disable(vdev); + return ret; + } + hisi_acc_vdev->mig_state = VFIO_DEVICE_STATE_RUNNING; + } + + vfio_pci_core_finish_enable(vdev); + return 0; +} + +static void hisi_acc_vfio_pci_close_device(struct vfio_device *core_vdev) +{ + struct hisi_acc_vf_core_device *hisi_acc_vdev = container_of(core_vdev, + struct hisi_acc_vf_core_device, core_device.vdev); + struct hisi_qm *vf_qm = &hisi_acc_vdev->vf_qm; + + iounmap(vf_qm->io_base); + vfio_pci_core_close_device(core_vdev); +} + +static const struct vfio_device_ops hisi_acc_vfio_pci_migrn_ops = { + .name = "hisi-acc-vfio-pci-migration", + .open_device = hisi_acc_vfio_pci_open_device, + .close_device = hisi_acc_vfio_pci_close_device, + .ioctl = hisi_acc_vfio_pci_ioctl, + .device_feature = vfio_pci_core_ioctl_feature, + .read = hisi_acc_vfio_pci_read, + .write = hisi_acc_vfio_pci_write, + .mmap = hisi_acc_vfio_pci_mmap, + .request = vfio_pci_core_request, + .match = vfio_pci_core_match, + .migration_set_state = hisi_acc_vfio_pci_set_device_state, + .migration_get_state = hisi_acc_vfio_pci_get_device_state, +}; + +static const struct vfio_device_ops hisi_acc_vfio_pci_ops = { + .name = "hisi-acc-vfio-pci", + .open_device = hisi_acc_vfio_pci_open_device, + .close_device = vfio_pci_core_close_device, + .ioctl = vfio_pci_core_ioctl, + .device_feature = vfio_pci_core_ioctl_feature, + .read = vfio_pci_core_read, + .write = vfio_pci_core_write, + .mmap = vfio_pci_core_mmap, + .request = vfio_pci_core_request, + .match = vfio_pci_core_match, +}; + +static int +hisi_acc_vfio_pci_migrn_init(struct hisi_acc_vf_core_device *hisi_acc_vdev, + struct pci_dev *pdev, struct hisi_qm *pf_qm) +{ + int vf_id; + + vf_id = pci_iov_vf_id(pdev); + if (vf_id < 0) + return vf_id; + + hisi_acc_vdev->vf_id = vf_id + 1; + hisi_acc_vdev->core_device.vdev.migration_flags = + VFIO_MIGRATION_STOP_COPY; + hisi_acc_vdev->pf_qm = pf_qm; + hisi_acc_vdev->vf_dev = pdev; + mutex_init(&hisi_acc_vdev->state_mutex); + + return 0; +} + +static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id) +{ + struct hisi_acc_vf_core_device *hisi_acc_vdev; + struct hisi_qm *pf_qm; + int ret; + + hisi_acc_vdev = kzalloc(sizeof(*hisi_acc_vdev), GFP_KERNEL); + if (!hisi_acc_vdev) + return -ENOMEM; + + pf_qm = hisi_acc_get_pf_qm(pdev); + if (pf_qm && pf_qm->ver >= QM_HW_V3) { + ret = hisi_acc_vfio_pci_migrn_init(hisi_acc_vdev, pdev, pf_qm); + if (!ret) { + vfio_pci_core_init_device(&hisi_acc_vdev->core_device, pdev, + &hisi_acc_vfio_pci_migrn_ops); + } else { + pci_warn(pdev, "migration support failed, continue with generic interface\n"); + vfio_pci_core_init_device(&hisi_acc_vdev->core_device, pdev, + &hisi_acc_vfio_pci_ops); + } + } else { + vfio_pci_core_init_device(&hisi_acc_vdev->core_device, pdev, + &hisi_acc_vfio_pci_ops); + } + + ret = vfio_pci_core_register_device(&hisi_acc_vdev->core_device); + if (ret) + goto out_free; + + dev_set_drvdata(&pdev->dev, hisi_acc_vdev); + return 0; + +out_free: + vfio_pci_core_uninit_device(&hisi_acc_vdev->core_device); + kfree(hisi_acc_vdev); + return ret; +} + +static void hisi_acc_vfio_pci_remove(struct pci_dev *pdev) +{ + struct hisi_acc_vf_core_device *hisi_acc_vdev = dev_get_drvdata(&pdev->dev); + + vfio_pci_core_unregister_device(&hisi_acc_vdev->core_device); + vfio_pci_core_uninit_device(&hisi_acc_vdev->core_device); + kfree(hisi_acc_vdev); +} + +static const struct pci_device_id hisi_acc_vfio_pci_table[] = { + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HUAWEI_SEC_VF) }, + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HUAWEI_HPRE_VF) }, + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_HUAWEI, PCI_DEVICE_ID_HUAWEI_ZIP_VF) }, + { } +}; + +MODULE_DEVICE_TABLE(pci, hisi_acc_vfio_pci_table); + +static const struct pci_error_handlers hisi_acc_vf_err_handlers = { + .reset_done = hisi_acc_vf_pci_aer_reset_done, + .error_detected = vfio_pci_core_aer_err_detected, +}; + +static struct pci_driver hisi_acc_vfio_pci_driver = { + .name = KBUILD_MODNAME, + .id_table = hisi_acc_vfio_pci_table, + .probe = hisi_acc_vfio_pci_probe, + .remove = hisi_acc_vfio_pci_remove, + .err_handler = &hisi_acc_vf_err_handlers, +}; + +module_pci_driver(hisi_acc_vfio_pci_driver); + +MODULE_LICENSE("GPL v2"); +MODULE_AUTHOR("Liu Longfang <liulongfang@huawei.com>"); +MODULE_AUTHOR("Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>"); +MODULE_DESCRIPTION("HiSilicon VFIO PCI - VFIO PCI driver with live migration support for HiSilicon ACC device family"); diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h new file mode 100644 index 000000000000..5494f4983bbe --- /dev/null +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.h @@ -0,0 +1,116 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (c) 2021 HiSilicon Ltd. */ + +#ifndef HISI_ACC_VFIO_PCI_H +#define HISI_ACC_VFIO_PCI_H + +#include <linux/hisi_acc_qm.h> + +#define MB_POLL_PERIOD_US 10 +#define MB_POLL_TIMEOUT_US 1000 +#define QM_CACHE_WB_START 0x204 +#define QM_CACHE_WB_DONE 0x208 +#define QM_MB_CMD_PAUSE_QM 0xe +#define QM_ABNORMAL_INT_STATUS 0x100008 +#define QM_IFC_INT_STATUS 0x0028 +#define SEC_CORE_INT_STATUS 0x301008 +#define HPRE_HAC_INT_STATUS 0x301800 +#define HZIP_CORE_INT_STATUS 0x3010AC +#define QM_QUE_ISO_CFG 0x301154 + +#define QM_VFT_CFG_RDY 0x10006c +#define QM_VFT_CFG_OP_WR 0x100058 +#define QM_VFT_CFG_TYPE 0x10005c +#define QM_VFT_CFG 0x100060 +#define QM_VFT_CFG_OP_ENABLE 0x100054 +#define QM_VFT_CFG_DATA_L 0x100064 +#define QM_VFT_CFG_DATA_H 0x100068 + +#define ERROR_CHECK_TIMEOUT 100 +#define CHECK_DELAY_TIME 100 + +#define QM_SQC_VFT_BASE_SHIFT_V2 28 +#define QM_SQC_VFT_BASE_MASK_V2 GENMASK(15, 0) +#define QM_SQC_VFT_NUM_SHIFT_V2 45 +#define QM_SQC_VFT_NUM_MASK_V2 GENMASK(9, 0) + +/* RW regs */ +#define QM_REGS_MAX_LEN 7 +#define QM_REG_ADDR_OFFSET 0x0004 + +#define QM_XQC_ADDR_OFFSET 32U +#define QM_VF_AEQ_INT_MASK 0x0004 +#define QM_VF_EQ_INT_MASK 0x000c +#define QM_IFC_INT_SOURCE_V 0x0020 +#define QM_IFC_INT_MASK 0x0024 +#define QM_IFC_INT_SET_V 0x002c +#define QM_QUE_ISO_CFG_V 0x0030 +#define QM_PAGE_SIZE 0x0034 + +#define QM_EQC_DW0 0X8000 +#define QM_AEQC_DW0 0X8020 + +struct acc_vf_data { +#define QM_MATCH_SIZE offsetofend(struct acc_vf_data, qm_rsv_state) + /* QM match information */ +#define ACC_DEV_MAGIC 0XCDCDCDCDFEEDAACC + u64 acc_magic; + u32 qp_num; + u32 dev_id; + u32 que_iso_cfg; + u32 qp_base; + u32 vf_qm_state; + /* QM reserved match information */ + u32 qm_rsv_state[3]; + + /* QM RW regs */ + u32 aeq_int_mask; + u32 eq_int_mask; + u32 ifc_int_source; + u32 ifc_int_mask; + u32 ifc_int_set; + u32 page_size; + + /* QM_EQC_DW has 7 regs */ + u32 qm_eqc_dw[7]; + + /* QM_AEQC_DW has 7 regs */ + u32 qm_aeqc_dw[7]; + + /* QM reserved 5 regs */ + u32 qm_rsv_regs[5]; + u32 padding; + /* qm memory init information */ + u64 eqe_dma; + u64 aeqe_dma; + u64 sqc_dma; + u64 cqc_dma; +}; + +struct hisi_acc_vf_migration_file { + struct file *filp; + struct mutex lock; + bool disabled; + + struct acc_vf_data vf_data; + size_t total_length; +}; + +struct hisi_acc_vf_core_device { + struct vfio_pci_core_device core_device; + u8 deferred_reset:1; + /* for migration state */ + struct mutex state_mutex; + enum vfio_device_mig_state mig_state; + struct pci_dev *pf_dev; + struct pci_dev *vf_dev; + struct hisi_qm *pf_qm; + struct hisi_qm vf_qm; + u32 vf_qm_state; + int vf_id; + /* for reset handler */ + spinlock_t reset_lock; + struct hisi_acc_vf_migration_file *resuming_migf; + struct hisi_acc_vf_migration_file *saving_migf; +}; +#endif /* HISI_ACC_VFIO_PCI_H */ diff --git a/drivers/vfio/pci/mlx5/Kconfig b/drivers/vfio/pci/mlx5/Kconfig new file mode 100644 index 000000000000..29ba9c504a75 --- /dev/null +++ b/drivers/vfio/pci/mlx5/Kconfig @@ -0,0 +1,10 @@ +# SPDX-License-Identifier: GPL-2.0-only +config MLX5_VFIO_PCI + tristate "VFIO support for MLX5 PCI devices" + depends on MLX5_CORE + depends on VFIO_PCI_CORE + help + This provides migration support for MLX5 devices using the VFIO + framework. + + If you don't know what to do here, say N. diff --git a/drivers/vfio/pci/mlx5/Makefile b/drivers/vfio/pci/mlx5/Makefile new file mode 100644 index 000000000000..689627da7ff5 --- /dev/null +++ b/drivers/vfio/pci/mlx5/Makefile @@ -0,0 +1,4 @@ +# SPDX-License-Identifier: GPL-2.0-only +obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5-vfio-pci.o +mlx5-vfio-pci-y := main.o cmd.o + diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c new file mode 100644 index 000000000000..5c9f9218cc1d --- /dev/null +++ b/drivers/vfio/pci/mlx5/cmd.c @@ -0,0 +1,259 @@ +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved + */ + +#include "cmd.h" + +int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod) +{ + struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev); + u32 out[MLX5_ST_SZ_DW(suspend_vhca_out)] = {}; + u32 in[MLX5_ST_SZ_DW(suspend_vhca_in)] = {}; + int ret; + + if (!mdev) + return -ENOTCONN; + + MLX5_SET(suspend_vhca_in, in, opcode, MLX5_CMD_OP_SUSPEND_VHCA); + MLX5_SET(suspend_vhca_in, in, vhca_id, vhca_id); + MLX5_SET(suspend_vhca_in, in, op_mod, op_mod); + + ret = mlx5_cmd_exec_inout(mdev, suspend_vhca, in, out); + mlx5_vf_put_core_dev(mdev); + return ret; +} + +int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod) +{ + struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev); + u32 out[MLX5_ST_SZ_DW(resume_vhca_out)] = {}; + u32 in[MLX5_ST_SZ_DW(resume_vhca_in)] = {}; + int ret; + + if (!mdev) + return -ENOTCONN; + + MLX5_SET(resume_vhca_in, in, opcode, MLX5_CMD_OP_RESUME_VHCA); + MLX5_SET(resume_vhca_in, in, vhca_id, vhca_id); + MLX5_SET(resume_vhca_in, in, op_mod, op_mod); + + ret = mlx5_cmd_exec_inout(mdev, resume_vhca, in, out); + mlx5_vf_put_core_dev(mdev); + return ret; +} + +int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id, + size_t *state_size) +{ + struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev); + u32 out[MLX5_ST_SZ_DW(query_vhca_migration_state_out)] = {}; + u32 in[MLX5_ST_SZ_DW(query_vhca_migration_state_in)] = {}; + int ret; + + if (!mdev) + return -ENOTCONN; + + MLX5_SET(query_vhca_migration_state_in, in, opcode, + MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE); + MLX5_SET(query_vhca_migration_state_in, in, vhca_id, vhca_id); + MLX5_SET(query_vhca_migration_state_in, in, op_mod, 0); + + ret = mlx5_cmd_exec_inout(mdev, query_vhca_migration_state, in, out); + if (ret) + goto end; + + *state_size = MLX5_GET(query_vhca_migration_state_out, out, + required_umem_size); + +end: + mlx5_vf_put_core_dev(mdev); + return ret; +} + +int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id) +{ + struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev); + u32 in[MLX5_ST_SZ_DW(query_hca_cap_in)] = {}; + int out_size; + void *out; + int ret; + + if (!mdev) + return -ENOTCONN; + + out_size = MLX5_ST_SZ_BYTES(query_hca_cap_out); + out = kzalloc(out_size, GFP_KERNEL); + if (!out) { + ret = -ENOMEM; + goto end; + } + + MLX5_SET(query_hca_cap_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_CAP); + MLX5_SET(query_hca_cap_in, in, other_function, 1); + MLX5_SET(query_hca_cap_in, in, function_id, function_id); + MLX5_SET(query_hca_cap_in, in, op_mod, + MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE << 1 | + HCA_CAP_OPMOD_GET_CUR); + + ret = mlx5_cmd_exec_inout(mdev, query_hca_cap, in, out); + if (ret) + goto err_exec; + + *vhca_id = MLX5_GET(query_hca_cap_out, out, + capability.cmd_hca_cap.vhca_id); + +err_exec: + kfree(out); +end: + mlx5_vf_put_core_dev(mdev); + return ret; +} + +static int _create_state_mkey(struct mlx5_core_dev *mdev, u32 pdn, + struct mlx5_vf_migration_file *migf, u32 *mkey) +{ + size_t npages = DIV_ROUND_UP(migf->total_length, PAGE_SIZE); + struct sg_dma_page_iter dma_iter; + int err = 0, inlen; + __be64 *mtt; + void *mkc; + u32 *in; + + inlen = MLX5_ST_SZ_BYTES(create_mkey_in) + + sizeof(*mtt) * round_up(npages, 2); + + in = kvzalloc(inlen, GFP_KERNEL); + if (!in) + return -ENOMEM; + + MLX5_SET(create_mkey_in, in, translations_octword_actual_size, + DIV_ROUND_UP(npages, 2)); + mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt); + + for_each_sgtable_dma_page(&migf->table.sgt, &dma_iter, 0) + *mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter)); + + mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry); + MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT); + MLX5_SET(mkc, mkc, lr, 1); + MLX5_SET(mkc, mkc, lw, 1); + MLX5_SET(mkc, mkc, rr, 1); + MLX5_SET(mkc, mkc, rw, 1); + MLX5_SET(mkc, mkc, pd, pdn); + MLX5_SET(mkc, mkc, bsf_octword_size, 0); + MLX5_SET(mkc, mkc, qpn, 0xffffff); + MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT); + MLX5_SET(mkc, mkc, translations_octword_size, DIV_ROUND_UP(npages, 2)); + MLX5_SET64(mkc, mkc, len, migf->total_length); + err = mlx5_core_create_mkey(mdev, mkey, in, inlen); + kvfree(in); + return err; +} + +int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id, + struct mlx5_vf_migration_file *migf) +{ + struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev); + u32 out[MLX5_ST_SZ_DW(save_vhca_state_out)] = {}; + u32 in[MLX5_ST_SZ_DW(save_vhca_state_in)] = {}; + u32 pdn, mkey; + int err; + + if (!mdev) + return -ENOTCONN; + + err = mlx5_core_alloc_pd(mdev, &pdn); + if (err) + goto end; + + err = dma_map_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE, + 0); + if (err) + goto err_dma_map; + + err = _create_state_mkey(mdev, pdn, migf, &mkey); + if (err) + goto err_create_mkey; + + MLX5_SET(save_vhca_state_in, in, opcode, + MLX5_CMD_OP_SAVE_VHCA_STATE); + MLX5_SET(save_vhca_state_in, in, op_mod, 0); + MLX5_SET(save_vhca_state_in, in, vhca_id, vhca_id); + MLX5_SET(save_vhca_state_in, in, mkey, mkey); + MLX5_SET(save_vhca_state_in, in, size, migf->total_length); + + err = mlx5_cmd_exec_inout(mdev, save_vhca_state, in, out); + if (err) + goto err_exec; + + migf->total_length = + MLX5_GET(save_vhca_state_out, out, actual_image_size); + + mlx5_core_destroy_mkey(mdev, mkey); + mlx5_core_dealloc_pd(mdev, pdn); + dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE, 0); + mlx5_vf_put_core_dev(mdev); + + return 0; + +err_exec: + mlx5_core_destroy_mkey(mdev, mkey); +err_create_mkey: + dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_FROM_DEVICE, 0); +err_dma_map: + mlx5_core_dealloc_pd(mdev, pdn); +end: + mlx5_vf_put_core_dev(mdev); + return err; +} + +int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id, + struct mlx5_vf_migration_file *migf) +{ + struct mlx5_core_dev *mdev = mlx5_vf_get_core_dev(pdev); + u32 out[MLX5_ST_SZ_DW(save_vhca_state_out)] = {}; + u32 in[MLX5_ST_SZ_DW(save_vhca_state_in)] = {}; + u32 pdn, mkey; + int err; + + if (!mdev) + return -ENOTCONN; + + mutex_lock(&migf->lock); + if (!migf->total_length) { + err = -EINVAL; + goto end; + } + + err = mlx5_core_alloc_pd(mdev, &pdn); + if (err) + goto end; + + err = dma_map_sgtable(mdev->device, &migf->table.sgt, DMA_TO_DEVICE, 0); + if (err) + goto err_reg; + + err = _create_state_mkey(mdev, pdn, migf, &mkey); + if (err) + goto err_mkey; + + MLX5_SET(load_vhca_state_in, in, opcode, + MLX5_CMD_OP_LOAD_VHCA_STATE); + MLX5_SET(load_vhca_state_in, in, op_mod, 0); + MLX5_SET(load_vhca_state_in, in, vhca_id, vhca_id); + MLX5_SET(load_vhca_state_in, in, mkey, mkey); + MLX5_SET(load_vhca_state_in, in, size, migf->total_length); + + err = mlx5_cmd_exec_inout(mdev, load_vhca_state, in, out); + + mlx5_core_destroy_mkey(mdev, mkey); +err_mkey: + dma_unmap_sgtable(mdev->device, &migf->table.sgt, DMA_TO_DEVICE, 0); +err_reg: + mlx5_core_dealloc_pd(mdev, pdn); +end: + mlx5_vf_put_core_dev(mdev); + mutex_unlock(&migf->lock); + return err; +} diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h new file mode 100644 index 000000000000..1392a11a9cc0 --- /dev/null +++ b/drivers/vfio/pci/mlx5/cmd.h @@ -0,0 +1,36 @@ +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */ +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + */ + +#ifndef MLX5_VFIO_CMD_H +#define MLX5_VFIO_CMD_H + +#include <linux/kernel.h> +#include <linux/mlx5/driver.h> + +struct mlx5_vf_migration_file { + struct file *filp; + struct mutex lock; + bool disabled; + + struct sg_append_table table; + size_t total_length; + size_t allocated_length; + + /* Optimize mlx5vf_get_migration_page() for sequential access */ + struct scatterlist *last_offset_sg; + unsigned int sg_last_entry; + unsigned long last_offset; +}; + +int mlx5vf_cmd_suspend_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod); +int mlx5vf_cmd_resume_vhca(struct pci_dev *pdev, u16 vhca_id, u16 op_mod); +int mlx5vf_cmd_query_vhca_migration_state(struct pci_dev *pdev, u16 vhca_id, + size_t *state_size); +int mlx5vf_cmd_get_vhca_id(struct pci_dev *pdev, u16 function_id, u16 *vhca_id); +int mlx5vf_cmd_save_vhca_state(struct pci_dev *pdev, u16 vhca_id, + struct mlx5_vf_migration_file *migf); +int mlx5vf_cmd_load_vhca_state(struct pci_dev *pdev, u16 vhca_id, + struct mlx5_vf_migration_file *migf); +#endif /* MLX5_VFIO_CMD_H */ diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c new file mode 100644 index 000000000000..bbec5d288fee --- /dev/null +++ b/drivers/vfio/pci/mlx5/main.c @@ -0,0 +1,676 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (c) 2021-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved + */ + +#include <linux/device.h> +#include <linux/eventfd.h> +#include <linux/file.h> +#include <linux/interrupt.h> +#include <linux/iommu.h> +#include <linux/module.h> +#include <linux/mutex.h> +#include <linux/notifier.h> +#include <linux/pci.h> +#include <linux/pm_runtime.h> +#include <linux/types.h> +#include <linux/uaccess.h> +#include <linux/vfio.h> +#include <linux/sched/mm.h> +#include <linux/vfio_pci_core.h> +#include <linux/anon_inodes.h> + +#include "cmd.h" + +/* Arbitrary to prevent userspace from consuming endless memory */ +#define MAX_MIGRATION_SIZE (512*1024*1024) + +struct mlx5vf_pci_core_device { + struct vfio_pci_core_device core_device; + u16 vhca_id; + u8 migrate_cap:1; + u8 deferred_reset:1; + /* protect migration state */ + struct mutex state_mutex; + enum vfio_device_mig_state mig_state; + /* protect the reset_done flow */ + spinlock_t reset_lock; + struct mlx5_vf_migration_file *resuming_migf; + struct mlx5_vf_migration_file *saving_migf; +}; + +static struct page * +mlx5vf_get_migration_page(struct mlx5_vf_migration_file *migf, + unsigned long offset) +{ + unsigned long cur_offset = 0; + struct scatterlist *sg; + unsigned int i; + + /* All accesses are sequential */ + if (offset < migf->last_offset || !migf->last_offset_sg) { + migf->last_offset = 0; + migf->last_offset_sg = migf->table.sgt.sgl; + migf->sg_last_entry = 0; + } + + cur_offset = migf->last_offset; + + for_each_sg(migf->last_offset_sg, sg, + migf->table.sgt.orig_nents - migf->sg_last_entry, i) { + if (offset < sg->length + cur_offset) { + migf->last_offset_sg = sg; + migf->sg_last_entry += i; + migf->last_offset = cur_offset; + return nth_page(sg_page(sg), + (offset - cur_offset) / PAGE_SIZE); + } + cur_offset += sg->length; + } + return NULL; +} + +static int mlx5vf_add_migration_pages(struct mlx5_vf_migration_file *migf, + unsigned int npages) +{ + unsigned int to_alloc = npages; + struct page **page_list; + unsigned long filled; + unsigned int to_fill; + int ret; + + to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list)); + page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL); + if (!page_list) + return -ENOMEM; + + do { + filled = alloc_pages_bulk_array(GFP_KERNEL, to_fill, page_list); + if (!filled) { + ret = -ENOMEM; + goto err; + } + to_alloc -= filled; + ret = sg_alloc_append_table_from_pages( + &migf->table, page_list, filled, 0, + filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC, + GFP_KERNEL); + + if (ret) + goto err; + migf->allocated_length += filled * PAGE_SIZE; + /* clean input for another bulk allocation */ + memset(page_list, 0, filled * sizeof(*page_list)); + to_fill = min_t(unsigned int, to_alloc, + PAGE_SIZE / sizeof(*page_list)); + } while (to_alloc > 0); + + kvfree(page_list); + return 0; + +err: + kvfree(page_list); + return ret; +} + +static void mlx5vf_disable_fd(struct mlx5_vf_migration_file *migf) +{ + struct sg_page_iter sg_iter; + + mutex_lock(&migf->lock); + /* Undo alloc_pages_bulk_array() */ + for_each_sgtable_page(&migf->table.sgt, &sg_iter, 0) + __free_page(sg_page_iter_page(&sg_iter)); + sg_free_append_table(&migf->table); + migf->disabled = true; + migf->total_length = 0; + migf->allocated_length = 0; + migf->filp->f_pos = 0; + mutex_unlock(&migf->lock); +} + +static int mlx5vf_release_file(struct inode *inode, struct file *filp) +{ + struct mlx5_vf_migration_file *migf = filp->private_data; + + mlx5vf_disable_fd(migf); + mutex_destroy(&migf->lock); + kfree(migf); + return 0; +} + +static ssize_t mlx5vf_save_read(struct file *filp, char __user *buf, size_t len, + loff_t *pos) +{ + struct mlx5_vf_migration_file *migf = filp->private_data; + ssize_t done = 0; + + if (pos) + return -ESPIPE; + pos = &filp->f_pos; + + mutex_lock(&migf->lock); + if (*pos > migf->total_length) { + done = -EINVAL; + goto out_unlock; + } + if (migf->disabled) { + done = -ENODEV; + goto out_unlock; + } + + len = min_t(size_t, migf->total_length - *pos, len); + while (len) { + size_t page_offset; + struct page *page; + size_t page_len; + u8 *from_buff; + int ret; + + page_offset = (*pos) % PAGE_SIZE; + page = mlx5vf_get_migration_page(migf, *pos - page_offset); + if (!page) { + if (done == 0) + done = -EINVAL; + goto out_unlock; + } + + page_len = min_t(size_t, len, PAGE_SIZE - page_offset); + from_buff = kmap_local_page(page); + ret = copy_to_user(buf, from_buff + page_offset, page_len); + kunmap_local(from_buff); + if (ret) { + done = -EFAULT; + goto out_unlock; + } + *pos += page_len; + len -= page_len; + done += page_len; + buf += page_len; + } + +out_unlock: + mutex_unlock(&migf->lock); + return done; +} + +static const struct file_operations mlx5vf_save_fops = { + .owner = THIS_MODULE, + .read = mlx5vf_save_read, + .release = mlx5vf_release_file, + .llseek = no_llseek, +}; + +static struct mlx5_vf_migration_file * +mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev) +{ + struct mlx5_vf_migration_file *migf; + int ret; + + migf = kzalloc(sizeof(*migf), GFP_KERNEL); + if (!migf) + return ERR_PTR(-ENOMEM); + + migf->filp = anon_inode_getfile("mlx5vf_mig", &mlx5vf_save_fops, migf, + O_RDONLY); + if (IS_ERR(migf->filp)) { + int err = PTR_ERR(migf->filp); + + kfree(migf); + return ERR_PTR(err); + } + + stream_open(migf->filp->f_inode, migf->filp); + mutex_init(&migf->lock); + + ret = mlx5vf_cmd_query_vhca_migration_state( + mvdev->core_device.pdev, mvdev->vhca_id, &migf->total_length); + if (ret) + goto out_free; + + ret = mlx5vf_add_migration_pages( + migf, DIV_ROUND_UP_ULL(migf->total_length, PAGE_SIZE)); + if (ret) + goto out_free; + + ret = mlx5vf_cmd_save_vhca_state(mvdev->core_device.pdev, + mvdev->vhca_id, migf); + if (ret) + goto out_free; + return migf; +out_free: + fput(migf->filp); + return ERR_PTR(ret); +} + +static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf, + size_t len, loff_t *pos) +{ + struct mlx5_vf_migration_file *migf = filp->private_data; + loff_t requested_length; + ssize_t done = 0; + + if (pos) + return -ESPIPE; + pos = &filp->f_pos; + + if (*pos < 0 || + check_add_overflow((loff_t)len, *pos, &requested_length)) + return -EINVAL; + + if (requested_length > MAX_MIGRATION_SIZE) + return -ENOMEM; + + mutex_lock(&migf->lock); + if (migf->disabled) { + done = -ENODEV; + goto out_unlock; + } + + if (migf->allocated_length < requested_length) { + done = mlx5vf_add_migration_pages( + migf, + DIV_ROUND_UP(requested_length - migf->allocated_length, + PAGE_SIZE)); + if (done) + goto out_unlock; + } + + while (len) { + size_t page_offset; + struct page *page; + size_t page_len; + u8 *to_buff; + int ret; + + page_offset = (*pos) % PAGE_SIZE; + page = mlx5vf_get_migration_page(migf, *pos - page_offset); + if (!page) { + if (done == 0) + done = -EINVAL; + goto out_unlock; + } + + page_len = min_t(size_t, len, PAGE_SIZE - page_offset); + to_buff = kmap_local_page(page); + ret = copy_from_user(to_buff + page_offset, buf, page_len); + kunmap_local(to_buff); + if (ret) { + done = -EFAULT; + goto out_unlock; + } + *pos += page_len; + len -= page_len; + done += page_len; + buf += page_len; + migf->total_length += page_len; + } +out_unlock: + mutex_unlock(&migf->lock); + return done; +} + +static const struct file_operations mlx5vf_resume_fops = { + .owner = THIS_MODULE, + .write = mlx5vf_resume_write, + .release = mlx5vf_release_file, + .llseek = no_llseek, +}; + +static struct mlx5_vf_migration_file * +mlx5vf_pci_resume_device_data(struct mlx5vf_pci_core_device *mvdev) +{ + struct mlx5_vf_migration_file *migf; + + migf = kzalloc(sizeof(*migf), GFP_KERNEL); + if (!migf) + return ERR_PTR(-ENOMEM); + + migf->filp = anon_inode_getfile("mlx5vf_mig", &mlx5vf_resume_fops, migf, + O_WRONLY); + if (IS_ERR(migf->filp)) { + int err = PTR_ERR(migf->filp); + + kfree(migf); + return ERR_PTR(err); + } + stream_open(migf->filp->f_inode, migf->filp); + mutex_init(&migf->lock); + return migf; +} + +static void mlx5vf_disable_fds(struct mlx5vf_pci_core_device *mvdev) +{ + if (mvdev->resuming_migf) { + mlx5vf_disable_fd(mvdev->resuming_migf); + fput(mvdev->resuming_migf->filp); + mvdev->resuming_migf = NULL; + } + if (mvdev->saving_migf) { + mlx5vf_disable_fd(mvdev->saving_migf); + fput(mvdev->saving_migf->filp); + mvdev->saving_migf = NULL; + } +} + +static struct file * +mlx5vf_pci_step_device_state_locked(struct mlx5vf_pci_core_device *mvdev, + u32 new) +{ + u32 cur = mvdev->mig_state; + int ret; + + if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_STOP) { + ret = mlx5vf_cmd_suspend_vhca( + mvdev->core_device.pdev, mvdev->vhca_id, + MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_RESPONDER); + if (ret) + return ERR_PTR(ret); + return NULL; + } + + if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RUNNING_P2P) { + ret = mlx5vf_cmd_resume_vhca( + mvdev->core_device.pdev, mvdev->vhca_id, + MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_RESPONDER); + if (ret) + return ERR_PTR(ret); + return NULL; + } + + if (cur == VFIO_DEVICE_STATE_RUNNING && new == VFIO_DEVICE_STATE_RUNNING_P2P) { + ret = mlx5vf_cmd_suspend_vhca( + mvdev->core_device.pdev, mvdev->vhca_id, + MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_INITIATOR); + if (ret) + return ERR_PTR(ret); + return NULL; + } + + if (cur == VFIO_DEVICE_STATE_RUNNING_P2P && new == VFIO_DEVICE_STATE_RUNNING) { + ret = mlx5vf_cmd_resume_vhca( + mvdev->core_device.pdev, mvdev->vhca_id, + MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_INITIATOR); + if (ret) + return ERR_PTR(ret); + return NULL; + } + + if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_STOP_COPY) { + struct mlx5_vf_migration_file *migf; + + migf = mlx5vf_pci_save_device_data(mvdev); + if (IS_ERR(migf)) + return ERR_CAST(migf); + get_file(migf->filp); + mvdev->saving_migf = migf; + return migf->filp; + } + + if ((cur == VFIO_DEVICE_STATE_STOP_COPY && new == VFIO_DEVICE_STATE_STOP)) { + mlx5vf_disable_fds(mvdev); + return NULL; + } + + if (cur == VFIO_DEVICE_STATE_STOP && new == VFIO_DEVICE_STATE_RESUMING) { + struct mlx5_vf_migration_file *migf; + + migf = mlx5vf_pci_resume_device_data(mvdev); + if (IS_ERR(migf)) + return ERR_CAST(migf); + get_file(migf->filp); + mvdev->resuming_migf = migf; + return migf->filp; + } + + if (cur == VFIO_DEVICE_STATE_RESUMING && new == VFIO_DEVICE_STATE_STOP) { + ret = mlx5vf_cmd_load_vhca_state(mvdev->core_device.pdev, + mvdev->vhca_id, + mvdev->resuming_migf); + if (ret) + return ERR_PTR(ret); + mlx5vf_disable_fds(mvdev); + return NULL; + } + + /* + * vfio_mig_get_next_state() does not use arcs other than the above + */ + WARN_ON(true); + return ERR_PTR(-EINVAL); +} + +/* + * This function is called in all state_mutex unlock cases to + * handle a 'deferred_reset' if exists. + */ +static void mlx5vf_state_mutex_unlock(struct mlx5vf_pci_core_device *mvdev) +{ +again: + spin_lock(&mvdev->reset_lock); + if (mvdev->deferred_reset) { + mvdev->deferred_reset = false; + spin_unlock(&mvdev->reset_lock); + mvdev->mig_state = VFIO_DEVICE_STATE_RUNNING; + mlx5vf_disable_fds(mvdev); + goto again; + } + mutex_unlock(&mvdev->state_mutex); + spin_unlock(&mvdev->reset_lock); +} + +static struct file * +mlx5vf_pci_set_device_state(struct vfio_device *vdev, + enum vfio_device_mig_state new_state) +{ + struct mlx5vf_pci_core_device *mvdev = container_of( + vdev, struct mlx5vf_pci_core_device, core_device.vdev); + enum vfio_device_mig_state next_state; + struct file *res = NULL; + int ret; + + mutex_lock(&mvdev->state_mutex); + while (new_state != mvdev->mig_state) { + ret = vfio_mig_get_next_state(vdev, mvdev->mig_state, + new_state, &next_state); + if (ret) { + res = ERR_PTR(ret); + break; + } + res = mlx5vf_pci_step_device_state_locked(mvdev, next_state); + if (IS_ERR(res)) + break; + mvdev->mig_state = next_state; + if (WARN_ON(res && new_state != mvdev->mig_state)) { + fput(res); + res = ERR_PTR(-EINVAL); + break; + } + } + mlx5vf_state_mutex_unlock(mvdev); + return res; +} + +static int mlx5vf_pci_get_device_state(struct vfio_device *vdev, + enum vfio_device_mig_state *curr_state) +{ + struct mlx5vf_pci_core_device *mvdev = container_of( + vdev, struct mlx5vf_pci_core_device, core_device.vdev); + + mutex_lock(&mvdev->state_mutex); + *curr_state = mvdev->mig_state; + mlx5vf_state_mutex_unlock(mvdev); + return 0; +} + +static void mlx5vf_pci_aer_reset_done(struct pci_dev *pdev) +{ + struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev); + + if (!mvdev->migrate_cap) + return; + + /* + * As the higher VFIO layers are holding locks across reset and using + * those same locks with the mm_lock we need to prevent ABBA deadlock + * with the state_mutex and mm_lock. + * In case the state_mutex was taken already we defer the cleanup work + * to the unlock flow of the other running context. + */ + spin_lock(&mvdev->reset_lock); + mvdev->deferred_reset = true; + if (!mutex_trylock(&mvdev->state_mutex)) { + spin_unlock(&mvdev->reset_lock); + return; + } + spin_unlock(&mvdev->reset_lock); + mlx5vf_state_mutex_unlock(mvdev); +} + +static int mlx5vf_pci_open_device(struct vfio_device *core_vdev) +{ + struct mlx5vf_pci_core_device *mvdev = container_of( + core_vdev, struct mlx5vf_pci_core_device, core_device.vdev); + struct vfio_pci_core_device *vdev = &mvdev->core_device; + int vf_id; + int ret; + + ret = vfio_pci_core_enable(vdev); + if (ret) + return ret; + + if (!mvdev->migrate_cap) { + vfio_pci_core_finish_enable(vdev); + return 0; + } + + vf_id = pci_iov_vf_id(vdev->pdev); + if (vf_id < 0) { + ret = vf_id; + goto out_disable; + } + + ret = mlx5vf_cmd_get_vhca_id(vdev->pdev, vf_id + 1, &mvdev->vhca_id); + if (ret) + goto out_disable; + + mvdev->mig_state = VFIO_DEVICE_STATE_RUNNING; + vfio_pci_core_finish_enable(vdev); + return 0; +out_disable: + vfio_pci_core_disable(vdev); + return ret; +} + +static void mlx5vf_pci_close_device(struct vfio_device *core_vdev) +{ + struct mlx5vf_pci_core_device *mvdev = container_of( + core_vdev, struct mlx5vf_pci_core_device, core_device.vdev); + + mlx5vf_disable_fds(mvdev); + vfio_pci_core_close_device(core_vdev); +} + +static const struct vfio_device_ops mlx5vf_pci_ops = { + .name = "mlx5-vfio-pci", + .open_device = mlx5vf_pci_open_device, + .close_device = mlx5vf_pci_close_device, + .ioctl = vfio_pci_core_ioctl, + .device_feature = vfio_pci_core_ioctl_feature, + .read = vfio_pci_core_read, + .write = vfio_pci_core_write, + .mmap = vfio_pci_core_mmap, + .request = vfio_pci_core_request, + .match = vfio_pci_core_match, + .migration_set_state = mlx5vf_pci_set_device_state, + .migration_get_state = mlx5vf_pci_get_device_state, +}; + +static int mlx5vf_pci_probe(struct pci_dev *pdev, + const struct pci_device_id *id) +{ + struct mlx5vf_pci_core_device *mvdev; + int ret; + + mvdev = kzalloc(sizeof(*mvdev), GFP_KERNEL); + if (!mvdev) + return -ENOMEM; + vfio_pci_core_init_device(&mvdev->core_device, pdev, &mlx5vf_pci_ops); + + if (pdev->is_virtfn) { + struct mlx5_core_dev *mdev = + mlx5_vf_get_core_dev(pdev); + + if (mdev) { + if (MLX5_CAP_GEN(mdev, migration)) { + mvdev->migrate_cap = 1; + mvdev->core_device.vdev.migration_flags = + VFIO_MIGRATION_STOP_COPY | + VFIO_MIGRATION_P2P; + mutex_init(&mvdev->state_mutex); + spin_lock_init(&mvdev->reset_lock); + } + mlx5_vf_put_core_dev(mdev); + } + } + + ret = vfio_pci_core_register_device(&mvdev->core_device); + if (ret) + goto out_free; + + dev_set_drvdata(&pdev->dev, mvdev); + return 0; + +out_free: + vfio_pci_core_uninit_device(&mvdev->core_device); + kfree(mvdev); + return ret; +} + +static void mlx5vf_pci_remove(struct pci_dev *pdev) +{ + struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev); + + vfio_pci_core_unregister_device(&mvdev->core_device); + vfio_pci_core_uninit_device(&mvdev->core_device); + kfree(mvdev); +} + +static const struct pci_device_id mlx5vf_pci_table[] = { + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_MELLANOX, 0x101e) }, /* ConnectX Family mlx5Gen Virtual Function */ + {} +}; + +MODULE_DEVICE_TABLE(pci, mlx5vf_pci_table); + +static const struct pci_error_handlers mlx5vf_err_handlers = { + .reset_done = mlx5vf_pci_aer_reset_done, + .error_detected = vfio_pci_core_aer_err_detected, +}; + +static struct pci_driver mlx5vf_pci_driver = { + .name = KBUILD_MODNAME, + .id_table = mlx5vf_pci_table, + .probe = mlx5vf_pci_probe, + .remove = mlx5vf_pci_remove, + .err_handler = &mlx5vf_err_handlers, +}; + +static void __exit mlx5vf_pci_cleanup(void) +{ + pci_unregister_driver(&mlx5vf_pci_driver); +} + +static int __init mlx5vf_pci_init(void) +{ + return pci_register_driver(&mlx5vf_pci_driver); +} + +module_init(mlx5vf_pci_init); +module_exit(mlx5vf_pci_cleanup); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Max Gurtovoy <mgurtovoy@nvidia.com>"); +MODULE_AUTHOR("Yishai Hadas <yishaih@nvidia.com>"); +MODULE_DESCRIPTION( + "MLX5 VFIO PCI - User Level meta-driver for MLX5 device family"); diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c index a5ce92beb655..2b047469e02f 100644 --- a/drivers/vfio/pci/vfio_pci.c +++ b/drivers/vfio/pci/vfio_pci.c @@ -130,6 +130,7 @@ static const struct vfio_device_ops vfio_pci_ops = { .open_device = vfio_pci_open_device, .close_device = vfio_pci_core_close_device, .ioctl = vfio_pci_core_ioctl, + .device_feature = vfio_pci_core_ioctl_feature, .read = vfio_pci_core_read, .write = vfio_pci_core_write, .mmap = vfio_pci_core_mmap, diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index f948e6cd2993..b7bb16f92ac6 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -228,6 +228,19 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat if (!ret) { /* D3 might be unsupported via quirk, skip unless in D3 */ if (needs_save && pdev->current_state >= PCI_D3hot) { + /* + * The current PCI state will be saved locally in + * 'pm_save' during the D3hot transition. When the + * device state is changed to D0 again with the current + * function, then pci_store_saved_state() will restore + * the state and will free the memory pointed by + * 'pm_save'. There are few cases where the PCI power + * state can be changed to D0 without the involvement + * of the driver. For these cases, free the earlier + * allocated memory first before overwriting 'pm_save' + * to prevent the memory leak. + */ + kfree(vdev->pm_save); vdev->pm_save = pci_store_saved_state(pdev); } else if (needs_restore) { pci_load_and_free_saved_state(pdev, &vdev->pm_save); @@ -322,6 +335,17 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev) /* For needs_reset */ lockdep_assert_held(&vdev->vdev.dev_set->lock); + /* + * This function can be invoked while the power state is non-D0. + * This function calls __pci_reset_function_locked() which internally + * can use pci_pm_reset() for the function reset. pci_pm_reset() will + * fail if the power state is non-D0. Also, for the devices which + * have NoSoftRst-, the reset function can cause the PCI config space + * reset without restoring the original state (saved locally in + * 'vdev->pm_save'). + */ + vfio_pci_set_power_state(vdev, PCI_D0); + /* Stop the device from further DMA */ pci_clear_master(pdev); @@ -921,6 +945,19 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, return -EINVAL; vfio_pci_zap_and_down_write_memory_lock(vdev); + + /* + * This function can be invoked while the power state is non-D0. + * If pci_try_reset_function() has been called while the power + * state is non-D0, then pci_try_reset_function() will + * internally set the power state to D0 without vfio driver + * involvement. For the devices which have NoSoftRst-, the + * reset function can cause the PCI config space reset without + * restoring the original state (saved locally in + * 'vdev->pm_save'). + */ + vfio_pci_set_power_state(vdev, PCI_D0); + ret = pci_try_reset_function(vdev->pdev); up_write(&vdev->memory_lock); @@ -1114,70 +1151,50 @@ hot_reset_release: return vfio_pci_ioeventfd(vdev, ioeventfd.offset, ioeventfd.data, count, ioeventfd.fd); - } else if (cmd == VFIO_DEVICE_FEATURE) { - struct vfio_device_feature feature; - uuid_t uuid; - - minsz = offsetofend(struct vfio_device_feature, flags); - - if (copy_from_user(&feature, (void __user *)arg, minsz)) - return -EFAULT; - - if (feature.argsz < minsz) - return -EINVAL; - - /* Check unknown flags */ - if (feature.flags & ~(VFIO_DEVICE_FEATURE_MASK | - VFIO_DEVICE_FEATURE_SET | - VFIO_DEVICE_FEATURE_GET | - VFIO_DEVICE_FEATURE_PROBE)) - return -EINVAL; - - /* GET & SET are mutually exclusive except with PROBE */ - if (!(feature.flags & VFIO_DEVICE_FEATURE_PROBE) && - (feature.flags & VFIO_DEVICE_FEATURE_SET) && - (feature.flags & VFIO_DEVICE_FEATURE_GET)) - return -EINVAL; - - switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) { - case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: - if (!vdev->vf_token) - return -ENOTTY; - - /* - * We do not support GET of the VF Token UUID as this - * could expose the token of the previous device user. - */ - if (feature.flags & VFIO_DEVICE_FEATURE_GET) - return -EINVAL; - - if (feature.flags & VFIO_DEVICE_FEATURE_PROBE) - return 0; + } + return -ENOTTY; +} +EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl); - /* Don't SET unless told to do so */ - if (!(feature.flags & VFIO_DEVICE_FEATURE_SET)) - return -EINVAL; +static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags, + void __user *arg, size_t argsz) +{ + struct vfio_pci_core_device *vdev = + container_of(device, struct vfio_pci_core_device, vdev); + uuid_t uuid; + int ret; - if (feature.argsz < minsz + sizeof(uuid)) - return -EINVAL; + if (!vdev->vf_token) + return -ENOTTY; + /* + * We do not support GET of the VF Token UUID as this could + * expose the token of the previous device user. + */ + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, + sizeof(uuid)); + if (ret != 1) + return ret; - if (copy_from_user(&uuid, (void __user *)(arg + minsz), - sizeof(uuid))) - return -EFAULT; + if (copy_from_user(&uuid, arg, sizeof(uuid))) + return -EFAULT; - mutex_lock(&vdev->vf_token->lock); - uuid_copy(&vdev->vf_token->uuid, &uuid); - mutex_unlock(&vdev->vf_token->lock); + mutex_lock(&vdev->vf_token->lock); + uuid_copy(&vdev->vf_token->uuid, &uuid); + mutex_unlock(&vdev->vf_token->lock); + return 0; +} - return 0; - default: - return -ENOTTY; - } +int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, + void __user *arg, size_t argsz) +{ + switch (flags & VFIO_DEVICE_FEATURE_MASK) { + case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN: + return vfio_pci_core_feature_token(device, flags, arg, argsz); + default: + return -ENOTTY; } - - return -ENOTTY; } -EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl); +EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl_feature); static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf, size_t count, loff_t *ppos, bool iswrite) @@ -1891,8 +1908,8 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev) } EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device); -static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev, - pci_channel_state_t state) +pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev, + pci_channel_state_t state) { struct vfio_pci_core_device *vdev; struct vfio_device *device; @@ -1914,6 +1931,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev, return PCI_ERS_RESULT_CAN_RECOVER; } +EXPORT_SYMBOL_GPL(vfio_pci_core_aer_err_detected); int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn) { @@ -1936,7 +1954,7 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn) EXPORT_SYMBOL_GPL(vfio_pci_core_sriov_configure); const struct pci_error_handlers vfio_pci_core_err_handlers = { - .error_detected = vfio_pci_aer_err_detected, + .error_detected = vfio_pci_core_aer_err_detected, }; EXPORT_SYMBOL_GPL(vfio_pci_core_err_handlers); @@ -2055,6 +2073,18 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set, } cur_mem = NULL; + /* + * The pci_reset_bus() will reset all the devices in the bus. + * The power state can be non-D0 for some of the devices in the bus. + * For these devices, the pci_reset_bus() will internally set + * the power state to D0 without vfio driver involvement. + * For the devices which have NoSoftRst-, the reset function can + * cause the PCI config space reset without restoring the original + * state (saved locally in 'vdev->pm_save'). + */ + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) + vfio_pci_set_power_state(cur, PCI_D0); + ret = pci_reset_bus(pdev); err_undo: @@ -2108,6 +2138,18 @@ static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set) if (!pdev) return false; + /* + * The pci_reset_bus() will reset all the devices in the bus. + * The power state can be non-D0 for some of the devices in the bus. + * For these devices, the pci_reset_bus() will internally set + * the power state to D0 without vfio driver involvement. + * For the devices which have NoSoftRst-, the reset function can + * cause the PCI config space reset without restoring the original + * state (saved locally in 'vdev->pm_save'). + */ + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) + vfio_pci_set_power_state(cur, PCI_D0); + ret = pci_reset_bus(pdev); if (ret) return false; diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c index 57d3b2cbbd8e..82ac1569deb0 100644 --- a/drivers/vfio/pci/vfio_pci_rdwr.c +++ b/drivers/vfio/pci/vfio_pci_rdwr.c @@ -288,6 +288,7 @@ out: return done; } +#ifdef CONFIG_VFIO_PCI_VGA ssize_t vfio_pci_vga_rw(struct vfio_pci_core_device *vdev, char __user *buf, size_t count, loff_t *ppos, bool iswrite) { @@ -355,6 +356,7 @@ ssize_t vfio_pci_vga_rw(struct vfio_pci_core_device *vdev, char __user *buf, return done; } +#endif static void vfio_pci_ioeventfd_do_write(struct vfio_pci_ioeventfd *ioeventfd, bool test_mem) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index 735d1d344af9..a4555014bd1e 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -1557,15 +1557,303 @@ static int vfio_device_fops_release(struct inode *inode, struct file *filep) return 0; } +/* + * vfio_mig_get_next_state - Compute the next step in the FSM + * @cur_fsm - The current state the device is in + * @new_fsm - The target state to reach + * @next_fsm - Pointer to the next step to get to new_fsm + * + * Return 0 upon success, otherwise -errno + * Upon success the next step in the state progression between cur_fsm and + * new_fsm will be set in next_fsm. + * + * This breaks down requests for combination transitions into smaller steps and + * returns the next step to get to new_fsm. The function may need to be called + * multiple times before reaching new_fsm. + * + */ +int vfio_mig_get_next_state(struct vfio_device *device, + enum vfio_device_mig_state cur_fsm, + enum vfio_device_mig_state new_fsm, + enum vfio_device_mig_state *next_fsm) +{ + enum { VFIO_DEVICE_NUM_STATES = VFIO_DEVICE_STATE_RUNNING_P2P + 1 }; + /* + * The coding in this table requires the driver to implement the + * following FSM arcs: + * RESUMING -> STOP + * STOP -> RESUMING + * STOP -> STOP_COPY + * STOP_COPY -> STOP + * + * If P2P is supported then the driver must also implement these FSM + * arcs: + * RUNNING -> RUNNING_P2P + * RUNNING_P2P -> RUNNING + * RUNNING_P2P -> STOP + * STOP -> RUNNING_P2P + * Without P2P the driver must implement: + * RUNNING -> STOP + * STOP -> RUNNING + * + * The coding will step through multiple states for some combination + * transitions; if all optional features are supported, this means the + * following ones: + * RESUMING -> STOP -> RUNNING_P2P + * RESUMING -> STOP -> RUNNING_P2P -> RUNNING + * RESUMING -> STOP -> STOP_COPY + * RUNNING -> RUNNING_P2P -> STOP + * RUNNING -> RUNNING_P2P -> STOP -> RESUMING + * RUNNING -> RUNNING_P2P -> STOP -> STOP_COPY + * RUNNING_P2P -> STOP -> RESUMING + * RUNNING_P2P -> STOP -> STOP_COPY + * STOP -> RUNNING_P2P -> RUNNING + * STOP_COPY -> STOP -> RESUMING + * STOP_COPY -> STOP -> RUNNING_P2P + * STOP_COPY -> STOP -> RUNNING_P2P -> RUNNING + */ + static const u8 vfio_from_fsm_table[VFIO_DEVICE_NUM_STATES][VFIO_DEVICE_NUM_STATES] = { + [VFIO_DEVICE_STATE_STOP] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING_P2P, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING, + [VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_RUNNING] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_RUNNING_P2P, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_RUNNING_P2P, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RUNNING_P2P, + [VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_STOP_COPY] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP_COPY, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_RESUMING] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_RESUMING, + [VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_RUNNING_P2P] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_RUNNING, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_STOP, + [VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_RUNNING_P2P, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + [VFIO_DEVICE_STATE_ERROR] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_RUNNING_P2P] = VFIO_DEVICE_STATE_ERROR, + [VFIO_DEVICE_STATE_ERROR] = VFIO_DEVICE_STATE_ERROR, + }, + }; + + static const unsigned int state_flags_table[VFIO_DEVICE_NUM_STATES] = { + [VFIO_DEVICE_STATE_STOP] = VFIO_MIGRATION_STOP_COPY, + [VFIO_DEVICE_STATE_RUNNING] = VFIO_MIGRATION_STOP_COPY, + [VFIO_DEVICE_STATE_STOP_COPY] = VFIO_MIGRATION_STOP_COPY, + [VFIO_DEVICE_STATE_RESUMING] = VFIO_MIGRATION_STOP_COPY, + [VFIO_DEVICE_STATE_RUNNING_P2P] = + VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P, + [VFIO_DEVICE_STATE_ERROR] = ~0U, + }; + + if (WARN_ON(cur_fsm >= ARRAY_SIZE(vfio_from_fsm_table) || + (state_flags_table[cur_fsm] & device->migration_flags) != + state_flags_table[cur_fsm])) + return -EINVAL; + + if (new_fsm >= ARRAY_SIZE(vfio_from_fsm_table) || + (state_flags_table[new_fsm] & device->migration_flags) != + state_flags_table[new_fsm]) + return -EINVAL; + + /* + * Arcs touching optional and unsupported states are skipped over. The + * driver will instead see an arc from the original state to the next + * logical state, as per the above comment. + */ + *next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm]; + while ((state_flags_table[*next_fsm] & device->migration_flags) != + state_flags_table[*next_fsm]) + *next_fsm = vfio_from_fsm_table[*next_fsm][new_fsm]; + + return (*next_fsm != VFIO_DEVICE_STATE_ERROR) ? 0 : -EINVAL; +} +EXPORT_SYMBOL_GPL(vfio_mig_get_next_state); + +/* + * Convert the drivers's struct file into a FD number and return it to userspace + */ +static int vfio_ioct_mig_return_fd(struct file *filp, void __user *arg, + struct vfio_device_feature_mig_state *mig) +{ + int ret; + int fd; + + fd = get_unused_fd_flags(O_CLOEXEC); + if (fd < 0) { + ret = fd; + goto out_fput; + } + + mig->data_fd = fd; + if (copy_to_user(arg, mig, sizeof(*mig))) { + ret = -EFAULT; + goto out_put_unused; + } + fd_install(fd, filp); + return 0; + +out_put_unused: + put_unused_fd(fd); +out_fput: + fput(filp); + return ret; +} + +static int +vfio_ioctl_device_feature_mig_device_state(struct vfio_device *device, + u32 flags, void __user *arg, + size_t argsz) +{ + size_t minsz = + offsetofend(struct vfio_device_feature_mig_state, data_fd); + struct vfio_device_feature_mig_state mig; + struct file *filp = NULL; + int ret; + + if (!device->ops->migration_set_state || + !device->ops->migration_get_state) + return -ENOTTY; + + ret = vfio_check_feature(flags, argsz, + VFIO_DEVICE_FEATURE_SET | + VFIO_DEVICE_FEATURE_GET, + sizeof(mig)); + if (ret != 1) + return ret; + + if (copy_from_user(&mig, arg, minsz)) + return -EFAULT; + + if (flags & VFIO_DEVICE_FEATURE_GET) { + enum vfio_device_mig_state curr_state; + + ret = device->ops->migration_get_state(device, &curr_state); + if (ret) + return ret; + mig.device_state = curr_state; + goto out_copy; + } + + /* Handle the VFIO_DEVICE_FEATURE_SET */ + filp = device->ops->migration_set_state(device, mig.device_state); + if (IS_ERR(filp) || !filp) + goto out_copy; + + return vfio_ioct_mig_return_fd(filp, arg, &mig); +out_copy: + mig.data_fd = -1; + if (copy_to_user(arg, &mig, sizeof(mig))) + return -EFAULT; + if (IS_ERR(filp)) + return PTR_ERR(filp); + return 0; +} + +static int vfio_ioctl_device_feature_migration(struct vfio_device *device, + u32 flags, void __user *arg, + size_t argsz) +{ + struct vfio_device_feature_migration mig = { + .flags = device->migration_flags, + }; + int ret; + + if (!device->ops->migration_set_state || + !device->ops->migration_get_state) + return -ENOTTY; + + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET, + sizeof(mig)); + if (ret != 1) + return ret; + if (copy_to_user(arg, &mig, sizeof(mig))) + return -EFAULT; + return 0; +} + +static int vfio_ioctl_device_feature(struct vfio_device *device, + struct vfio_device_feature __user *arg) +{ + size_t minsz = offsetofend(struct vfio_device_feature, flags); + struct vfio_device_feature feature; + + if (copy_from_user(&feature, arg, minsz)) + return -EFAULT; + + if (feature.argsz < minsz) + return -EINVAL; + + /* Check unknown flags */ + if (feature.flags & + ~(VFIO_DEVICE_FEATURE_MASK | VFIO_DEVICE_FEATURE_SET | + VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_PROBE)) + return -EINVAL; + + /* GET & SET are mutually exclusive except with PROBE */ + if (!(feature.flags & VFIO_DEVICE_FEATURE_PROBE) && + (feature.flags & VFIO_DEVICE_FEATURE_SET) && + (feature.flags & VFIO_DEVICE_FEATURE_GET)) + return -EINVAL; + + switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) { + case VFIO_DEVICE_FEATURE_MIGRATION: + return vfio_ioctl_device_feature_migration( + device, feature.flags, arg->data, + feature.argsz - minsz); + case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE: + return vfio_ioctl_device_feature_mig_device_state( + device, feature.flags, arg->data, + feature.argsz - minsz); + default: + if (unlikely(!device->ops->device_feature)) + return -EINVAL; + return device->ops->device_feature(device, feature.flags, + arg->data, + feature.argsz - minsz); + } +} + static long vfio_device_fops_unl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg) { struct vfio_device *device = filep->private_data; - if (unlikely(!device->ops->ioctl)) - return -EINVAL; - - return device->ops->ioctl(device, cmd, arg); + switch (cmd) { + case VFIO_DEVICE_FEATURE: + return vfio_ioctl_device_feature(device, (void __user *)arg); + default: + if (unlikely(!device->ops->ioctl)) + return -EINVAL; + return device->ops->ioctl(device, cmd, arg); + } } static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf, diff --git a/drivers/crypto/hisilicon/qm.h b/include/linux/hisi_acc_qm.h index 3068093229a5..177f7b7cd414 100644 --- a/drivers/crypto/hisilicon/qm.h +++ b/include/linux/hisi_acc_qm.h @@ -34,6 +34,41 @@ #define QM_WUSER_M_CFG_ENABLE 0x1000a8 #define WUSER_M_CFG_ENABLE 0xffffffff +/* mailbox */ +#define QM_MB_CMD_SQC 0x0 +#define QM_MB_CMD_CQC 0x1 +#define QM_MB_CMD_EQC 0x2 +#define QM_MB_CMD_AEQC 0x3 +#define QM_MB_CMD_SQC_BT 0x4 +#define QM_MB_CMD_CQC_BT 0x5 +#define QM_MB_CMD_SQC_VFT_V2 0x6 +#define QM_MB_CMD_STOP_QP 0x8 +#define QM_MB_CMD_SRC 0xc +#define QM_MB_CMD_DST 0xd + +#define QM_MB_CMD_SEND_BASE 0x300 +#define QM_MB_EVENT_SHIFT 8 +#define QM_MB_BUSY_SHIFT 13 +#define QM_MB_OP_SHIFT 14 +#define QM_MB_CMD_DATA_ADDR_L 0x304 +#define QM_MB_CMD_DATA_ADDR_H 0x308 +#define QM_MB_MAX_WAIT_CNT 6000 + +/* doorbell */ +#define QM_DOORBELL_CMD_SQ 0 +#define QM_DOORBELL_CMD_CQ 1 +#define QM_DOORBELL_CMD_EQ 2 +#define QM_DOORBELL_CMD_AEQ 3 + +#define QM_DOORBELL_SQ_CQ_BASE_V2 0x1000 +#define QM_DOORBELL_EQ_AEQ_BASE_V2 0x2000 +#define QM_QP_MAX_NUM_SHIFT 11 +#define QM_DB_CMD_SHIFT_V2 12 +#define QM_DB_RAND_SHIFT_V2 16 +#define QM_DB_INDEX_SHIFT_V2 32 +#define QM_DB_PRIORITY_SHIFT_V2 48 +#define QM_VF_STATE 0x60 + /* qm cache */ #define QM_CACHE_CTL 0x100050 #define SQC_CACHE_ENABLE BIT(0) @@ -128,6 +163,11 @@ enum qm_debug_file { DEBUG_FILE_NUM, }; +enum qm_vf_state { + QM_READY = 0, + QM_NOT_READY, +}; + struct qm_dfx { atomic64_t err_irq_cnt; atomic64_t aeq_irq_cnt; @@ -414,6 +454,10 @@ pci_ers_result_t hisi_qm_dev_slot_reset(struct pci_dev *pdev); void hisi_qm_reset_prepare(struct pci_dev *pdev); void hisi_qm_reset_done(struct pci_dev *pdev); +int hisi_qm_wait_mb_ready(struct hisi_qm *qm); +int hisi_qm_mb(struct hisi_qm *qm, u8 cmd, dma_addr_t dma_addr, u16 queue, + bool op); + struct hisi_acc_sgl_pool; struct hisi_acc_hw_sgl *hisi_acc_sg_buf_map_to_hw_sgl(struct device *dev, struct scatterlist *sgl, struct hisi_acc_sgl_pool *pool, @@ -438,4 +482,9 @@ void hisi_qm_pm_init(struct hisi_qm *qm); int hisi_qm_get_dfx_access(struct hisi_qm *qm); void hisi_qm_put_dfx_access(struct hisi_qm *qm); void hisi_qm_regs_dump(struct seq_file *s, struct debugfs_regset32 *regset); + +/* Used by VFIO ACC live migration driver */ +struct pci_driver *hisi_sec_get_pf_driver(void); +struct pci_driver *hisi_hpre_get_pf_driver(void); +struct pci_driver *hisi_zip_get_pf_driver(void); #endif diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h index 78655d8d13a7..319322a8ff94 100644 --- a/include/linux/mlx5/driver.h +++ b/include/linux/mlx5/driver.h @@ -1143,6 +1143,9 @@ int mlx5_dm_sw_icm_alloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type, int mlx5_dm_sw_icm_dealloc(struct mlx5_core_dev *dev, enum mlx5_sw_icm_type type, u64 length, u16 uid, phys_addr_t addr, u32 obj_id); +struct mlx5_core_dev *mlx5_vf_get_core_dev(struct pci_dev *pdev); +void mlx5_vf_put_core_dev(struct mlx5_core_dev *mdev); + #ifdef CONFIG_MLX5_CORE_IPOIB struct net_device *mlx5_rdma_netdev_alloc(struct mlx5_core_dev *mdev, struct ib_device *ibdev, diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index 49a48d7709ac..0235054fe55c 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -127,6 +127,11 @@ enum { MLX5_CMD_OP_QUERY_SF_PARTITION = 0x111, MLX5_CMD_OP_ALLOC_SF = 0x113, MLX5_CMD_OP_DEALLOC_SF = 0x114, + MLX5_CMD_OP_SUSPEND_VHCA = 0x115, + MLX5_CMD_OP_RESUME_VHCA = 0x116, + MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE = 0x117, + MLX5_CMD_OP_SAVE_VHCA_STATE = 0x118, + MLX5_CMD_OP_LOAD_VHCA_STATE = 0x119, MLX5_CMD_OP_CREATE_MKEY = 0x200, MLX5_CMD_OP_QUERY_MKEY = 0x201, MLX5_CMD_OP_DESTROY_MKEY = 0x202, @@ -1757,7 +1762,9 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 reserved_at_682[0x1]; u8 log_max_sf[0x5]; u8 apu[0x1]; - u8 reserved_at_689[0x7]; + u8 reserved_at_689[0x4]; + u8 migration[0x1]; + u8 reserved_at_68e[0x2]; u8 log_min_sf_size[0x8]; u8 max_num_sf_partitions[0x8]; @@ -11518,4 +11525,142 @@ enum { MLX5_MTT_PERM_RW = MLX5_MTT_PERM_READ | MLX5_MTT_PERM_WRITE, }; +enum { + MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_INITIATOR = 0x0, + MLX5_SUSPEND_VHCA_IN_OP_MOD_SUSPEND_RESPONDER = 0x1, +}; + +struct mlx5_ifc_suspend_vhca_in_bits { + u8 opcode[0x10]; + u8 uid[0x10]; + + u8 reserved_at_20[0x10]; + u8 op_mod[0x10]; + + u8 reserved_at_40[0x10]; + u8 vhca_id[0x10]; + + u8 reserved_at_60[0x20]; +}; + +struct mlx5_ifc_suspend_vhca_out_bits { + u8 status[0x8]; + u8 reserved_at_8[0x18]; + + u8 syndrome[0x20]; + + u8 reserved_at_40[0x40]; +}; + +enum { + MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_RESPONDER = 0x0, + MLX5_RESUME_VHCA_IN_OP_MOD_RESUME_INITIATOR = 0x1, +}; + +struct mlx5_ifc_resume_vhca_in_bits { + u8 opcode[0x10]; + u8 uid[0x10]; + + u8 reserved_at_20[0x10]; + u8 op_mod[0x10]; + + u8 reserved_at_40[0x10]; + u8 vhca_id[0x10]; + + u8 reserved_at_60[0x20]; +}; + +struct mlx5_ifc_resume_vhca_out_bits { + u8 status[0x8]; + u8 reserved_at_8[0x18]; + + u8 syndrome[0x20]; + + u8 reserved_at_40[0x40]; +}; + +struct mlx5_ifc_query_vhca_migration_state_in_bits { + u8 opcode[0x10]; + u8 uid[0x10]; + + u8 reserved_at_20[0x10]; + u8 op_mod[0x10]; + + u8 reserved_at_40[0x10]; + u8 vhca_id[0x10]; + + u8 reserved_at_60[0x20]; +}; + +struct mlx5_ifc_query_vhca_migration_state_out_bits { + u8 status[0x8]; + u8 reserved_at_8[0x18]; + + u8 syndrome[0x20]; + + u8 reserved_at_40[0x40]; + + u8 required_umem_size[0x20]; + + u8 reserved_at_a0[0x160]; +}; + +struct mlx5_ifc_save_vhca_state_in_bits { + u8 opcode[0x10]; + u8 uid[0x10]; + + u8 reserved_at_20[0x10]; + u8 op_mod[0x10]; + + u8 reserved_at_40[0x10]; + u8 vhca_id[0x10]; + + u8 reserved_at_60[0x20]; + + u8 va[0x40]; + + u8 mkey[0x20]; + + u8 size[0x20]; +}; + +struct mlx5_ifc_save_vhca_state_out_bits { + u8 status[0x8]; + u8 reserved_at_8[0x18]; + + u8 syndrome[0x20]; + + u8 actual_image_size[0x20]; + + u8 reserved_at_60[0x20]; +}; + +struct mlx5_ifc_load_vhca_state_in_bits { + u8 opcode[0x10]; + u8 uid[0x10]; + + u8 reserved_at_20[0x10]; + u8 op_mod[0x10]; + + u8 reserved_at_40[0x10]; + u8 vhca_id[0x10]; + + u8 reserved_at_60[0x20]; + + u8 va[0x40]; + + u8 mkey[0x20]; + + u8 size[0x20]; +}; + +struct mlx5_ifc_load_vhca_state_out_bits { + u8 status[0x8]; + u8 reserved_at_8[0x18]; + + u8 syndrome[0x20]; + + u8 reserved_at_40[0x40]; +}; + #endif /* MLX5_IFC_H */ diff --git a/include/linux/pci.h b/include/linux/pci.h index 8253a5413d7c..60d423d8f0c4 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -2166,7 +2166,8 @@ void __iomem *pci_ioremap_wc_bar(struct pci_dev *pdev, int bar); #ifdef CONFIG_PCI_IOV int pci_iov_virtfn_bus(struct pci_dev *dev, int id); int pci_iov_virtfn_devfn(struct pci_dev *dev, int id); - +int pci_iov_vf_id(struct pci_dev *dev); +void *pci_iov_get_pf_drvdata(struct pci_dev *dev, struct pci_driver *pf_driver); int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); void pci_disable_sriov(struct pci_dev *dev); @@ -2194,6 +2195,18 @@ static inline int pci_iov_virtfn_devfn(struct pci_dev *dev, int id) { return -ENOSYS; } + +static inline int pci_iov_vf_id(struct pci_dev *dev) +{ + return -ENOSYS; +} + +static inline void *pci_iov_get_pf_drvdata(struct pci_dev *dev, + struct pci_driver *pf_driver) +{ + return ERR_PTR(-EINVAL); +} + static inline int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn) { return -ENODEV; } diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h index aad54c666407..31dee2b65a62 100644 --- a/include/linux/pci_ids.h +++ b/include/linux/pci_ids.h @@ -2529,6 +2529,9 @@ #define PCI_DEVICE_ID_KORENIX_JETCARDF3 0x17ff #define PCI_VENDOR_ID_HUAWEI 0x19e5 +#define PCI_DEVICE_ID_HUAWEI_ZIP_VF 0xa251 +#define PCI_DEVICE_ID_HUAWEI_SEC_VF 0xa256 +#define PCI_DEVICE_ID_HUAWEI_HPRE_VF 0xa259 #define PCI_VENDOR_ID_NETRONOME 0x19ee #define PCI_DEVICE_ID_NETRONOME_NFP4000 0x4000 diff --git a/include/linux/vfio.h b/include/linux/vfio.h index 76191d7abed1..66dda06ec42d 100644 --- a/include/linux/vfio.h +++ b/include/linux/vfio.h @@ -33,6 +33,7 @@ struct vfio_device { struct vfio_group *group; struct vfio_device_set *dev_set; struct list_head dev_set_list; + unsigned int migration_flags; /* Members below here are private, not for driver use */ refcount_t refcount; @@ -55,6 +56,17 @@ struct vfio_device { * @match: Optional device name match callback (return: 0 for no-match, >0 for * match, -errno for abort (ex. match with insufficient or incorrect * additional args) + * @device_feature: Optional, fill in the VFIO_DEVICE_FEATURE ioctl + * @migration_set_state: Optional callback to change the migration state for + * devices that support migration. It's mandatory for + * VFIO_DEVICE_FEATURE_MIGRATION migration support. + * The returned FD is used for data transfer according to the FSM + * definition. The driver is responsible to ensure that FD reaches end + * of stream or error whenever the migration FSM leaves a data transfer + * state or before close_device() returns. + * @migration_get_state: Optional callback to get the migration state for + * devices that support migration. It's mandatory for + * VFIO_DEVICE_FEATURE_MIGRATION migration support. */ struct vfio_device_ops { char *name; @@ -69,8 +81,44 @@ struct vfio_device_ops { int (*mmap)(struct vfio_device *vdev, struct vm_area_struct *vma); void (*request)(struct vfio_device *vdev, unsigned int count); int (*match)(struct vfio_device *vdev, char *buf); + int (*device_feature)(struct vfio_device *device, u32 flags, + void __user *arg, size_t argsz); + struct file *(*migration_set_state)( + struct vfio_device *device, + enum vfio_device_mig_state new_state); + int (*migration_get_state)(struct vfio_device *device, + enum vfio_device_mig_state *curr_state); }; +/** + * vfio_check_feature - Validate user input for the VFIO_DEVICE_FEATURE ioctl + * @flags: Arg from the device_feature op + * @argsz: Arg from the device_feature op + * @supported_ops: Combination of VFIO_DEVICE_FEATURE_GET and SET the driver + * supports + * @minsz: Minimum data size the driver accepts + * + * For use in a driver's device_feature op. Checks that the inputs to the + * VFIO_DEVICE_FEATURE ioctl are correct for the driver's feature. Returns 1 if + * the driver should execute the get or set, otherwise the relevant + * value should be returned. + */ +static inline int vfio_check_feature(u32 flags, size_t argsz, u32 supported_ops, + size_t minsz) +{ + if ((flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET)) & + ~supported_ops) + return -EINVAL; + if (flags & VFIO_DEVICE_FEATURE_PROBE) + return 0; + /* Without PROBE one of GET or SET must be requested */ + if (!(flags & (VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_SET))) + return -EINVAL; + if (argsz < minsz) + return -EINVAL; + return 1; +} + void vfio_init_group_dev(struct vfio_device *device, struct device *dev, const struct vfio_device_ops *ops); void vfio_uninit_group_dev(struct vfio_device *device); @@ -82,6 +130,11 @@ extern void vfio_device_put(struct vfio_device *device); int vfio_assign_device_set(struct vfio_device *device, void *set_id); +int vfio_mig_get_next_state(struct vfio_device *device, + enum vfio_device_mig_state cur_fsm, + enum vfio_device_mig_state new_fsm, + enum vfio_device_mig_state *next_fsm); + /* * External user API */ diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index ef9a44b6cf5d..74a4a0f17b28 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -159,8 +159,17 @@ extern ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, extern ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf, size_t count, loff_t *ppos, bool iswrite); +#ifdef CONFIG_VFIO_PCI_VGA extern ssize_t vfio_pci_vga_rw(struct vfio_pci_core_device *vdev, char __user *buf, size_t count, loff_t *ppos, bool iswrite); +#else +static inline ssize_t vfio_pci_vga_rw(struct vfio_pci_core_device *vdev, + char __user *buf, size_t count, + loff_t *ppos, bool iswrite) +{ + return -EINVAL; +} +#endif extern long vfio_pci_ioeventfd(struct vfio_pci_core_device *vdev, loff_t offset, uint64_t data, int count, int fd); @@ -220,6 +229,8 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn); extern const struct pci_error_handlers vfio_pci_core_err_handlers; long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd, unsigned long arg); +int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags, + void __user *arg, size_t argsz); ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf, size_t count, loff_t *ppos); ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf, @@ -230,6 +241,8 @@ int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf); int vfio_pci_core_enable(struct vfio_pci_core_device *vdev); void vfio_pci_core_disable(struct vfio_pci_core_device *vdev); void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev); +pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev, + pci_channel_state_t state); static inline bool vfio_pci_is_vga(struct pci_dev *pdev) { diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index ef33ea002b0b..fea86061b44e 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -323,7 +323,7 @@ struct vfio_region_info_cap_type { #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff) #define VFIO_REGION_TYPE_GFX (1) #define VFIO_REGION_TYPE_CCW (2) -#define VFIO_REGION_TYPE_MIGRATION (3) +#define VFIO_REGION_TYPE_MIGRATION_DEPRECATED (3) /* sub-types for VFIO_REGION_TYPE_PCI_* */ @@ -405,225 +405,29 @@ struct vfio_region_gfx_edid { #define VFIO_REGION_SUBTYPE_CCW_CRW (3) /* sub-types for VFIO_REGION_TYPE_MIGRATION */ -#define VFIO_REGION_SUBTYPE_MIGRATION (1) - -/* - * The structure vfio_device_migration_info is placed at the 0th offset of - * the VFIO_REGION_SUBTYPE_MIGRATION region to get and set VFIO device related - * migration information. Field accesses from this structure are only supported - * at their native width and alignment. Otherwise, the result is undefined and - * vendor drivers should return an error. - * - * device_state: (read/write) - * - The user application writes to this field to inform the vendor driver - * about the device state to be transitioned to. - * - The vendor driver should take the necessary actions to change the - * device state. After successful transition to a given state, the - * vendor driver should return success on write(device_state, state) - * system call. If the device state transition fails, the vendor driver - * should return an appropriate -errno for the fault condition. - * - On the user application side, if the device state transition fails, - * that is, if write(device_state, state) returns an error, read - * device_state again to determine the current state of the device from - * the vendor driver. - * - The vendor driver should return previous state of the device unless - * the vendor driver has encountered an internal error, in which case - * the vendor driver may report the device_state VFIO_DEVICE_STATE_ERROR. - * - The user application must use the device reset ioctl to recover the - * device from VFIO_DEVICE_STATE_ERROR state. If the device is - * indicated to be in a valid device state by reading device_state, the - * user application may attempt to transition the device to any valid - * state reachable from the current state or terminate itself. - * - * device_state consists of 3 bits: - * - If bit 0 is set, it indicates the _RUNNING state. If bit 0 is clear, - * it indicates the _STOP state. When the device state is changed to - * _STOP, driver should stop the device before write() returns. - * - If bit 1 is set, it indicates the _SAVING state, which means that the - * driver should start gathering device state information that will be - * provided to the VFIO user application to save the device's state. - * - If bit 2 is set, it indicates the _RESUMING state, which means that - * the driver should prepare to resume the device. Data provided through - * the migration region should be used to resume the device. - * Bits 3 - 31 are reserved for future use. To preserve them, the user - * application should perform a read-modify-write operation on this - * field when modifying the specified bits. - * - * +------- _RESUMING - * |+------ _SAVING - * ||+----- _RUNNING - * ||| - * 000b => Device Stopped, not saving or resuming - * 001b => Device running, which is the default state - * 010b => Stop the device & save the device state, stop-and-copy state - * 011b => Device running and save the device state, pre-copy state - * 100b => Device stopped and the device state is resuming - * 101b => Invalid state - * 110b => Error state - * 111b => Invalid state - * - * State transitions: - * - * _RESUMING _RUNNING Pre-copy Stop-and-copy _STOP - * (100b) (001b) (011b) (010b) (000b) - * 0. Running or default state - * | - * - * 1. Normal Shutdown (optional) - * |------------------------------------->| - * - * 2. Save the state or suspend - * |------------------------->|---------->| - * - * 3. Save the state during live migration - * |----------->|------------>|---------->| - * - * 4. Resuming - * |<---------| - * - * 5. Resumed - * |--------->| - * - * 0. Default state of VFIO device is _RUNNING when the user application starts. - * 1. During normal shutdown of the user application, the user application may - * optionally change the VFIO device state from _RUNNING to _STOP. This - * transition is optional. The vendor driver must support this transition but - * must not require it. - * 2. When the user application saves state or suspends the application, the - * device state transitions from _RUNNING to stop-and-copy and then to _STOP. - * On state transition from _RUNNING to stop-and-copy, driver must stop the - * device, save the device state and send it to the application through the - * migration region. The sequence to be followed for such transition is given - * below. - * 3. In live migration of user application, the state transitions from _RUNNING - * to pre-copy, to stop-and-copy, and to _STOP. - * On state transition from _RUNNING to pre-copy, the driver should start - * gathering the device state while the application is still running and send - * the device state data to application through the migration region. - * On state transition from pre-copy to stop-and-copy, the driver must stop - * the device, save the device state and send it to the user application - * through the migration region. - * Vendor drivers must support the pre-copy state even for implementations - * where no data is provided to the user before the stop-and-copy state. The - * user must not be required to consume all migration data before the device - * transitions to a new state, including the stop-and-copy state. - * The sequence to be followed for above two transitions is given below. - * 4. To start the resuming phase, the device state should be transitioned from - * the _RUNNING to the _RESUMING state. - * In the _RESUMING state, the driver should use the device state data - * received through the migration region to resume the device. - * 5. After providing saved device data to the driver, the application should - * change the state from _RESUMING to _RUNNING. - * - * reserved: - * Reads on this field return zero and writes are ignored. - * - * pending_bytes: (read only) - * The number of pending bytes still to be migrated from the vendor driver. - * - * data_offset: (read only) - * The user application should read data_offset field from the migration - * region. The user application should read the device data from this - * offset within the migration region during the _SAVING state or write - * the device data during the _RESUMING state. See below for details of - * sequence to be followed. - * - * data_size: (read/write) - * The user application should read data_size to get the size in bytes of - * the data copied in the migration region during the _SAVING state and - * write the size in bytes of the data copied in the migration region - * during the _RESUMING state. - * - * The format of the migration region is as follows: - * ------------------------------------------------------------------ - * |vfio_device_migration_info| data section | - * | | /////////////////////////////// | - * ------------------------------------------------------------------ - * ^ ^ - * offset 0-trapped part data_offset - * - * The structure vfio_device_migration_info is always followed by the data - * section in the region, so data_offset will always be nonzero. The offset - * from where the data is copied is decided by the kernel driver. The data - * section can be trapped, mmapped, or partitioned, depending on how the kernel - * driver defines the data section. The data section partition can be defined - * as mapped by the sparse mmap capability. If mmapped, data_offset must be - * page aligned, whereas initial section which contains the - * vfio_device_migration_info structure, might not end at the offset, which is - * page aligned. The user is not required to access through mmap regardless - * of the capabilities of the region mmap. - * The vendor driver should determine whether and how to partition the data - * section. The vendor driver should return data_offset accordingly. - * - * The sequence to be followed while in pre-copy state and stop-and-copy state - * is as follows: - * a. Read pending_bytes, indicating the start of a new iteration to get device - * data. Repeated read on pending_bytes at this stage should have no side - * effects. - * If pending_bytes == 0, the user application should not iterate to get data - * for that device. - * If pending_bytes > 0, perform the following steps. - * b. Read data_offset, indicating that the vendor driver should make data - * available through the data section. The vendor driver should return this - * read operation only after data is available from (region + data_offset) - * to (region + data_offset + data_size). - * c. Read data_size, which is the amount of data in bytes available through - * the migration region. - * Read on data_offset and data_size should return the offset and size of - * the current buffer if the user application reads data_offset and - * data_size more than once here. - * d. Read data_size bytes of data from (region + data_offset) from the - * migration region. - * e. Process the data. - * f. Read pending_bytes, which indicates that the data from the previous - * iteration has been read. If pending_bytes > 0, go to step b. - * - * The user application can transition from the _SAVING|_RUNNING - * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the - * number of pending bytes. The user application should iterate in _SAVING - * (stop-and-copy) until pending_bytes is 0. - * - * The sequence to be followed while _RESUMING device state is as follows: - * While data for this device is available, repeat the following steps: - * a. Read data_offset from where the user application should write data. - * b. Write migration data starting at the migration region + data_offset for - * the length determined by data_size from the migration source. - * c. Write data_size, which indicates to the vendor driver that data is - * written in the migration region. Vendor driver must return this write - * operations on consuming data. Vendor driver should apply the - * user-provided migration region data to the device resume state. - * - * If an error occurs during the above sequences, the vendor driver can return - * an error code for next read() or write() operation, which will terminate the - * loop. The user application should then take the next necessary action, for - * example, failing migration or terminating the user application. - * - * For the user application, data is opaque. The user application should write - * data in the same order as the data is received and the data should be of - * same transaction size at the source. - */ +#define VFIO_REGION_SUBTYPE_MIGRATION_DEPRECATED (1) struct vfio_device_migration_info { __u32 device_state; /* VFIO device state */ -#define VFIO_DEVICE_STATE_STOP (0) -#define VFIO_DEVICE_STATE_RUNNING (1 << 0) -#define VFIO_DEVICE_STATE_SAVING (1 << 1) -#define VFIO_DEVICE_STATE_RESUMING (1 << 2) -#define VFIO_DEVICE_STATE_MASK (VFIO_DEVICE_STATE_RUNNING | \ - VFIO_DEVICE_STATE_SAVING | \ - VFIO_DEVICE_STATE_RESUMING) +#define VFIO_DEVICE_STATE_V1_STOP (0) +#define VFIO_DEVICE_STATE_V1_RUNNING (1 << 0) +#define VFIO_DEVICE_STATE_V1_SAVING (1 << 1) +#define VFIO_DEVICE_STATE_V1_RESUMING (1 << 2) +#define VFIO_DEVICE_STATE_MASK (VFIO_DEVICE_STATE_V1_RUNNING | \ + VFIO_DEVICE_STATE_V1_SAVING | \ + VFIO_DEVICE_STATE_V1_RESUMING) #define VFIO_DEVICE_STATE_VALID(state) \ - (state & VFIO_DEVICE_STATE_RESUMING ? \ - (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1) + (state & VFIO_DEVICE_STATE_V1_RESUMING ? \ + (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_V1_RESUMING : 1) #define VFIO_DEVICE_STATE_IS_ERROR(state) \ - ((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_SAVING | \ - VFIO_DEVICE_STATE_RESUMING)) + ((state & VFIO_DEVICE_STATE_MASK) == (VFIO_DEVICE_STATE_V1_SAVING | \ + VFIO_DEVICE_STATE_V1_RESUMING)) #define VFIO_DEVICE_STATE_SET_ERROR(state) \ - ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_SATE_SAVING | \ - VFIO_DEVICE_STATE_RESUMING) + ((state & ~VFIO_DEVICE_STATE_MASK) | VFIO_DEVICE_STATE_V1_SAVING | \ + VFIO_DEVICE_STATE_V1_RESUMING) __u32 reserved; __u64 pending_bytes; @@ -1002,6 +806,186 @@ struct vfio_device_feature { */ #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN (0) +/* + * Indicates the device can support the migration API through + * VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE. If this GET succeeds, the RUNNING and + * ERROR states are always supported. Support for additional states is + * indicated via the flags field; at least VFIO_MIGRATION_STOP_COPY must be + * set. + * + * VFIO_MIGRATION_STOP_COPY means that STOP, STOP_COPY and + * RESUMING are supported. + * + * VFIO_MIGRATION_STOP_COPY | VFIO_MIGRATION_P2P means that RUNNING_P2P + * is supported in addition to the STOP_COPY states. + * + * Other combinations of flags have behavior to be defined in the future. + */ +struct vfio_device_feature_migration { + __aligned_u64 flags; +#define VFIO_MIGRATION_STOP_COPY (1 << 0) +#define VFIO_MIGRATION_P2P (1 << 1) +}; +#define VFIO_DEVICE_FEATURE_MIGRATION 1 + +/* + * Upon VFIO_DEVICE_FEATURE_SET, execute a migration state change on the VFIO + * device. The new state is supplied in device_state, see enum + * vfio_device_mig_state for details + * + * The kernel migration driver must fully transition the device to the new state + * value before the operation returns to the user. + * + * The kernel migration driver must not generate asynchronous device state + * transitions outside of manipulation by the user or the VFIO_DEVICE_RESET + * ioctl as described above. + * + * If this function fails then current device_state may be the original + * operating state or some other state along the combination transition path. + * The user can then decide if it should execute a VFIO_DEVICE_RESET, attempt + * to return to the original state, or attempt to return to some other state + * such as RUNNING or STOP. + * + * If the new_state starts a new data transfer session then the FD associated + * with that session is returned in data_fd. The user is responsible to close + * this FD when it is finished. The user must consider the migration data stream + * carried over the FD to be opaque and must preserve the byte order of the + * stream. The user is not required to preserve buffer segmentation when writing + * the data stream during the RESUMING operation. + * + * Upon VFIO_DEVICE_FEATURE_GET, get the current migration state of the VFIO + * device, data_fd will be -1. + */ +struct vfio_device_feature_mig_state { + __u32 device_state; /* From enum vfio_device_mig_state */ + __s32 data_fd; +}; +#define VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE 2 + +/* + * The device migration Finite State Machine is described by the enum + * vfio_device_mig_state. Some of the FSM arcs will create a migration data + * transfer session by returning a FD, in this case the migration data will + * flow over the FD using read() and write() as discussed below. + * + * There are 5 states to support VFIO_MIGRATION_STOP_COPY: + * RUNNING - The device is running normally + * STOP - The device does not change the internal or external state + * STOP_COPY - The device internal state can be read out + * RESUMING - The device is stopped and is loading a new internal state + * ERROR - The device has failed and must be reset + * + * And 1 optional state to support VFIO_MIGRATION_P2P: + * RUNNING_P2P - RUNNING, except the device cannot do peer to peer DMA + * + * The FSM takes actions on the arcs between FSM states. The driver implements + * the following behavior for the FSM arcs: + * + * RUNNING_P2P -> STOP + * STOP_COPY -> STOP + * While in STOP the device must stop the operation of the device. The device + * must not generate interrupts, DMA, or any other change to external state. + * It must not change its internal state. When stopped the device and kernel + * migration driver must accept and respond to interaction to support external + * subsystems in the STOP state, for example PCI MSI-X and PCI config space. + * Failure by the user to restrict device access while in STOP must not result + * in error conditions outside the user context (ex. host system faults). + * + * The STOP_COPY arc will terminate a data transfer session. + * + * RESUMING -> STOP + * Leaving RESUMING terminates a data transfer session and indicates the + * device should complete processing of the data delivered by write(). The + * kernel migration driver should complete the incorporation of data written + * to the data transfer FD into the device internal state and perform + * final validity and consistency checking of the new device state. If the + * user provided data is found to be incomplete, inconsistent, or otherwise + * invalid, the migration driver must fail the SET_STATE ioctl and + * optionally go to the ERROR state as described below. + * + * While in STOP the device has the same behavior as other STOP states + * described above. + * + * To abort a RESUMING session the device must be reset. + * + * RUNNING_P2P -> RUNNING + * While in RUNNING the device is fully operational, the device may generate + * interrupts, DMA, respond to MMIO, all vfio device regions are functional, + * and the device may advance its internal state. + * + * RUNNING -> RUNNING_P2P + * STOP -> RUNNING_P2P + * While in RUNNING_P2P the device is partially running in the P2P quiescent + * state defined below. + * + * STOP -> STOP_COPY + * This arc begin the process of saving the device state and will return a + * new data_fd. + * + * While in the STOP_COPY state the device has the same behavior as STOP + * with the addition that the data transfers session continues to stream the + * migration state. End of stream on the FD indicates the entire device + * state has been transferred. + * + * The user should take steps to restrict access to vfio device regions while + * the device is in STOP_COPY or risk corruption of the device migration data + * stream. + * + * STOP -> RESUMING + * Entering the RESUMING state starts a process of restoring the device state + * and will return a new data_fd. The data stream fed into the data_fd should + * be taken from the data transfer output of a single FD during saving from + * a compatible device. The migration driver may alter/reset the internal + * device state for this arc if required to prepare the device to receive the + * migration data. + * + * any -> ERROR + * ERROR cannot be specified as a device state, however any transition request + * can be failed with an errno return and may then move the device_state into + * ERROR. In this case the device was unable to execute the requested arc and + * was also unable to restore the device to any valid device_state. + * To recover from ERROR VFIO_DEVICE_RESET must be used to return the + * device_state back to RUNNING. + * + * The optional peer to peer (P2P) quiescent state is intended to be a quiescent + * state for the device for the purposes of managing multiple devices within a + * user context where peer-to-peer DMA between devices may be active. The + * RUNNING_P2P states must prevent the device from initiating + * any new P2P DMA transactions. If the device can identify P2P transactions + * then it can stop only P2P DMA, otherwise it must stop all DMA. The migration + * driver must complete any such outstanding operations prior to completing the + * FSM arc into a P2P state. For the purpose of specification the states + * behave as though the device was fully running if not supported. Like while in + * STOP or STOP_COPY the user must not touch the device, otherwise the state + * can be exited. + * + * The remaining possible transitions are interpreted as combinations of the + * above FSM arcs. As there are multiple paths through the FSM arcs the path + * should be selected based on the following rules: + * - Select the shortest path. + * Refer to vfio_mig_get_next_state() for the result of the algorithm. + * + * The automatic transit through the FSM arcs that make up the combination + * transition is invisible to the user. When working with combination arcs the + * user may see any step along the path in the device_state if SET_STATE + * fails. When handling these types of errors users should anticipate future + * revisions of this protocol using new states and those states becoming + * visible in this case. + * + * The optional states cannot be used with SET_STATE if the device does not + * support them. The user can discover if these states are supported by using + * VFIO_DEVICE_FEATURE_MIGRATION. By using combination transitions the user can + * avoid knowing about these optional states if the kernel driver supports them. + */ +enum vfio_device_mig_state { + VFIO_DEVICE_STATE_ERROR = 0, + VFIO_DEVICE_STATE_STOP = 1, + VFIO_DEVICE_STATE_RUNNING = 2, + VFIO_DEVICE_STATE_STOP_COPY = 3, + VFIO_DEVICE_STATE_RESUMING = 4, + VFIO_DEVICE_STATE_RUNNING_P2P = 5, +}; + /* -------- API for Type1 VFIO IOMMU -------- */ /** |