From 5f1f79bbc9e26fa9412fa9522f957bb8f030c442 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Thu, 7 May 2020 16:01:25 +0200 Subject: virtio-mem: Paravirtualized memory hotplug Each virtio-mem device owns exactly one memory region. It is responsible for adding/removing memory from that memory region on request. When the device driver starts up, the requested amount of memory is queried and then plugged to Linux. On request, further memory can be plugged or unplugged. This patch only implements the plugging part. On x86-64, memory can currently be plugged in 4MB ("subblock") granularity. When required, a new memory block will be added (e.g., usually 128MB on x86-64) in order to plug more subblocks. Only x86-64 was tested for now. The online_page callback is used to keep unplugged subblocks offline when onlining memory - similar to the Hyper-V balloon driver. Unplugged pages are marked PG_offline, to tell dump tools (e.g., makedumpfile) to skip them. User space is usually responsible for onlining the added memory. The memory hotplug notifier is used to synchronize virtio-mem activity against memory onlining/offlining. Each virtio-mem device can belong to a NUMA node, which allows us to easily add/remove small chunks of memory to/from a specific NUMA node by using multiple virtio-mem devices. Something that works even when the guest has no idea about the NUMA topology. One way to view virtio-mem is as a "resizable DIMM" or a DIMM with many "sub-DIMMS". This patch directly introduces the basic infrastructure to implement memory unplug. Especially the memory block states and subblock bitmaps will be heavily used there. Notes: - In case memory is to be onlined by user space, we limit the amount of offline memory blocks, to not run out of memory. This is esp. an issue if memory is added faster than it is getting onlined. - Suspend/Hibernate is not supported due to the way virtio-mem devices behave. Limited support might be possible in the future. - Reloading the device driver is not supported. Reviewed-by: Pankaj Gupta Tested-by: Pankaj Gupta Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Oscar Salvador Cc: Michal Hocko Cc: Igor Mammedov Cc: Dave Young Cc: Andrew Morton Cc: Dan Williams Cc: Pavel Tatashin Cc: Stefan Hajnoczi Cc: Vlastimil Babka Cc: "Rafael J. Wysocki" Cc: Len Brown Cc: linux-acpi@vger.kernel.org Signed-off-by: David Hildenbrand Link: https://lore.kernel.org/r/20200507140139.17083-2-david@redhat.com Signed-off-by: Michael S. Tsirkin --- include/uapi/linux/virtio_mem.h | 200 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 200 insertions(+) create mode 100644 include/uapi/linux/virtio_mem.h (limited to 'include/uapi/linux/virtio_mem.h') diff --git a/include/uapi/linux/virtio_mem.h b/include/uapi/linux/virtio_mem.h new file mode 100644 index 000000000000..1bfade78bdfd --- /dev/null +++ b/include/uapi/linux/virtio_mem.h @@ -0,0 +1,200 @@ +/* SPDX-License-Identifier: BSD-3-Clause */ +/* + * Virtio Mem Device + * + * Copyright Red Hat, Inc. 2020 + * + * Authors: + * David Hildenbrand + * + * This header is BSD licensed so anyone can use the definitions + * to implement compatible drivers/servers: + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * 3. Neither the name of IBM nor the names of its contributors + * may be used to endorse or promote products derived from this software + * without specific prior written permission. + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS + * FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL IBM OR + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF + * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT + * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +#ifndef _LINUX_VIRTIO_MEM_H +#define _LINUX_VIRTIO_MEM_H + +#include +#include +#include +#include + +/* + * Each virtio-mem device manages a dedicated region in physical address + * space. Each device can belong to a single NUMA node, multiple devices + * for a single NUMA node are possible. A virtio-mem device is like a + * "resizable DIMM" consisting of small memory blocks that can be plugged + * or unplugged. The device driver is responsible for (un)plugging memory + * blocks on demand. + * + * Virtio-mem devices can only operate on their assigned memory region in + * order to (un)plug memory. A device cannot (un)plug memory belonging to + * other devices. + * + * The "region_size" corresponds to the maximum amount of memory that can + * be provided by a device. The "size" corresponds to the amount of memory + * that is currently plugged. "requested_size" corresponds to a request + * from the device to the device driver to (un)plug blocks. The + * device driver should try to (un)plug blocks in order to reach the + * "requested_size". It is impossible to plug more memory than requested. + * + * The "usable_region_size" represents the memory region that can actually + * be used to (un)plug memory. It is always at least as big as the + * "requested_size" and will grow dynamically. It will only shrink when + * explicitly triggered (VIRTIO_MEM_REQ_UNPLUG). + * + * There are no guarantees what will happen if unplugged memory is + * read/written. Such memory should, in general, not be touched. E.g., + * even writing might succeed, but the values will simply be discarded at + * random points in time. + * + * It can happen that the device cannot process a request, because it is + * busy. The device driver has to retry later. + * + * Usually, during system resets all memory will get unplugged, so the + * device driver can start with a clean state. However, in specific + * scenarios (if the device is busy) it can happen that the device still + * has memory plugged. The device driver can request to unplug all memory + * (VIRTIO_MEM_REQ_UNPLUG) - which might take a while to succeed if the + * device is busy. + */ + +/* --- virtio-mem: guest -> host requests --- */ + +/* request to plug memory blocks */ +#define VIRTIO_MEM_REQ_PLUG 0 +/* request to unplug memory blocks */ +#define VIRTIO_MEM_REQ_UNPLUG 1 +/* request to unplug all blocks and shrink the usable size */ +#define VIRTIO_MEM_REQ_UNPLUG_ALL 2 +/* request information about the plugged state of memory blocks */ +#define VIRTIO_MEM_REQ_STATE 3 + +struct virtio_mem_req_plug { + __virtio64 addr; + __virtio16 nb_blocks; +}; + +struct virtio_mem_req_unplug { + __virtio64 addr; + __virtio16 nb_blocks; +}; + +struct virtio_mem_req_state { + __virtio64 addr; + __virtio16 nb_blocks; +}; + +struct virtio_mem_req { + __virtio16 type; + __virtio16 padding[3]; + + union { + struct virtio_mem_req_plug plug; + struct virtio_mem_req_unplug unplug; + struct virtio_mem_req_state state; + } u; +}; + + +/* --- virtio-mem: host -> guest response --- */ + +/* + * Request processed successfully, applicable for + * - VIRTIO_MEM_REQ_PLUG + * - VIRTIO_MEM_REQ_UNPLUG + * - VIRTIO_MEM_REQ_UNPLUG_ALL + * - VIRTIO_MEM_REQ_STATE + */ +#define VIRTIO_MEM_RESP_ACK 0 +/* + * Request denied - e.g. trying to plug more than requested, applicable for + * - VIRTIO_MEM_REQ_PLUG + */ +#define VIRTIO_MEM_RESP_NACK 1 +/* + * Request cannot be processed right now, try again later, applicable for + * - VIRTIO_MEM_REQ_PLUG + * - VIRTIO_MEM_REQ_UNPLUG + * - VIRTIO_MEM_REQ_UNPLUG_ALL + */ +#define VIRTIO_MEM_RESP_BUSY 2 +/* + * Error in request (e.g. addresses/alignment), applicable for + * - VIRTIO_MEM_REQ_PLUG + * - VIRTIO_MEM_REQ_UNPLUG + * - VIRTIO_MEM_REQ_STATE + */ +#define VIRTIO_MEM_RESP_ERROR 3 + + +/* State of memory blocks is "plugged" */ +#define VIRTIO_MEM_STATE_PLUGGED 0 +/* State of memory blocks is "unplugged" */ +#define VIRTIO_MEM_STATE_UNPLUGGED 1 +/* State of memory blocks is "mixed" */ +#define VIRTIO_MEM_STATE_MIXED 2 + +struct virtio_mem_resp_state { + __virtio16 state; +}; + +struct virtio_mem_resp { + __virtio16 type; + __virtio16 padding[3]; + + union { + struct virtio_mem_resp_state state; + } u; +}; + +/* --- virtio-mem: configuration --- */ + +struct virtio_mem_config { + /* Block size and alignment. Cannot change. */ + __u32 block_size; + __u32 padding; + /* Start address of the memory region. Cannot change. */ + __u64 addr; + /* Region size (maximum). Cannot change. */ + __u64 region_size; + /* + * Currently usable region size. Can grow up to region_size. Can + * shrink due to VIRTIO_MEM_REQ_UNPLUG_ALL (in which case no config + * update will be sent). + */ + __u64 usable_region_size; + /* + * Currently used size. Changes due to plug/unplug requests, but no + * config updates will be sent. + */ + __u64 plugged_size; + /* Requested size. New plug requests cannot exceed it. Can change. */ + __u64 requested_size; +}; + +#endif /* _LINUX_VIRTIO_MEM_H */ -- cgit v1.2.3 From f2af6d3978d74a7891d0f428537b4494498202cb Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Thu, 7 May 2020 16:01:27 +0200 Subject: virtio-mem: Allow to specify an ACPI PXM as nid We want to allow to specify (similar as for a DIMM), to which node a virtio-mem device (and, therefore, its memory) belongs. Add a new virtio-mem feature flag and export pxm_to_node, so it can be used in kernel module context. Acked-by: Michal Hocko # for the export Acked-by: "Rafael J. Wysocki" # for the export Acked-by: Pankaj Gupta Tested-by: Pankaj Gupta Cc: "Michael S. Tsirkin" Cc: Jason Wang Cc: Oscar Salvador Cc: Michal Hocko Cc: Igor Mammedov Cc: Dave Young Cc: Andrew Morton Cc: Dan Williams Cc: Pavel Tatashin Cc: Stefan Hajnoczi Cc: Vlastimil Babka Cc: Len Brown Cc: linux-acpi@vger.kernel.org Signed-off-by: David Hildenbrand Link: https://lore.kernel.org/r/20200507140139.17083-4-david@redhat.com Signed-off-by: Michael S. Tsirkin --- drivers/acpi/numa/srat.c | 1 + drivers/virtio/virtio_mem.c | 39 +++++++++++++++++++++++++++++++++++++-- include/uapi/linux/virtio_mem.h | 10 +++++++++- 3 files changed, 47 insertions(+), 3 deletions(-) (limited to 'include/uapi/linux/virtio_mem.h') diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c index 47b4969d9b93..5be5a977da1b 100644 --- a/drivers/acpi/numa/srat.c +++ b/drivers/acpi/numa/srat.c @@ -35,6 +35,7 @@ int pxm_to_node(int pxm) return NUMA_NO_NODE; return pxm_to_node_map[pxm]; } +EXPORT_SYMBOL(pxm_to_node); int node_to_pxm(int node) { diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 5d1dcaa6fc42..270ddeaec059 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -21,6 +21,8 @@ #include #include +#include + enum virtio_mem_mb_state { /* Unplugged, not added to Linux. Can be reused later. */ VIRTIO_MEM_MB_STATE_UNUSED = 0, @@ -72,6 +74,8 @@ struct virtio_mem { /* The device block size (for communicating with the device). */ uint32_t device_block_size; + /* The translated node id. NUMA_NO_NODE in case not specified. */ + int nid; /* Physical start address of the memory region. */ uint64_t addr; /* Maximum region size in bytes. */ @@ -389,7 +393,10 @@ static int virtio_mem_sb_bitmap_prepare_next_mb(struct virtio_mem *vm) static int virtio_mem_mb_add(struct virtio_mem *vm, unsigned long mb_id) { const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id); - int nid = memory_add_physaddr_to_nid(addr); + int nid = vm->nid; + + if (nid == NUMA_NO_NODE) + nid = memory_add_physaddr_to_nid(addr); dev_dbg(&vm->vdev->dev, "adding memory block: %lu\n", mb_id); return add_memory(nid, addr, memory_block_size_bytes()); @@ -407,7 +414,10 @@ static int virtio_mem_mb_add(struct virtio_mem *vm, unsigned long mb_id) static int virtio_mem_mb_remove(struct virtio_mem *vm, unsigned long mb_id) { const uint64_t addr = virtio_mem_mb_id_to_phys(mb_id); - int nid = memory_add_physaddr_to_nid(addr); + int nid = vm->nid; + + if (nid == NUMA_NO_NODE) + nid = memory_add_physaddr_to_nid(addr); dev_dbg(&vm->vdev->dev, "removing memory block: %lu\n", mb_id); return remove_memory(nid, addr, memory_block_size_bytes()); @@ -426,6 +436,17 @@ static void virtio_mem_retry(struct virtio_mem *vm) spin_unlock_irqrestore(&vm->removal_lock, flags); } +static int virtio_mem_translate_node_id(struct virtio_mem *vm, uint16_t node_id) +{ + int node = NUMA_NO_NODE; + +#if defined(CONFIG_ACPI_NUMA) + if (virtio_has_feature(vm->vdev, VIRTIO_MEM_F_ACPI_PXM)) + node = pxm_to_node(node_id); +#endif + return node; +} + /* * Test if a virtio-mem device overlaps with the given range. Can be called * from (notifier) callbacks lockless. @@ -1267,6 +1288,7 @@ static bool virtio_mem_any_memory_present(unsigned long start, static int virtio_mem_init(struct virtio_mem *vm) { const uint64_t phys_limit = 1UL << MAX_PHYSMEM_BITS; + uint16_t node_id; if (!vm->vdev->config->get) { dev_err(&vm->vdev->dev, "config access disabled\n"); @@ -1287,6 +1309,9 @@ static int virtio_mem_init(struct virtio_mem *vm) &vm->plugged_size); virtio_cread(vm->vdev, struct virtio_mem_config, block_size, &vm->device_block_size); + virtio_cread(vm->vdev, struct virtio_mem_config, node_id, + &node_id); + vm->nid = virtio_mem_translate_node_id(vm, node_id); virtio_cread(vm->vdev, struct virtio_mem_config, addr, &vm->addr); virtio_cread(vm->vdev, struct virtio_mem_config, region_size, &vm->region_size); @@ -1365,6 +1390,8 @@ static int virtio_mem_init(struct virtio_mem *vm) memory_block_size_bytes()); dev_info(&vm->vdev->dev, "subblock size: 0x%x", vm->subblock_size); + if (vm->nid != NUMA_NO_NODE) + dev_info(&vm->vdev->dev, "nid: %d", vm->nid); return 0; } @@ -1508,12 +1535,20 @@ static int virtio_mem_restore(struct virtio_device *vdev) } #endif +static unsigned int virtio_mem_features[] = { +#if defined(CONFIG_NUMA) && defined(CONFIG_ACPI_NUMA) + VIRTIO_MEM_F_ACPI_PXM, +#endif +}; + static struct virtio_device_id virtio_mem_id_table[] = { { VIRTIO_ID_MEM, VIRTIO_DEV_ANY_ID }, { 0 }, }; static struct virtio_driver virtio_mem_driver = { + .feature_table = virtio_mem_features, + .feature_table_size = ARRAY_SIZE(virtio_mem_features), .driver.name = KBUILD_MODNAME, .driver.owner = THIS_MODULE, .id_table = virtio_mem_id_table, diff --git a/include/uapi/linux/virtio_mem.h b/include/uapi/linux/virtio_mem.h index 1bfade78bdfd..e0a9dc7397c3 100644 --- a/include/uapi/linux/virtio_mem.h +++ b/include/uapi/linux/virtio_mem.h @@ -83,6 +83,12 @@ * device is busy. */ +/* --- virtio-mem: feature bits --- */ + +/* node_id is an ACPI PXM and is valid */ +#define VIRTIO_MEM_F_ACPI_PXM 0 + + /* --- virtio-mem: guest -> host requests --- */ /* request to plug memory blocks */ @@ -177,7 +183,9 @@ struct virtio_mem_resp { struct virtio_mem_config { /* Block size and alignment. Cannot change. */ __u32 block_size; - __u32 padding; + /* Valid with VIRTIO_MEM_F_ACPI_PXM. Cannot change. */ + __u16 node_id; + __u16 padding; /* Start address of the memory region. Cannot change. */ __u64 addr; /* Region size (maximum). Cannot change. */ -- cgit v1.2.3 From fce8afd76e3a4d8c59c92f84f8027569fd7031d0 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Fri, 15 May 2020 12:14:02 +0200 Subject: virtio-mem: Don't rely on implicit compiler padding for requests The compiler will add padding after the last member, make that explicit. The size of a request is always 24 bytes. The size of a response always 10 bytes. Add compile-time checks. Cc: "Michael S. Tsirkin" Cc: Pankaj Gupta Cc: teawater Signed-off-by: David Hildenbrand Link: https://lore.kernel.org/r/20200515101402.16597-1-david@redhat.com Signed-off-by: Michael S. Tsirkin --- drivers/virtio/virtio_mem.c | 3 +++ include/uapi/linux/virtio_mem.h | 3 +++ 2 files changed, 6 insertions(+) (limited to 'include/uapi/linux/virtio_mem.h') diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 9e523db3bee1..f658fe9149be 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -1770,6 +1770,9 @@ static int virtio_mem_probe(struct virtio_device *vdev) struct virtio_mem *vm; int rc = -EINVAL; + BUILD_BUG_ON(sizeof(struct virtio_mem_req) != 24); + BUILD_BUG_ON(sizeof(struct virtio_mem_resp) != 10); + vdev->priv = vm = kzalloc(sizeof(*vm), GFP_KERNEL); if (!vm) return -ENOMEM; diff --git a/include/uapi/linux/virtio_mem.h b/include/uapi/linux/virtio_mem.h index e0a9dc7397c3..a455c488a995 100644 --- a/include/uapi/linux/virtio_mem.h +++ b/include/uapi/linux/virtio_mem.h @@ -103,16 +103,19 @@ struct virtio_mem_req_plug { __virtio64 addr; __virtio16 nb_blocks; + __virtio16 padding[3]; }; struct virtio_mem_req_unplug { __virtio64 addr; __virtio16 nb_blocks; + __virtio16 padding[3]; }; struct virtio_mem_req_state { __virtio64 addr; __virtio16 nb_blocks; + __virtio16 padding[3]; }; struct virtio_mem_req { -- cgit v1.2.3 From 544fc7dbbf920a3e64d109c416ee229e8e1763c5 Mon Sep 17 00:00:00 2001 From: "Michael S. Tsirkin" Date: Mon, 8 Jun 2020 02:03:15 -0400 Subject: virtio_mem: convert device block size into 64bit If subblock size is large (e.g. 1G) 32 bit math involving it can overflow. Rather than try to catch all instances of that, let's tweak block size to 64 bit. It ripples through UAPI which is an ABI change, but it's not too late to make it, and it will allow supporting >4Gbyte blocks while might become necessary down the road. Fixes: 5f1f79bbc9e26 ("virtio-mem: Paravirtualized memory hotplug") Signed-off-by: Michael S. Tsirkin Acked-by: David Hildenbrand --- drivers/virtio/virtio_mem.c | 18 +++++++++--------- include/uapi/linux/virtio_mem.h | 4 ++-- 2 files changed, 11 insertions(+), 11 deletions(-) (limited to 'include/uapi/linux/virtio_mem.h') diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index 2f357142ea5e..50c689f25045 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -77,7 +77,7 @@ struct virtio_mem { uint64_t requested_size; /* The device block size (for communicating with the device). */ - uint32_t device_block_size; + uint64_t device_block_size; /* The translated node id. NUMA_NO_NODE in case not specified. */ int nid; /* Physical start address of the memory region. */ @@ -86,7 +86,7 @@ struct virtio_mem { uint64_t region_size; /* The subblock size. */ - uint32_t subblock_size; + uint64_t subblock_size; /* The number of subblocks per memory block. */ uint32_t nb_sb_per_mb; @@ -1698,9 +1698,9 @@ static int virtio_mem_init(struct virtio_mem *vm) * - At least the device block size. * In the worst case, a single subblock per memory block. */ - vm->subblock_size = PAGE_SIZE * 1u << max_t(uint32_t, MAX_ORDER - 1, - pageblock_order); - vm->subblock_size = max_t(uint32_t, vm->device_block_size, + vm->subblock_size = PAGE_SIZE * 1ul << max_t(uint32_t, MAX_ORDER - 1, + pageblock_order); + vm->subblock_size = max_t(uint64_t, vm->device_block_size, vm->subblock_size); vm->nb_sb_per_mb = memory_block_size_bytes() / vm->subblock_size; @@ -1713,12 +1713,12 @@ static int virtio_mem_init(struct virtio_mem *vm) dev_info(&vm->vdev->dev, "start address: 0x%llx", vm->addr); dev_info(&vm->vdev->dev, "region size: 0x%llx", vm->region_size); - dev_info(&vm->vdev->dev, "device block size: 0x%x", - vm->device_block_size); + dev_info(&vm->vdev->dev, "device block size: 0x%llx", + (unsigned long long)vm->device_block_size); dev_info(&vm->vdev->dev, "memory block size: 0x%lx", memory_block_size_bytes()); - dev_info(&vm->vdev->dev, "subblock size: 0x%x", - vm->subblock_size); + dev_info(&vm->vdev->dev, "subblock size: 0x%llx", + (unsigned long long)vm->subblock_size); if (vm->nid != NUMA_NO_NODE) dev_info(&vm->vdev->dev, "nid: %d", vm->nid); diff --git a/include/uapi/linux/virtio_mem.h b/include/uapi/linux/virtio_mem.h index a455c488a995..a9ffe041843c 100644 --- a/include/uapi/linux/virtio_mem.h +++ b/include/uapi/linux/virtio_mem.h @@ -185,10 +185,10 @@ struct virtio_mem_resp { struct virtio_mem_config { /* Block size and alignment. Cannot change. */ - __u32 block_size; + __u64 block_size; /* Valid with VIRTIO_MEM_F_ACPI_PXM. Cannot change. */ __u16 node_id; - __u16 padding; + __u8 padding[6]; /* Start address of the memory region. Cannot change. */ __u64 addr; /* Region size (maximum). Cannot change. */ -- cgit v1.2.3