author | Gregory Price <gourry@gourry.net> | 2025-05-12 19:21:25 +0300 |
---|---|---|
committer | Dave Jiang <dave.jiang@intel.com> | 2025-05-13 23:07:45 +0300 |
commit | bef826ead3cde244b1a7393bd0c7eee7bbbec8cf (patch) | |
tree | 775b96a1852552af03ccece38199ecf245bc5564 | |
parent | 9bd8546e59542657a7c5a4ee02ea67d3e68f964a (diff) | |
cxl: docs/linux - early boot configuration
Document __init time configurations that affect CXL driver probe
process and memory region configuration.
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Link: https://patch.msgid.link/20250512162134.3596150-9-gourry@gourry.net
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
-rw-r--r-- | Documentation/driver-api/cxl/index.rst | 1 |
-rw-r--r-- | Documentation/driver-api/cxl/linux/early-boot.rst | 131 |
2 files changed, 132 insertions, 0 deletions
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index bc2228c77c32..d2eefe575604 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -34,6 +34,7 @@ that have impacts on each other. The docs here break up configurations steps.
    :caption: Linux Kernel Configuration
 
    linux/overview
+   linux/early-boot
    linux/access-coordinates

diff --git a/Documentation/driver-api/cxl/linux/early-boot.rst b/Documentation/driver-api/cxl/linux/early-boot.rst
new file mode 100644
index 000000000000..8c1c497bc772
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/early-boot.rst
@@ -0,0 +1,131 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+Linux Init (Early Boot)
+=======================
+
+Linux configuration is split into two major steps: Early-Boot and everything else.
+
+During early boot, Linux sets up immutable resources (such as numa nodes), while
+later operations include things like driver probe and memory hotplug. Linux may
+read EFI and ACPI information throughout this process to configure logical
+representations of the devices.
+
+During Linux Early Boot stage (functions in the kernel that have the __init
+decorator), the system takes the resources created by EFI/BIOS (ACPI tables)
+and turns them into resources that the kernel can consume.
+
+
+BIOS, Build and Boot Options
+============================
+
+There are 4 pre-boot options that need to be considered during kernel build
+which dictate how memory will be managed by Linux during early boot.
+
+* EFI_MEMORY_SP
+
+  * BIOS/EFI Option that dictates whether memory is SystemRAM or
+    Specific Purpose. Specific Purpose memory will be deferred to
+    drivers to manage - and not immediately exposed as system RAM.
+
+* CONFIG_EFI_SOFT_RESERVE
+
+  * Linux Build config option that dictates whether the kernel supports
+    Specific Purpose memory.
+
+* CONFIG_MHP_DEFAULT_ONLINE_TYPE
+
+  * Linux Build config that dictates whether and how Specific Purpose memory
+    converted to a dax device should be managed (left as DAX or onlined as
+    SystemRAM in ZONE_NORMAL or ZONE_MOVABLE).
+
+* nosoftreserve
+
+  * Linux kernel boot option that dictates whether Soft Reserve should be
+    supported. Similar to CONFIG_EFI_SOFT_RESERVE.
+
+Memory Map Creation
+===================
+
+While the kernel parses the EFI memory map, if :code:`Specific Purpose` memory
+is supported and detected, it will set this region aside as
+:code:`SOFT_RESERVED`.
+
+If :code:`EFI_MEMORY_SP=0`, :code:`CONFIG_EFI_SOFT_RESERVE=n`, or
+:code:`nosoftreserve=y` - Linux will default a CXL device memory region to
+SystemRAM. This will expose the memory to the kernel page allocator in
+:code:`ZONE_NORMAL`, making it available for use for most allocations (including
+:code:`struct page` and page tables).
+
+If `Specific Purpose` is set and supported, :code:`CONFIG_MHP_DEFAULT_ONLINE_TYPE_*`
+dictates whether the memory is onlined by default (:code:`_OFFLINE` or
+:code:`_ONLINE_*`), and if online which zone to online this memory to by default
+(:code:`_NORMAL` or :code:`_MOVABLE`).
+
+If placed in :code:`ZONE_MOVABLE`, the memory will not be available for most
+kernel allocations (such as :code:`struct page` or page tables). This may
+significantly impact performance depending on the memory capacity of the system.
+
+
+NUMA Node Reservation
+=====================
+
+Linux refers to the proximity domains (:code:`PXM`) defined in the SRAT to
+create NUMA nodes in :code:`acpi_numa_init`. Typically, there is a 1:1 relation
+between :code:`PXM` and NUMA node IDs.
+
+SRAT is the only ACPI defined way of defining Proximity Domains. Linux chooses
+to, at most, map those 1:1 with NUMA nodes. CEDT adds a description of SPA
+ranges which Linux may wish to map to one or more NUMA nodes.
+
+If there are CXL ranges in the CFMWS but not in SRAT, then a fake :code:`PXM`
+is created (as of v6.15). In the future, Linux may reject CFMWS not described
+by SRAT due to the ambiguity of proximity domain association.
+
+It is important to note that NUMA node creation cannot be done at runtime. All
+possible NUMA nodes are identified at :code:`__init` time, more specifically
+during :code:`mm_init`. The CEDT and SRAT must contain sufficient :code:`PXM`
+data for Linux to identify NUMA nodes and their associated memory regions.
+
+The relevant code exists in: :code:`linux/drivers/acpi/numa/srat.c`.
+
+See the Example Platform Configurations section for more information.
+
+Memory Tiers Creation
+=====================
+Memory tiers are a collection of NUMA nodes grouped by performance characteristics.
+During :code:`__init`, Linux initializes the system with a default memory tier that
+contains all nodes marked :code:`N_MEMORY`.
+
+:code:`memory_tier_init` is called at boot for all nodes with memory online by
+default. :code:`memory_tier_late_init` is called during late-init for nodes set up
+during driver configuration.
+
+Nodes are only marked :code:`N_MEMORY` if they have *online* memory.
+
+Tier membership can be inspected in ::
+
+  /sys/devices/virtual/memory_tiering/memory_tierN/nodelist
+  0-1
+
+If nodes are grouped which have a clear difference in performance, check the HMAT
+and CDAT information for the CXL nodes. All nodes default to the DRAM tier,
+unless HMAT/CDAT information is reported to the memory_tier component via
+`access_coordinates`.
+
+Contiguous Memory Allocation
+============================
+The contiguous memory allocator (CMA) enables reservation of contiguous memory
+regions on NUMA nodes during early boot. However, CMA cannot reserve memory
+on NUMA nodes that are not online during early boot. ::
+
+  void __init hugetlb_cma_reserve(int order) {
+    if (!node_online(nid))
+      /* do not allow reservations */
+  }
+
+This means if users intend to defer management of CXL memory to the driver, CMA
+cannot be used to guarantee huge page allocations. If enabling CXL memory as
+SystemRAM in `ZONE_NORMAL` during early boot, CMA reservations per-node can be
+made with the :code:`cma_pernuma` or :code:`numa_cma` kernel command line
+parameters.
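
The tier membership check described in the new early-boot.rst can be scripted. The following is a minimal sketch and is not part of the patch: it walks the sysfs path quoted in the document and expands each `nodelist` value (such as `0-1`) into node IDs. The function names `parse_nodelist` and `read_memory_tiers` are hypothetical, and the sketch assumes `nodelist` uses the usual comma-separated range format.

```python
from pathlib import Path

# sysfs location quoted in the documentation added by this patch.
TIER_ROOT = Path("/sys/devices/virtual/memory_tiering")


def parse_nodelist(text):
    """Expand a list string such as "0-1,4" into [0, 1, 4].

    Assumption: nodelist is a comma-separated list of single IDs and
    inclusive ranges.
    """
    nodes = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty nodelist yields no nodes
        if "-" in part:
            first, last = part.split("-", 1)
            nodes.extend(range(int(first), int(last) + 1))
        else:
            nodes.append(int(part))
    return nodes


def read_memory_tiers(root=TIER_ROOT):
    """Return {tier directory name: [node IDs]} for each memory_tierN under root."""
    tiers = {}
    for tier_dir in sorted(Path(root).glob("memory_tier*")):
        nodelist = tier_dir / "nodelist"
        if nodelist.is_file():
            tiers[tier_dir.name] = parse_nodelist(nodelist.read_text())
    return tiers


if __name__ == "__main__":
    # On a system without memory tiering support this prints nothing.
    for name, nodes in read_memory_tiers().items():
        print(f"{name}: nodes {nodes}")
```

If a CXL node appears in the same tier as DRAM nodes, that is the situation in which the document suggests checking the HMAT and CDAT information.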