vfio/nvgrace-gpu: Check the HBM training and C2C link status - kernel/linux.git


author	Ankit Agrawal <ankita@nvidia.com>	2025-01-24 21:31:01 +0300
committer	Alex Williamson <alex.williamson@redhat.com>	2025-01-27 19:43:33 +0300
commit	d85f69d520e6aca8ae5ce353666e2fc2756eb9e7 (patch)
tree	72e90c9c541b34fe184030d200c8582106afa5b9 /tools/perf/scripts/python/gecko.py
parent	6a9eb2d125ba90d13b45bcfabcddf9f61268f6a8 (diff)
download	linux-d85f69d520e6aca8ae5ce353666e2fc2756eb9e7.tar.xz

vfio/nvgrace-gpu: Check the HBM training and C2C link status

In contrast to Grace Hopper systems, the HBM training has been moved out of the UEFI on the Grace Blackwell systems. This reduces the system bootup time significantly. The onus of checking whether the HBM training has completed thus falls on the module.

The HBM training status can be determined from a BAR0 register. Similarly, another BAR0 register exposes the status of the CPU-GPU chip-to-chip (C2C) cache coherent interconnect.

Based on testing, 30s is determined to be sufficient to ensure initialization completion on all the Grace based systems. Thus poll these registers for up to 30s. If the HBM training is not complete or if the C2C link is not ready by then, fail the probe.

While the wait is not required on Grace Hopper systems, it is beneficial to make the check to ensure the device is in an expected state. Hence keep it generalized to both generations.

Ensure that the BAR0 is enabled before accessing the registers.

CC: Alex Williamson <alex.williamson@redhat.com>
CC: Kevin Tian <kevin.tian@intel.com>
CC: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20250124183102.3976-4-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
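The polling scheme described in the message can be sketched in userspace C as follows. This is only an illustration under stated assumptions, not the code from the patch: the ready value, the 1s poll quantum, and every identifier (`wait_device_ready`, `struct poll_ops`, `simulate_probe`, the fake device) are hypothetical, and the two callbacks stand in for reads of the HBM training and C2C link status registers in BAR0. Only the 30s budget comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values for illustration; only the 30s budget is from the commit message. */
#define STATUS_READY    0xFFu
#define POLL_TIMEOUT_MS 30000u
#define POLL_QUANTUM_MS 1000u

/* Callbacks stand in for the two BAR0 status register reads and for sleeping. */
struct poll_ops {
	uint32_t (*read_hbm_status)(void *ctx);
	uint32_t (*read_c2c_status)(void *ctx);
	void (*sleep_ms)(void *ctx, unsigned int ms);
	void *ctx;
};

/*
 * Poll both status sources until each reports ready or the time budget
 * is used up. Returns 0 when both are ready, -1 on timeout (the caller
 * would then fail the probe).
 */
static int wait_device_ready(const struct poll_ops *ops,
			     unsigned int timeout_ms, unsigned int quantum_ms)
{
	unsigned int waited = 0;

	assert(quantum_ms > 0);
	for (;;) {
		if (ops->read_hbm_status(ops->ctx) == STATUS_READY &&
		    ops->read_c2c_status(ops->ctx) == STATUS_READY)
			return 0;
		if (waited >= timeout_ms)
			return -1;
		ops->sleep_ms(ops->ctx, quantum_ms);
		waited += quantum_ms;
	}
}

/* Simulated device: both statuses become ready once ready_after_ms has elapsed. */
struct fake_dev {
	unsigned int now_ms;
	unsigned int ready_after_ms;
};

static uint32_t fake_read(void *ctx)
{
	struct fake_dev *d = ctx;

	return d->now_ms >= d->ready_after_ms ? STATUS_READY : 0;
}

static void fake_sleep(void *ctx, unsigned int ms)
{
	struct fake_dev *d = ctx;

	d->now_ms += ms; /* advance simulated time instead of really sleeping */
}

/* Run the poll loop against a fake device that becomes ready after the given delay. */
int simulate_probe(unsigned int ready_after_ms)
{
	struct fake_dev dev = { 0, ready_after_ms };
	struct poll_ops ops = { fake_read, fake_read, fake_sleep, &dev };

	return wait_device_ready(&ops, POLL_TIMEOUT_MS, POLL_QUANTUM_MS);
}
```

In this sketch a device that becomes ready within the budget lets the probe continue, while one that needs longer makes `simulate_probe` return -1. The status is checked once more after the final sleep, so a device that becomes ready exactly at the deadline is still accepted. A real driver would additionally make sure BAR0 memory access is enabled before reading the registers, as the message notes.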


