From 639781dcab8261f39c7028db4ed4fd0e760d69fa Mon Sep 17 00:00:00 2001
From: Oded Gabbay
Date: Fri, 2 Apr 2021 01:43:18 +0300
Subject: habanalabs/gaudi: add debugfs to DMA from the device

When trying to debug program, the user often needs to dump large parts
of the device's DRAM, which can reach to tens of GBs. Because reading
from the device's internal memory through the PCI BAR is extremely slow,
the debug can take hours.

Instead, we can provide the user to copy data through one of the DMA
engines. This will make the operation much faster. Currently, only
GAUDI is supported.

In GAUDI, we need to find a PCI DMA engine that is IDLE and set the
DMA as secured to be able to bypass our MMU as we currently don't map
the temporary buffer to the MMU.

Example bash one-line to dump entire HBM to file (~2 minutes):

for (( i=0x0; i < 0x800000000; i+=0x8000000 )); do \
printf '0x%x\n' $i | sudo tee /sys/kernel/debug/habanalabs/hl0/addr ; \
echo 0x8000000 | sudo tee /sys/kernel/debug/habanalabs/hl0/dma_size ; \
sudo cat /sys/kernel/debug/habanalabs/hl0/data_dma >> hbm.txt ; done

Signed-off-by: Oded Gabbay
---
 .../ABI/testing/debugfs-driver-habanalabs | 68 +++++++++++++++++-----
 1 file changed, 53 insertions(+), 15 deletions(-)

(limited to 'Documentation/ABI')

diff --git a/Documentation/ABI/testing/debugfs-driver-habanalabs b/Documentation/ABI/testing/debugfs-driver-habanalabs
index f9e233cbdc37..c78fc9282876 100644
--- a/Documentation/ABI/testing/debugfs-driver-habanalabs
+++ b/Documentation/ABI/testing/debugfs-driver-habanalabs
@@ -82,6 +82,24 @@ Description: Allows the root user to read or write 64 bit data directly
                 If the IOMMU is disabled, it also allows the root user to read
                 or write from the host a device VA of a host mapped memory
 
+What:           /sys/kernel/debug/habanalabs/hl<n>/data_dma
+Date:           Apr 2021
+KernelVersion:  5.13
+Contact:        ogabbay@kernel.org
+Description:    Allows the root user to read from the device's internal
+                memory (DRAM/SRAM) through a DMA engine.
+                This property is a binary blob that contains the result of the
+                DMA transfer.
+                This custom interface is needed (instead of using the generic
+                Linux user-space PCI mapping) because the amount of internal
+                memory is huge (>32GB) and reading it via the PCI bar will take
+                a very long time.
+                This interface doesn't support concurrency in the same device.
+                In GAUDI and GOYA, this action can cause undefined behavior
+                in case the it is done while the device is executing user
+                workloads.
+                Only supported on GAUDI at this stage.
+
 What:           /sys/kernel/debug/habanalabs/hl<n>/device
 Date:           Jan 2019
 KernelVersion:  5.1
@@ -90,6 +108,24 @@ Description: Enables the root user to set the device to specific state.
                 Valid values are "disable", "enable", "suspend", "resume".
                 User can read this property to see the valid values
 
+What:           /sys/kernel/debug/habanalabs/hl<n>/dma_size
+Date:           Apr 2021
+KernelVersion:  5.13
+Contact:        ogabbay@kernel.org
+Description:    Specify the size of the DMA transaction when using DMA to read
+                from the device's internal memory. The value can not be larger
+                than 128MB. Writing to this value initiates the DMA transfer.
+                When the write is finished, the user can read the "data_dma"
+                blob
+
+What:           /sys/kernel/debug/habanalabs/hl<n>/dump_security_violations
+Date:           Jan 2021
+KernelVersion:  5.12
+Contact:        ogabbay@kernel.org
+Description:    Dumps all security violations to dmesg. This will also ack
+                all security violations meanings those violations will not be
+                dumped next time user calls this API
+
 What:           /sys/kernel/debug/habanalabs/hl<n>/engines
 Date:           Jul 2019
 KernelVersion:  5.3
@@ -154,6 +190,16 @@ Description: Displays the hop values and physical address for a given ASID
                 e.g. to display info about VA 0x1000 for ASID 1 you need to do:
                 echo "1 0x1000" > /sys/kernel/debug/habanalabs/hl0/mmu
 
+What:           /sys/kernel/debug/habanalabs/hl<n>/mmu_error
+Date:           Mar 2021
+KernelVersion:  5.12
+Contact:        fkassabri@habana.ai
+Description:    Check and display page fault or access violation mmu errors for
+                all MMUs specified in mmu_cap_mask.
+                e.g. to display error info for MMU hw cap bit 9, you need to do:
+                echo "0x200" > /sys/kernel/debug/habanalabs/hl0/mmu_error
+                cat /sys/kernel/debug/habanalabs/hl0/mmu_error
+
 What:           /sys/kernel/debug/habanalabs/hl<n>/set_power_state
 Date:           Jan 2019
 KernelVersion:  5.1
@@ -161,6 +207,13 @@ Contact: ogabbay@kernel.org
 Description:    Sets the PCI power state. Valid values are "1" for D0 and "2"
                 for D3Hot
 
+What:           /sys/kernel/debug/habanalabs/hl<n>/stop_on_err
+Date:           Mar 2020
+KernelVersion:  5.6
+Contact:        ogabbay@kernel.org
+Description:    Sets the stop-on_error option for the device engines. Value of
+                "0" is for disable, otherwise enable.
+
 What:           /sys/kernel/debug/habanalabs/hl<n>/userptr
 Date:           Jan 2019
 KernelVersion:  5.1
@@ -175,18 +228,3 @@ KernelVersion: 5.1
 Contact:        ogabbay@kernel.org
 Description:    Displays a list with information about all the active virtual
                 address mappings per ASID and all user mappings of HW blocks
-
-What:           /sys/kernel/debug/habanalabs/hl<n>/stop_on_err
-Date:           Mar 2020
-KernelVersion:  5.6
-Contact:        ogabbay@kernel.org
-Description:    Sets the stop-on_error option for the device engines. Value of
-                "0" is for disable, otherwise enable.
-
-What:           /sys/kernel/debug/habanalabs/hl<n>/dump_security_violations
-Date:           Jan 2021
-KernelVersion:  5.12
-Contact:        ogabbay@kernel.org
-Description:    Dumps all security violations to dmesg. This will also ack
-                all security violations meanings those violations will not be
-                dumped next time user calls this API
-- 
cgit v1.2.3
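
The addr / dma_size / data_dma sequence described in the commit message can be wrapped in a small helper. The following is only a sketch: the function name dump_device_range and its arguments are hypothetical and not part of the driver, and it assumes the caller already has root access to debugfs. It also assumes the debugfs directory is passed in explicitly, which makes it possible to try the loop against an ordinary directory first. The 128MB check mirrors the limit documented for dma_size.

```shell
# Sketch only: dump_device_range and its arguments are hypothetical names.
# Usage: dump_device_range <debugfs_dir> <start> <end> <chunk> <out_file>
# The function body runs in a subshell, so its variables stay local.
dump_device_range() (
    dir=$1
    start=$(( $2 ))
    end=$(( $3 ))
    chunk=$(( $4 ))
    out=$5
    max_chunk=$(( 0x8000000 ))    # 128MB limit documented for dma_size

    if [ "$chunk" -le 0 ] || [ "$chunk" -gt "$max_chunk" ]; then
        echo "chunk must be between 1 and $max_chunk bytes" >&2
        exit 1
    fi

    : > "$out"                    # start from an empty output file
    addr=$start
    while [ "$addr" -lt "$end" ]; do
        # Select the device address to read from.
        printf '0x%x\n' "$addr" > "$dir/addr" || exit 1
        # Writing dma_size initiates the DMA transfer.
        printf '0x%x\n' "$chunk" > "$dir/dma_size" || exit 1
        # Read back the binary result and append it to the output file.
        cat "$dir/data_dma" >> "$out" || exit 1
        addr=$(( addr + chunk ))
    done
)
```

Run as root, the range from the commit message would correspond to something like `dump_device_range /sys/kernel/debug/habanalabs/hl0 0x0 0x800000000 0x8000000 hbm.bin`. Since the interface does not support concurrency on the same device, only one such loop should run per device at a time.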