summaryrefslogtreecommitdiff
path: root/drivers/misc/habanalabs/common
AgeCommit message (Collapse)AuthorFilesLines
2022-07-12habanalabs: Use the bitmap API to allocate bitmapsChristophe JAILLET1-3/+2
Use bitmap_zalloc()/bitmap_free() instead of hand-writing them. It is less verbose and it improves the semantic. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: make sure variable is set before usedOded Gabbay1-1/+1
timestamp could be unset in both _hl_interrupt_wait_ioctl() and _hl_interrupt_wait_ioctl_user_addr() so it is better to explicitly initialize it to 0 when declaring it. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: don't declare tmp twice in same functionOded Gabbay1-2/+2
tmp is declared in the scope of the function cs_do_release() and inside a block inside that function. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: do not set max power on a secured deviceOfir Bitton1-2/+4
Max power API is not supported in secured devices. Hence, we should skip setting it during boot. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: allow detection of unsupported f/w packetsOded Gabbay1-4/+8
If we send a packet to the f/w, and that packet is unsupported, we want to be able to identify this situation and possibly ignore this. Therefore, if the f/w returned an error, we need to propagate it to the callers in the result value, if those callers were interested in it. In addition, no point of printing the error code here because each caller prints its own error with a specific message. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: save f/w preboot minor versionSagiv Ozeri2-9/+44
We need this property for backward compatibility against the f/w. Signed-off-by: Sagiv Ozeri <sozeri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add support for common decoder interruptsOfir Bitton3-6/+11
User application should be able to get notification for any decoder completion. Hence, we introduce a new interface in which a user can wait for all current decoder pending interrupts. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: naming refactor of user interrupt flowOfir Bitton3-13/+13
Current naming convention can be misleading. Hence renaming some variables and defines in order to be more explicit. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: wait for preboot ready after hard resetOhad Sharabi2-27/+65
Currently we are not waiting for preboot ready after hard reset. This leads to a race in which COMMs protocol begins but will get no response from the f/w. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: enable gaudi2 code in driverOded Gabbay3-7/+63
Enable the Gaudi2 ASIC code in the pci probe callback of the driver so the driver will handle Gaudi2 ASICs. Add the PCI ID to the PCI table and add the ASIC enum value to all relevant places. Fixup the device parameters initialization for Gaudi2. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add gaudi2 MMU supportMoti Haimovski7-33/+1064
Gaudi2 has new MMU units. A PMMU for device->host accesses, and HMMU for HBM accesses. The page tables of both MMUs are located in the host's memory (referred to in the code as host-resident pgt). Signed-off-by: Moti Haimovski <mhaimovski@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add gaudi2 wait-for-CS supportOded Gabbay7-89/+210
In Gaudi2 we moved to a different wait for command submission completion model. Instead of receiving interrupt only on external queues, we use the device's sync manager to notify us when the entire command submission finishes. This enables us to remove the categorization of queues to external and internal, and treat each queue equally, without the need to parse and patch any command buffer. This change also requires refactoring to the IRQ handling of CS completions. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add generic security moduleOfir Bitton3-2/+671
As the ASICs become more complex and have many more registers, we need a better way to configure the security properties. As a reminder, we have two dedicated mechanisms for security: Range Registers and Protection bits. Those mechanisms protect sensitive memory and configuration areas inside the device. The generic module handles the low-level part of the configuration, because the configuration mechanism is identical in all ASICs. The difference is the address ranges and register names. Any ASIC that use this block should first block all the register blocks in the ASIC. Then, it should open only the registers that need to be accessed by the user (This is opposed to Goya and Gaudi, where we blocked only what should not be accesses by the user). The module contains several functions, to unblock single register, multiple registers, entire blocks, ranges, ranges with mask. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: remove obsolete device variables used for testingOded Gabbay4-43/+23
There are a couple of device variables that are used for testing purposes and they are set to fixed values. Remove the variables that are not relevant anymore and document the remaining variables. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: initialize new asic propertiesOded Gabbay1-7/+14
New asic properties were added for Gaudi2. We want to initialize and use them, when relevant, also for Goya and Gaudi. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add gaudi2 asic-specific codeOded Gabbay13-78/+437
Add the ASIC-specific code for Gaudi2. Supply (almost) all of the function callbacks that the driver's common code need to initialize, finalize and submit workloads to the Gaudi2 ASIC. It also contains the code to initialize the F/W of the Gaudi2 ASIC and to receive events from the F/W. It contains new debugfs entry to dump razwi events. razwi is a case where the device's engines create a transaction that reaches an invalid destination. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: remove redundant argument in access_dev_mem APIsOfir Bitton3-9/+7
Region structure is derived from region type, hence no need to pass it as an argument. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: communicate supported page sizes to userOhad Sharabi3-5/+6
Because in future ASICs the driver will allow the user to set the page size we need to make sure this data is propagated in all APIs. In addition, since this is already an ASIC property we no longer need ASIC function for it. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: remove dead code from free_device_memory()Tomer Tayar1-28/+22
free_device_memory() ends with if and else, each has a return statement, followed by another return statement that can never be reached. Restructure the function and remove this dead code. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: page size can only be a power of 2Ohad Sharabi2-5/+2
We dropped support for page sizes that are not power of 2. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: refactor dma asic-specific functionsOhad Sharabi6-67/+118
This is a pre-requisite patch for adding tracepoints to the DMA memory operations (allocation/free) in the driver. The main purpose is to be able to cross data with the map operations and determine whether memory violation occurred, for example free DMA allocation before unmapping it from device memory. To achieve this the DMA alloc/free code flows were refactored so that a single DMA tracepoint will catch many flows. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: set default value for memory_scrubDafna Hirschfeld1-0/+3
Set a default value for memory scrubbing Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: move call to scrub_device_mem after ctx_finiDafna Hirschfeld2-5/+14
In future ASICs, it would be possible to have a non-idle device when context is released. We thus need to postpone the scrubbing. Postpone it to hpriv release if reset is not executed or to device late init if reset is executed. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: don't send addr and size to scrub_device_mem cbDafna Hirschfeld2-3/+3
We use scrub_device_mem only to scrub the entire SRAM and entire DRAM. Therefore there is no need to send addr and size args to the callback. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: don't do memory scrubbing when unmappingDafna Hirschfeld1-30/+6
There is no need to do memory scrub when unmapping anymore as it is an overhead as long as we have a single user at any given time. Remove that code and change return value of free_phys_pg_pack to void Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: print if firmware is secured during loadOfir Bitton1-1/+2
For easier debug, it is desirable to have a simple way to know whether the device is secured or not, hence we dump this indication during boot. Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs/gaudi: fix a race condition causing DMAR errorYuri Nudelman2-0/+2
There is a rare race condition in CB completion mechanism, that can occur under a very high pressure of command submissions. The preconditions for this to happen are: 1. There should be enough command submissions for the pre-allocated patched CB pool to run out of commands. At this stage we start allocating new patched CBs as they arrive. 2. CB size has to be exactly (128*n + 104)B for some n, i.e. 24B below a cache line end. The flow: 1. Two command buffers being completed on different streams, at the same time. Denote those CB1 and CB2. 2. Each command buffer is injected with two messages, 16B each - one for a HBW update of the completion queue, another to raise interrupt. 3. Assume CB1 updated the completion queue and raise the interrupt. 4. Assume CB2 updated the completion queue but did not raise the interrupt yet. 5. The host receives the interrupt. It goes over the completion queue and sees two completions - CB1 and CB2. Release them both. 6. CB2 performs the last command. The problem is that the last command is split between 2 cache lines. So to read the last 8B of the last command, it has to access the host again. Problem is - CB2 is already released. This causes a DMAR error. The solution to this problem is simply to make sure the last two commands in the CB are always in the same cache line, using NOP padding. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: move memory_scrub_val to hdev structDafna Hirschfeld2-4/+4
move the field memory_scrub_val from struct hl_dbg_device_entry to struct hl_device. This is because we want to use this field also if debugfs is off. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: fix comment styleOded Gabbay1-1/+1
function name should not be preceded with @ Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: use kvcalloc when possibleOded Gabbay1-3/+2
kvcalloc is same as kvmalloc_array with GFP_ZERO. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: print pointer with correct modifierOded Gabbay1-2/+2
Use %p instead of %llx for printing pointers. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: check fence pointer before useOded Gabbay1-1/+1
fence pointer can be NULL in this path, as shown by an earlier check. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: remove unused get_dma_desc_list_sizeOded Gabbay1-3/+0
This asic callback function is not called anymore from the common code. The asic-specific function itself is called but from within the asic-specific code. Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: fix NULL dereference on cs timeoutYuri Nudelman1-2/+2
Device descriptor is accessed before an assignment Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add validity check for cq counter offsetfarah kassabri1-1/+8
Driver performs no validity check for the user cq counter offset used in both wait_for_interrupt and register_for_timestamp APIs. Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: avoid unnecessary error printDani Liberman1-1/+8
When sending a packet to FW right after it made reset, we will get packet timeout. Since it is expected behavior, we don't need to print an error in such case. Hence, when driver is in hard reset it will avoid from printing error messages about packet timeout. Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: send an event notification when CS timeout occursTal Cohen1-9/+17
The Driver needs to inform the User process whenever one of its CS is timed out. The Driver shall recognize the CS timeout and shall send an eventfd notification, towards user space, whenever a timeout is expired on a CS. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: expose undefined opcode status via info ioctlTal Cohen1-0/+25
The info ioctl retrieves information on the last undefined opcode occurred. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs/gaudi: collect undefined opcode error infoTal Cohen3-9/+45
when an undefined opcode error occurres, the driver collects the relevant information from the Qman and stores it inside the hdev data structure. An event fd indication is sent towards the user space. Note: another commit shall be followed which will add support to read the error info by an ioctl. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: fix race between hl_get_compute_ctx() and hl_ctx_put()Tomer Tayar4-25/+47
hl_get_compute_ctx() is used to get the pointer to the compute context from the hpriv object. The function is called in code paths that are not necessarily initiated by user, so it is possible that a context release process will happen in parallel. This can lead to a race condition in which hl_get_compute_ctx() retrieves the context pointer, and just before it increments the context refcount, the context object is released and a freed memory is accessed. To avoid this race, add a mutex to protect the context pointer in hpriv. With this lock, hl_get_compute_ctx() will be able to detect if the context has been released or is about to be released. struct hl_ctx_mgr has a mutex for contexts IDR with a similar "ctx_lock" name, so rename it to just "lock" to avoid a confusion with the new lock. Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: keep a record of completed CS outcomesYuri Nudelman3-12/+147
Often, the user is not interested in the completion timestamp of all command submissions. A common situation is, for example, when the user submits a burst of, possibly, several thousands of commands, then request the completion timestamp of only couple of specific key commands from all the burst. The problem is that currently, the outcome of the early commands may be lost, due to a large amount of later commands, that the user does not really care about. This patch creates a separate store with the outcomes of commands the user has mark explicitly as interested in. This store does not mix the marked commands with the unmarked ones, hence the data there will survive for much longer. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: change the write flag name of error info structsTal Cohen3-10/+10
positive flags naming will make more clear code while adding more 'error info' structures Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: add terminating NULL to attrs arraysDafna Hirschfeld1-0/+2
Arrays of struct attribute are expected to be NULL terminated. This is required by API methods such as device_add_groups. This fixes a crash when loading the driver for Goya device. Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: Fix kernel-docJiapeng Chong1-2/+2
Fix the following W=1 kernel warnings: drivers/misc/habanalabs/common/mmu/mmu_v1.c:425: warning: expecting prototype for hl_mmu_fini(). Prototype was for hl_mmu_v1_fini() instead. drivers/misc/habanalabs/common/mmu/mmu_v1.c:449: warning: expecting prototype for hl_mmu_ctx_init(). Prototype was for hl_mmu_v1_ctx_init() instead. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: Fix kernel-docJiapeng Chong1-1/+1
Fix the following W=1 kernel warnings: drivers/misc/habanalabs/common/pci/pci.c:454: warning: expecting prototype for hl_fw_fini(). Prototype was for hl_pci_fini() instead. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-07-12habanalabs: fix double unlock on error in map_device_va()Dan Carpenter1-4/+2
If hl_mmu_prefetch_cache_range() fails then this code calls mutex_unlock(&ctx->mmu_lock) when it's no longer holding the mutex. Fixes: 9e495e24003e ("habanalabs: do MMU prefetch as deferred work") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2022-05-22habanalabs: use separate structure info for each error collect dataTal Cohen4-44/+56
Create separate info structure for each error type. The structures shall be used inside the large structure that contains the last session error. This is more scalable for adding more errors in the future. Signed-off-by: Tal Cohen <talcohen@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22habanalabs: fix missing handle shift during mmapYuri Nudelman1-2/+2
During mmap operation on the unified memory manager buffer, the vma page offset is shifted to extract the handle value. Due to a typo, it was not shifted back at the end. That could cause the offset to be modified after mmap operation, that may affect subsequent operations. In addition, in allocation flow, in case of out of memory error, idr would not be correctly destroyed, again because of a missing shift. Signed-off-by: Yuri Nudelman <ynudelman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22habanalabs: remove hdev from hl_ctx_get argsOhad Sharabi6-13/+13
This argument is unused by the function. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-22habanalabs: do MMU prefetch as deferred workOhad Sharabi4-22/+101
When user requests to prefetch the MMU translations, the driver will not block the user until prefetch is done. Instead, the prefetch work will be delegated to a WQ which will do it in the background. This way, the prefetch may progress without blocking the user at all. Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>