From 17ab47d2d6d4b79734e3f2092b316fa5792eff97 Mon Sep 17 00:00:00 2001 From: Yuri Nudelman Date: Wed, 22 Jun 2022 12:52:34 +0300 Subject: habanalabs/gaudi: fix a race condition causing DMAR error There is a rare race condition in CB completion mechanism, that can occur under a very high pressure of command submissions. The preconditions for this to happen are: 1. There should be enough command submissions for the pre-allocated patched CB pool to run out of commands. At this stage we start allocating new patched CBs as they arrive. 2. CB size has to be exactly (128*n + 104)B for some n, i.e. 24B below a cache line end. The flow: 1. Two command buffers being completed on different streams, at the same time. Denote those CB1 and CB2. 2. Each command buffer is injected with two messages, 16B each - one for a HBW update of the completion queue, another to raise interrupt. 3. Assume CB1 updated the completion queue and raise the interrupt. 4. Assume CB2 updated the completion queue but did not raise the interrupt yet. 5. The host receives the interrupt. It goes over the completion queue and sees two completions - CB1 and CB2. Release them both. 6. CB2 performs the last command. The problem is that the last command is split between 2 cache lines. So to read the last 8B of the last command, it has to access the host again. Problem is - CB2 is already released. This causes a DMAR error. The solution to this problem is simply to make sure the last two commands in the CB are always in the same cache line, using NOP padding. Signed-off-by: Yuri Nudelman Reviewed-by: Oded Gabbay Signed-off-by: Oded Gabbay --- drivers/misc/habanalabs/goya/goya.c | 4 ++-- drivers/misc/habanalabs/goya/goyaP.h | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) (limited to 'drivers/misc/habanalabs/goya') diff --git a/drivers/misc/habanalabs/goya/goya.c b/drivers/misc/habanalabs/goya/goya.c index 64590fc55dc9..40c082cafbd7 100644 --- a/drivers/misc/habanalabs/goya/goya.c +++ b/drivers/misc/habanalabs/goya/goya.c @@ -4166,8 +4166,8 @@ int goya_cs_parser(struct hl_device *hdev, struct hl_cs_parser *parser) } void goya_add_end_of_cb_packets(struct hl_device *hdev, void *kernel_address, - u32 len, u64 cq_addr, u32 cq_val, u32 msix_vec, - bool eb) + u32 len, u32 original_len, u64 cq_addr, u32 cq_val, + u32 msix_vec, bool eb) { struct packet_msg_prot *cq_pkt; u32 tmp; diff --git a/drivers/misc/habanalabs/goya/goyaP.h b/drivers/misc/habanalabs/goya/goyaP.h index 647f57402616..54b5b6125df5 100644 --- a/drivers/misc/habanalabs/goya/goyaP.h +++ b/drivers/misc/habanalabs/goya/goyaP.h @@ -230,8 +230,8 @@ void goya_handle_eqe(struct hl_device *hdev, struct hl_eq_entry *eq_entry); void *goya_get_events_stat(struct hl_device *hdev, bool aggregate, u32 *size); void goya_add_end_of_cb_packets(struct hl_device *hdev, void *kernel_address, - u32 len, u64 cq_addr, u32 cq_val, u32 msix_vec, - bool eb); + u32 len, u32 original_len, u64 cq_addr, u32 cq_val, + u32 msix_vec, bool eb); int goya_cs_parser(struct hl_device *hdev, struct hl_cs_parser *parser); int goya_scrub_device_mem(struct hl_device *hdev, u64 addr, u64 size); void *goya_get_int_queue_base(struct hl_device *hdev, u32 queue_id, -- cgit v1.2.3