From 86efc62d031307e53ad4011e0aa8898e029cef47 Mon Sep 17 00:00:00 2001
From: Gregory Price
Date: Fri, 4 Oct 2024 12:28:28 -0400
Subject: PCI/DOE: Poll DOE Busy bit for up to 1 second in pci_doe_send_req()

During initial device probe, the PCI DOE busy bit for some CXL devices may
be left set for a longer period than expected by the current driver logic.
Despite local comments stating DOE Busy is unlikely to be detected, it
appears commonly specifically during boot when CXL devices are being
probed.

The symptom was messages like this:

  endpoint6: DOE failed -EBUSY

produced by cxl_cdat_get_length() or cxl_cdat_read_table().

This was observed on a single socket AMD platform with 2 CXL memory
expanders attached to the single socket. It was not the case that
concurrent accesses were being made, as validated by monitoring mailbox
commands on the device side.

This behavior has been observed with multiple CXL memory expanders from
different vendors - so it appears unrelated to the model.

In all observed tests, only a small period of the retry window is actually
used - typically only a handful of loop iterations.

Polling on the PCI DOE Busy Bit for (at max) one PCI DOE timeout interval
(1 second), resolves this issue cleanly.

Per PCIe r6.2 sec 6.30.3, the DOE Busy Bit being cleared does not raise an
interrupt, so polling is the best option in this scenario.

Subsequent code in doe_statemachine_work() and abort paths also wait for
up to 1 PCI DOE timeout interval, so this order of (potential) additional
delay is presumed acceptable.

Suggested-by: Lukas Wunner
Link: https://lore.kernel.org/r/20241004162828.314-1-gourry@gourry.net
Signed-off-by: Gregory Price
[bhelgaas: fix nits and add error message to commit log]
Signed-off-by: Bjorn Helgaas
Reviewed-by: Lukas Wunner
Reviewed-by: Jonathan Cameron
---
 drivers/pci/doe.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

(limited to 'drivers/pci')

diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c
index 652d63df9d22..7bd7892c5222 100644
--- a/drivers/pci/doe.c
+++ b/drivers/pci/doe.c
@@ -146,6 +146,7 @@ static int pci_doe_send_req(struct pci_doe_mb *doe_mb,
 {
 	struct pci_dev *pdev = doe_mb->pdev;
 	int offset = doe_mb->cap_offset;
+	unsigned long timeout_jiffies;
 	size_t length, remainder;
 	u32 val;
 	int i;
@@ -155,8 +156,19 @@ static int pci_doe_send_req(struct pci_doe_mb *doe_mb,
 	 * someone other than Linux (e.g. firmware) is using the mailbox. Note
 	 * it is expected that firmware and OS will negotiate access rights via
 	 * an, as yet to be defined, method.
+	 *
+	 * Wait up to one PCI_DOE_TIMEOUT period to allow the prior command to
+	 * finish. Otherwise, simply error out as unable to field the request.
+	 *
+	 * PCIe r6.2 sec 6.30.3 states no interrupt is raised when the DOE Busy
+	 * bit is cleared, so polling here is our best option for the moment.
 	 */
-	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+	timeout_jiffies = jiffies + PCI_DOE_TIMEOUT;
+	do {
+		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+	} while (FIELD_GET(PCI_DOE_STATUS_BUSY, val) &&
+		 !time_after(jiffies, timeout_jiffies));
+
 	if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
 		return -EBUSY;
 
-- 
cgit v1.2.3
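
For readers who want to try the bounded-poll pattern from the hunk above
outside the kernel, here is a rough user-space sketch. It is not part of
the patch. Everything prefixed demo_ or DEMO_ is hypothetical: the fake
device, the Busy bit position, and the millisecond clock helper stand in
for pci_read_config_dword(), PCI_DOE_STATUS_BUSY, and jiffies. The only
thing it tries to mirror is the loop shape: read at least once, keep
re-reading while Busy is set and the deadline has not passed, then give
up with -EBUSY.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical Busy bit; not the real DOE Status register layout. */
#define DEMO_STATUS_BUSY 0x1u

/* Hypothetical fake device: reports Busy for the first busy_reads reads. */
struct demo_dev {
	unsigned int busy_reads;	/* Busy reads still to be returned */
	unsigned int reads;		/* total status reads performed */
};

/* Stand-in for a config-space status read. */
static uint32_t demo_read_status(struct demo_dev *dev)
{
	dev->reads++;
	if (dev->busy_reads > 0) {
		dev->busy_reads--;
		return DEMO_STATUS_BUSY;
	}
	return 0;
}

/* Monotonic clock in milliseconds; plays the role of jiffies here. */
static int64_t demo_now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Re-read status until Busy clears or the deadline passes. The status is
 * always read at least once, and the final value decides the result, so a
 * device that clears Busy on the last read before the deadline succeeds.
 */
static int demo_wait_not_busy(struct demo_dev *dev, int64_t timeout_ms)
{
	int64_t deadline;
	uint32_t val;

	assert(dev != NULL);

	deadline = demo_now_ms() + timeout_ms;
	do {
		val = demo_read_status(dev);
	} while ((val & DEMO_STATUS_BUSY) && demo_now_ms() <= deadline);

	return (val & DEMO_STATUS_BUSY) ? -EBUSY : 0;
}
```

With a device that stays Busy for a few reads, demo_wait_not_busy()
returns 0 after a handful of iterations, which matches the behavior the
commit log describes. With a device that never clears Busy, it returns
-EBUSY once the timeout has elapsed.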