<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/drivers/pci/pcie/err.c, branch v7.1</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=v7.1</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=v7.1'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2025-08-13T18:20:56+00:00</updated>
<entry>
<title>PCI/ERR: Update device error_state already after reset</title>
<updated>2025-08-13T18:20:56+00:00</updated>
<author>
<name>Lukas Wunner</name>
<email>lukas@wunner.de</email>
</author>
<published>2025-08-13T05:11:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=45bc82563d5505327d97963bc54d3709939fa8f8'/>
<id>urn:sha1:45bc82563d5505327d97963bc54d3709939fa8f8</id>
<content type='text'>
After a Fatal Error has been reported by a device and has been recovered
through a Secondary Bus Reset, AER updates the device's error_state to
pci_channel_io_normal before invoking its driver's -&gt;resume() callback.

By contrast, EEH updates the error_state earlier, namely after resetting
the device and before invoking its driver's -&gt;slot_reset() callback.
Commit c58dc575f3c8 ("powerpc/pseries: Set error_state to
pci_channel_io_normal in eeh_report_reset()") explains in great detail
that the earlier invocation is necessitated by various drivers checking
accessibility of the device with pci_channel_offline() and avoiding
accesses if it returns true.  It returns true for any other error_state
than pci_channel_io_normal.

The device should be accessible already after reset, hence the reasoning
is that it's safe to update the error_state immediately afterwards.

This deviation between AER and EEH seems problematic because drivers
behave differently depending on which error recovery mechanism the
platform uses.  Three drivers have gone so far as to update the
error_state themselves, presumably to work around AER's behavior.

For consistency, amend AER to update the error_state at the same recovery
steps as EEH.  Drop the now unnecessary workaround from the three drivers.

Keep updating the error_state before -&gt;resume() in case -&gt;error_detected()
or -&gt;mmio_enabled() return PCI_ERS_RESULT_RECOVERED, which causes
-&gt;slot_reset() to be skipped.  There are drivers doing this even for Fatal
Errors, e.g. mhi_pci_error_detected().

Signed-off-by: Lukas Wunner &lt;lukas@wunner.de&gt;
Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Link: https://patch.msgid.link/4517af6359ffb9d66152b827a5d2833459144e3f.1755008151.git.lukas@wunner.de
</content>
</entry>
<entry>
<title>PCI/ERR: Notify drivers on failure to recover</title>
<updated>2025-08-13T18:20:46+00:00</updated>
<author>
<name>Lukas Wunner</name>
<email>lukas@wunner.de</email>
</author>
<published>2025-08-13T05:11:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=9011f0667c93df2034f2ff33eae37ddf9bdd0ef9'/>
<id>urn:sha1:9011f0667c93df2034f2ff33eae37ddf9bdd0ef9</id>
<content type='text'>
According to Documentation/PCI/pci-error-recovery.rst, the following shall
occur on failure to recover from a PCIe Uncorrectable Error:

  STEP 6: Permanent Failure
  -------------------------
  A "permanent failure" has occurred, and the platform cannot recover
  the device.  The platform will call error_detected() with a
  pci_channel_state_t value of pci_channel_io_perm_failure.

  The device driver should, at this point, assume the worst. It should
  cancel all pending I/O, refuse all new I/O, returning -EIO to
  higher layers. The device driver should then clean up all of its
  memory and remove itself from kernel operations, much as it would
  during system shutdown.

Sathya notes that AER does not call error_detected() on failure and thus
deviates from the document (as well as EEH, for which the document was
originally added).

Most drivers do nothing on permanent failure, but the SCSI drivers and a
number of Ethernet drivers do take advantage of the notification to flush
queues and give up resources.

Amend AER to notify such drivers and align with the documentation and EEH.

Link: https://lore.kernel.org/r/f496fc0f-64d7-46a4-8562-dba74e31a956@linux.intel.com/
Suggested-by: Sathyanarayanan Kuppuswamy &lt;sathyanarayanan.kuppuswamy@linux.intel.com&gt;
Signed-off-by: Lukas Wunner &lt;lukas@wunner.de&gt;
Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Link: https://patch.msgid.link/ec212d4d4f5c65d29349df33acdc9768ff8279d1.1755008151.git.lukas@wunner.de
</content>
</entry>
<entry>
<title>PCI/ERR: Fix uevent on failure to recover</title>
<updated>2025-08-13T18:20:35+00:00</updated>
<author>
<name>Lukas Wunner</name>
<email>lukas@wunner.de</email>
</author>
<published>2025-08-13T05:11:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=1cbc5e25fb70e942a7a735a1f3d6dd391afc9b29'/>
<id>urn:sha1:1cbc5e25fb70e942a7a735a1f3d6dd391afc9b29</id>
<content type='text'>
Upon failure to recover from a PCIe error through AER, DPC or EDR, a
uevent is sent to inform user space about disconnection of the bridge
whose subordinate devices failed to recover.

However the bridge itself is not disconnected.  Instead, a uevent should
be sent for each of the subordinate devices.

Only if the "bridge" happens to be a Root Complex Event Collector or
Integrated Endpoint does it make sense to send a uevent for it (because
there are no subordinate devices).

Right now if there is a mix of subordinate devices with and without
pci_error_handlers, a BEGIN_RECOVERY event is sent for those with
pci_error_handlers but no FAILED_RECOVERY event is ever sent for them
afterwards.  Fix it.

Fixes: 856e1eb9bdd4 ("PCI/AER: Add uevents in AER and EEH error/resume")
Signed-off-by: Lukas Wunner &lt;lukas@wunner.de&gt;
Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Cc: stable@vger.kernel.org  # v4.16+
Link: https://patch.msgid.link/68fc527a380821b5d861dd554d2ce42cb739591c.1755008151.git.lukas@wunner.de
</content>
</entry>
<entry>
<title>PCI/AER: Allow drivers to opt in to Bus Reset on Non-Fatal Errors</title>
<updated>2025-08-13T18:20:26+00:00</updated>
<author>
<name>Lukas Wunner</name>
<email>lukas@wunner.de</email>
</author>
<published>2025-08-13T05:11:01+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=d0a2dee7d4585a7ebab45bf63eebeb8b6944e88f'/>
<id>urn:sha1:d0a2dee7d4585a7ebab45bf63eebeb8b6944e88f</id>
<content type='text'>
When Advanced Error Reporting was introduced in September 2006 by commit
6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver"), it
sought to adhere to the recovery flow and callbacks specified in
Documentation/PCI/pci-error-recovery.rst.

That document had been added in January 2006, when Enhanced Error Handling
(EEH) was introduced for PowerPC with commit 065c6359071c ("[PATCH] PCI
Error Recovery: documentation").

However the AER driver deviates from the document in that it never
performs a Secondary Bus Reset on Non-Fatal Errors, but always on Fatal
Errors.  By contrast, EEH allows drivers to opt in or out of a Bus Reset
regardless of error severity, by returning PCI_ERS_RESULT_NEED_RESET or
PCI_ERS_RESULT_CAN_RECOVER from their -&gt;error_detected() callback.  If all
drivers agree that they can recover without a Bus Reset, EEH skips it.
Should one of them request a Bus Reset, it overrides all other drivers.

This inconsistency between EEH and AER seems problematic because drivers
need to be aware of and cope with it.

The file Documentation/PCI/pcieaer-howto.rst hints at a rationale for
always performing a Bus Reset on Fatal Errors:  "Fatal errors [...] cause
the link to be unreliable.  [...] This [reset_link] callback is used to
reset the PCIe physical link when a fatal error happens.  If an error
message indicates a fatal error, [...] performing link reset at upstream
is necessary."

There's no such rationale provided for never performing a Bus Reset on
Non-Fatal Errors.

The "xe" driver has a need to attempt a reset of local units on graphics
cards upon a Non-Fatal Error.  If that is insufficient for recovery, the
driver wants to opt in to a Bus Reset.

Accommodate such use cases and align AER more closely with EEH by
performing a Bus Reset in pcie_do_recovery() if drivers request it and the
faulting device's channel_state is pci_channel_io_normal.  The AER driver
sets this channel_state for Non-Fatal Errors.  For Fatal Errors, it uses
pci_channel_io_frozen.

This limits the deviation from Documentation/PCI/pci-error-recovery.rst
and EEH to the unconditional Bus Reset on Fatal Errors.

pcie_do_recovery() is also invoked by the Downstream Port Containment and
Error Disconnect Recover drivers.  They both set the channel_state to
pci_channel_io_frozen, hence pcie_do_recovery() continues to always invoke
the -&gt;reset_subordinates() callback in their case.  That is necessary
because the callback brings the link back up at the containing Downstream
Port.

There are two behavioral changes resulting from this commit:

First, if channel_state is pci_channel_io_normal and one of the affected
drivers returns PCI_ERS_RESULT_NEED_RESET from its -&gt;error_detected()
callback, a Bus Reset will now be performed.  There are drivers doing this
and although it would be possible to avoid a behavioral change by letting
them return PCI_ERS_RESULT_CAN_RECOVER instead, the impression I got from
examination of all drivers is that they actually expect or want a Bus
Reset (cxl_error_detected() is a case in point).  In any case, if they can
cope with a Bus Reset on Fatal Errors, they shouldn't have issues with a
Bus Reset on Non-Fatal Errors.

Second, if channel_state is pci_channel_io_frozen and all affected drivers
return PCI_ERS_RESULT_CAN_RECOVER from -&gt;error_detected(), their
-&gt;mmio_enabled() callback is now invoked prior to performing a Bus Reset,
instead of afterwards.  This actually makes sense:  For example,
drivers/scsi/sym53c8xx_2/sym_glue.c dumps debug registers in its
-&gt;mmio_enabled() callback.  Doing so after reset right now captures the
post-reset state instead of the faulting state, which is useless.

There is only one other driver which implements -&gt;mmio_enabled() and
returns PCI_ERS_RESULT_CAN_RECOVER from -&gt;error_detected() for
channel_state pci_channel_io_frozen, drivers/scsi/ipr.c (IBM Power RAID).
It appears to only be used on EEH platforms.  So the second behavioral
change is limited to these two drivers.

Signed-off-by: Lukas Wunner &lt;lukas@wunner.de&gt;
Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Link: https://patch.msgid.link/28fd805043bb57af390168d05abb30898cf4fc58.1755008151.git.lukas@wunner.de
</content>
</entry>
<entry>
<title>PCI/ERR: Remove misleading TODO regarding kernel panic</title>
<updated>2025-05-30T17:21:19+00:00</updated>
<author>
<name>Manivannan Sadhasivam</name>
<email>manivannan.sadhasivam@linaro.org</email>
</author>
<published>2025-05-08T07:10:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b06d125e6280603a34d9064cd9c12748ca2edb04'/>
<id>urn:sha1:b06d125e6280603a34d9064cd9c12748ca2edb04</id>
<content type='text'>
A PCI device is just another peripheral in a system. So failure to
recover it, must not result in a kernel panic. So remove the TODO which
is quite misleading.

Signed-off-by: Manivannan Sadhasivam &lt;manivannan.sadhasivam@linaro.org&gt;
Signed-off-by: Krzysztof Wilczyński &lt;kwilczynski@kernel.org&gt;
Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Reviewed-by: Wilfred Mallawa &lt;wilfred.mallawa@wdc.com&gt;
Link: https://patch.msgid.link/20250508-pcie-reset-slot-v4-1-7050093e2b50@linaro.org
</content>
</entry>
<entry>
<title>PCI/ERR: Cleanup misleading indentation inside if conditions</title>
<updated>2024-04-29T15:33:29+00:00</updated>
<author>
<name>Ilpo Järvinen</name>
<email>ilpo.jarvinen@linux.intel.com</email>
</author>
<published>2024-04-29T09:47:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=c9758cc45c2bd442743848eba88e187bda3047f9'/>
<id>urn:sha1:c9758cc45c2bd442743848eba88e187bda3047f9</id>
<content type='text'>
A few if conditions align misleadingly with the following code block.  The
checks are really cascading NULL checks that fit into 80 chars so remove
newlines in between and realign to the if condition indent.

Link: https://lore.kernel.org/r/20240429094707.2529-1-ilpo.jarvinen@linux.intel.com
Signed-off-by: Ilpo Järvinen &lt;ilpo.jarvinen@linux.intel.com&gt;
Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
</content>
</entry>
<entry>
<title>PCI/AER: Block runtime suspend when handling errors</title>
<updated>2024-03-07T23:58:07+00:00</updated>
<author>
<name>Stanislaw Gruszka</name>
<email>stanislaw.gruszka@linux.intel.com</email>
</author>
<published>2024-02-12T12:01:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=002bf2fbc00e5c4b95fb167287e2ae7d1973281e'/>
<id>urn:sha1:002bf2fbc00e5c4b95fb167287e2ae7d1973281e</id>
<content type='text'>
PM runtime can be done simultaneously with AER error handling.  Avoid that
by using pm_runtime_get_sync() before and pm_runtime_put() after reset in
pcie_do_recovery() for all recovering devices.

pm_runtime_get_sync() will increase dev-&gt;power.usage_count counter to
prevent any possible future request to runtime suspend a device.  It will
also resume a device, if it was previously in D3hot state.

I tested with igc device by doing simultaneous aer_inject and rpm
suspend/resume via /sys/bus/pci/devices/PCI_ID/power/control and can
reproduce:

  igc 0000:02:00.0: not ready 65535ms after bus reset; giving up
  pcieport 0000:00:1c.2: AER: Root Port link has been reset (-25)
  pcieport 0000:00:1c.2: AER: subordinate device reset failed
  pcieport 0000:00:1c.2: AER: device recovery failed
  igc 0000:02:00.0: Unable to change power state from D3hot to D0, device inaccessible

The problem disappears when this patch is applied.

Link: https://lore.kernel.org/r/20240212120135.146068-1-stanislaw.gruszka@linux.intel.com
Signed-off-by: Stanislaw Gruszka &lt;stanislaw.gruszka@linux.intel.com&gt;
Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Reviewed-by: Kuppuswamy Sathyanarayanan &lt;sathyanarayanan.kuppuswamy@linux.intel.com&gt;
Acked-by: Rafael J. Wysocki &lt;rafael@kernel.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
</content>
</entry>
<entry>
<title>PCI/ERR: Recognize disconnected devices in report_error_detected()</title>
<updated>2022-06-08T20:08:40+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2022-06-01T07:40:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=5e69a33c5cec019dd8ca46e31749c6dc78f7cbf3'/>
<id>urn:sha1:5e69a33c5cec019dd8ca46e31749c6dc78f7cbf3</id>
<content type='text'>
When a device is already unplugged by pciehp by the time the AER handler is
invoked, the PCIe device will already be in the pci_channel_io_perm_failure
state.  In that case simply return PCI_ERS_RESULT_DISCONNECT instead of
trying to do a state transition that will fail.

Also untangle the state transition failure from the lack of methods to
improve the debugging output in case it happens again.

Link: https://lore.kernel.org/r/20220601074024.3481035-1-hch@lst.de
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
Reviewed-by: Kuppuswamy Sathyanarayanan &lt;sathyanarayanan.kuppuswamy@linux.intel.com&gt;
</content>
</entry>
<entry>
<title>Revert "PCI: Use to_pci_driver() instead of pci_dev-&gt;driver"</title>
<updated>2021-11-11T19:36:22+00:00</updated>
<author>
<name>Bjorn Helgaas</name>
<email>bhelgaas@google.com</email>
</author>
<published>2021-11-10T18:03:34+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=e0217c5ba10d7bf640f038b2feae58e630f2f958'/>
<id>urn:sha1:e0217c5ba10d7bf640f038b2feae58e630f2f958</id>
<content type='text'>
This reverts commit 2a4d9408c9e8b6f6fc150c66f3fef755c9e20d4a.

Robert reported a NULL pointer dereference caused by the PCI core
(local_pci_probe()) calling the i2c_designware_pci driver's
.runtime_resume() method before the .probe() method.  i2c_dw_pci_resume()
depends on initialization done by i2c_dw_pci_probe().

Prior to 2a4d9408c9e8 ("PCI: Use to_pci_driver() instead of
pci_dev-&gt;driver"), pci_pm_runtime_resume() avoided calling the
.runtime_resume() method because pci_dev-&gt;driver had not been set yet.

2a4d9408c9e8 and b5f9c644eb1b ("PCI: Remove struct pci_dev-&gt;driver"),
removed pci_dev-&gt;driver, replacing it by device-&gt;driver, which *has* been
set by this time, so pci_pm_runtime_resume() called the .runtime_resume()
method when it previously had not.

Fixes: 2a4d9408c9e8 ("PCI: Use to_pci_driver() instead of pci_dev-&gt;driver")
Link: https://lore.kernel.org/linux-i2c/CAP145pgdrdiMAT7=-iB1DMgA7t_bMqTcJL4N0=6u8kNY3EU0dw@mail.gmail.com/
Reported-by: Robert Święcki &lt;robert@swiecki.net&gt;
Tested-by: Robert Święcki &lt;robert@swiecki.net&gt;
Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
</content>
</entry>
<entry>
<title>PCI: Use to_pci_driver() instead of pci_dev-&gt;driver</title>
<updated>2021-10-18T14:20:15+00:00</updated>
<author>
<name>Uwe Kleine-König</name>
<email>u.kleine-koenig@pengutronix.de</email>
</author>
<published>2021-10-12T21:11:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=2a4d9408c9e8b6f6fc150c66f3fef755c9e20d4a'/>
<id>urn:sha1:2a4d9408c9e8b6f6fc150c66f3fef755c9e20d4a</id>
<content type='text'>
Struct pci_driver contains a struct device_driver, so for PCI devices, it's
easy to convert a device_driver * to a pci_driver * with to_pci_driver().
The device_driver * is in struct device, so we don't need to also keep
track of the pci_driver * in struct pci_dev.

Replace pci_dev-&gt;driver with to_pci_driver().  This is a step toward
removing pci_dev-&gt;driver.

[bhelgaas: split to separate patch]
Link: https://lore.kernel.org/r/20211004125935.2300113-11-u.kleine-koenig@pengutronix.de
Signed-off-by: Uwe Kleine-König &lt;u.kleine-koenig@pengutronix.de&gt;
Signed-off-by: Bjorn Helgaas &lt;bhelgaas@google.com&gt;
</content>
</entry>
</feed>
