summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2019-11-21PCI/PM: Move pci_dev_wait() definition earlierVidya Sagar1-41/+41
Move the definition of pci_dev_wait() above pci_power_up() so that it can be called from the latter with no change in functionality. This is a pure code move with no functional change. Link: https://lore.kernel.org/r/20191120051743.23124-1-vidyas@nvidia.com Signed-off-by: Vidya Sagar <vidyas@nvidia.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2019-11-21PCI/PM: Add missing link delays required by the PCIe specMika Westerberg3-7/+133
Currently Linux does not follow PCIe spec regarding the required delays after reset. A concrete example is a Thunderbolt add-in-card that consists of a PCIe switch and two PCIe endpoints: +-1b.0-[01-6b]----00.0-[02-6b]--+-00.0-[03]----00.0 TBT controller +-01.0-[04-36]-- DS hotplug port +-02.0-[37]----00.0 xHCI controller \-04.0-[38-6b]-- DS hotplug port The root port (1b.0) and the PCIe switch downstream ports are all PCIe Gen3 so they support 8GT/s link speeds. We wait for the PCIe hierarchy to enter D3cold (runtime): pcieport 0000:00:1b.0: power state changed by ACPI to D3cold When it wakes up from D3cold, according to the PCIe 5.0 section 5.8 the PCIe switch is put to reset and its power is re-applied. This means that we must follow the rules in PCIe 5.0 section 6.6.1. For the PCIe Gen3 ports we are dealing with here, the following applies: With a Downstream Port that supports Link speeds greater than 5.0 GT/s, software must wait a minimum of 100 ms after Link training completes before sending a Configuration Request to the device immediately below that Port. Software can determine when Link training completes by polling the Data Link Layer Link Active bit or by setting up an associated interrupt (see Section 6.7.3.3). Translating this into the above topology we would need to do this (DLLLA stands for Data Link Layer Link Active): 0000:00:1b.0: wait for 100 ms after DLLLA is set before access to 0000:01:00.0 0000:02:00.0: wait for 100 ms after DLLLA is set before access to 0000:03:00.0 0000:02:02.0: wait for 100 ms after DLLLA is set before access to 0000:37:00.0 I've instrumented the kernel with some additional logging so we can see the actual delays performed: pcieport 0000:00:1b.0: power state changed by ACPI to D0 pcieport 0000:00:1b.0: waiting for D3cold delay of 100 ms pcieport 0000:00:1b.0: waiting for D3hot delay of 10 ms pcieport 0000:02:01.0: waiting for D3hot delay of 10 ms pcieport 0000:02:04.0: waiting for D3hot delay of 10 ms For the switch upstream port (01:00.0 reachable through 00:1b.0 root port) we wait for 100 ms but not taking into account the DLLLA requirement. We then wait 10 ms for D3hot -> D0 transition of the root port and the two downstream hotplug ports. This means that we deviate from what the spec requires. Performing the same check for system sleep (s2idle) transitions it turns out to be even worse. None of the mandatory delays are performed. If this would be S3 instead of s2idle then according to PCI FW spec 3.2 section 4.6.8. there is a specific _DSM that allows the OS to skip the delays but this platform does not provide the _DSM and does not go to S3 anyway so no firmware is involved that could already handle these delays. On this particular platform these delays are not actually needed because there is an additional delay as part of the ACPI power resource that is used to turn on power to the hierarchy but since that additional delay is not required by any of standards (PCIe, ACPI) it is not present in the Intel Ice Lake, for example where missing the mandatory delays causes pciehp to start tearing down the stack too early (links are not yet trained). Below is an example how it looks like when this happens: pcieport 0000:83:04.0: pciehp: Slot(4): Card not present pcieport 0000:87:04.0: PME# disabled pcieport 0000:83:04.0: pciehp: pciehp_unconfigure_device: domain:bus:dev = 0000:86:00 pcieport 0000:86:00.0: Refused to change power state, currently in D3 pcieport 0000:86:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x201ff) pcieport 0000:86:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0) ... There is also one reported case (see the bugzilla link below) where the missing delay causes xHCI on a Titan Ridge controller fail to runtime resume when USB-C dock is plugged. This does not involve pciehp but instead the PCI core fails to runtime resume the xHCI device: pcieport 0000:04:02.0: restoring config space at offset 0xc (was 0x10000, writing 0x10020) pcieport 0000:04:02.0: restoring config space at offset 0x4 (was 0x100000, writing 0x100406) xhci_hcd 0000:39:00.0: Refused to change power state, currently in D3 xhci_hcd 0000:39:00.0: restoring config space at offset 0x3c (was 0xffffffff, writing 0x1ff) xhci_hcd 0000:39:00.0: restoring config space at offset 0x38 (was 0xffffffff, writing 0x0) ... Add a new function pci_bridge_wait_for_secondary_bus() that is called on PCI core resume and runtime resume paths accordingly if the bridge entered D3cold (and thus went through reset). This is second attempt to add the missing delays. The previous solution in c2bf1fc212f7 ("PCI: Add missing link delays required by the PCIe spec") was reverted because of two issues it caused: 1. One system become unresponsive after S3 resume due to PME service spinning in pcie_pme_work_fn(). The root port in question reports that the xHCI sent PME but the xHCI device itself does not have PME status set. The PME status bit is never cleared in the root port resulting the indefinite loop in pcie_pme_work_fn(). 2. Slows down resume if the root/downstream port does not support Data Link Layer Active Reporting because pcie_wait_for_link_delay() waits 1100 ms in that case. This version should avoid the above issues because we restrict the delay to happen only if the port went into D3cold. Link: https://lore.kernel.org/linux-pci/SL2P216MB01878BBCD75F21D882AEEA2880C60@SL2P216MB0187.KORP216.PROD.OUTLOOK.COM/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=203885 Link: https://lore.kernel.org/r/20191112091617.70282-3-mika.westerberg@linux.intel.com Reported-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2019-11-21PCI/PM: Add pcie_wait_for_link_delay()Mika Westerberg1-3/+18
Add pcie_wait_for_link_delay(). Similar to pcie_wait_for_link() but allows passing custom activation delay in milliseconds. Link: https://lore.kernel.org/r/20191112091617.70282-2-mika.westerberg@linux.intel.com Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Return error when changing power state from D3coldBjorn Helgaas1-0/+6
pci_raw_set_power_state() uses the Power Management capability to change a device's power state. The capability is in config space, which is accessible in D0, D1, D2, and D3hot, but not in D3cold. If we call pci_raw_set_power_state() on a device that's in D3cold, config reads fail and return ~0 data, which we erroneously interpreted as "the device is in D3hot", leading to messages like this: pcieport 0000:03:00.0: Refused to change power state, currently in D3 The PCI_PM_CTRL has several RsvdP fields, so ~0 is never a valid register value. If we get that value, print a more informative message and return an error. Changing the power state of a device from D3cold must be done by a platform power management method or some other non-config space mechanism. Link: https://lore.kernel.org/r/20190822200551.129039-4-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-21PCI/PM: Decode D3cold power state correctlyBjorn Helgaas1-7/+10
Use pci_power_name() to print pci_power_t correctly. This changes: "state 0" or "D0" to "D0" "state 1" or "D1" to "D1" "state 2" or "D2" to "D2" "state 3" or "D3" to "D3hot" "state 4" or "D4" to "D3cold" Changes dmesg logging only, no other functional change intended. Link: https://lore.kernel.org/r/20190822200551.129039-3-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-21PCI/PM: Fold __pci_complete_power_transition() into its callerRafael J. Wysocki1-23/+7
Because pci_set_power_state() has become the only caller of __pci_complete_power_transition(), there is no need for the latter to be a separate function any more, so fold it into the former, drop a redundant check and reduce the number of lines of code somewhat. Code rearrangement, no intentional functional impact. Link: https://lore.kernel.org/r/15576968.k611qn3UU0@kreacher Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-21PCI/PM: Avoid exporting __pci_complete_power_transition()Rafael J. Wysocki3-5/+5
Notice that radeon_set_suspend(), which is the only caller of __pci_complete_power_transition() outside of pci.c, really only cares about the pci_platform_power_transition() invoked by it, so export the latter instead of it, update the radeon driver to call pci_platform_power_transition() directly and make __pci_complete_power_transition() static. Code rearrangement, no intentional functional impact. Link: https://lore.kernel.org/r/1731661.ykamz2Tiuf@kreacher Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-21PCI/PM: Fold __pci_start_power_transition() into its callerRafael J. Wysocki1-30/+18
Because pci_power_up() has become the only caller of __pci_start_power_transition(), there is no need for the latter to be a separate function any more, so fold it into the former, drop a redundant check and reduce the number of lines of code somewhat. Code rearrangement, no intentional functional impact. Link: https://lore.kernel.org/r/3458080.lsoDbfkST9@kreacher Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-21PCI/PM: Use pci_power_up() in pci_set_power_state()Rafael J. Wysocki2-13/+14
Make it explicitly clear that the code to put devices into D0 in pci_set_power_state() and in pci_pm_default_resume_early() is the same by making the latter use pci_power_up() for transitions into D0. Code rearrangement, no intentional functional impact. Link: https://lore.kernel.org/r/2520019.OZ1nXS5aSj@kreacher Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-21PCI/PM: Move power state update away from pci_power_up()Rafael J. Wysocki2-1/+1
Move the invocation of pci_update_current_state() from pci_power_up() to pci_pm_default_resume_early(), which is the only caller of that function. Preparatory change, no functional impact. Link: https://lore.kernel.org/r/37482337.udjOGdOKNb@kreacher Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
2019-11-21PCI/PM: Remove unused pci_driver.suspend_late() hookBjorn Helgaas3-28/+6
The struct pci_driver.suspend_late() hook is one of the legacy PCI power management callbacks, and there are no remaining users of it. Remove it. Link: https://lore.kernel.org/r/20191101204558.210235-7-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Remove unused pci_driver.resume_early() hookBjorn Helgaas3-20/+7
The struct pci_driver.resume_early() hook is one of the legacy PCI power management callbacks, and there are no remaining users of it. Remove it. Link: https://lore.kernel.org/r/20191101204558.210235-6-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21xen-platform: Convert to generic power managementBjorn Helgaas1-5/+9
Convert xen-platform from the legacy PCI power management callbacks to the generic operations. This is one step towards removing support for the legacy PCI callbacks. The generic .resume_noirq() operation is called by pci_pm_resume_noirq() at the same point the legacy PCI .resume_early() callback was, so this patch should not change the xen-platform behavior. Link: https://lore.kernel.org/r/20191101204558.210235-5-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: KarimAllah Ahmed <karahmed@amazon.de>
2019-11-21PCI/PM: Simplify pci_set_power_state()Bjorn Helgaas1-2/+2
Check for the PCI_DEV_FLAGS_NO_D3 quirk early, before calling __pci_start_power_transition(). This way all the cases where we don't need to do anything at all are checked up front. This doesn't fix anything because if the caller requested D3hot or D3cold, __pci_start_power_transition() is a no-op. But calling it is pointless and makes the code harder to analyze. Link: https://lore.kernel.org/r/20191101204558.210235-4-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Expand PM reset messages to mention D3hot (not just D3)Bjorn Helgaas1-1/+1
pci_pm_reset() resets a device by putting it in D3hot and bringing it back to D0. Clarify related messages to mention "D3hot" explicitly instead of just "D3". Link: https://lore.kernel.org/r/20191101204558.210235-3-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Apply D2 delay as milliseconds, not microsecondsBjorn Helgaas1-1/+1
PCI_PM_D2_DELAY is defined as 200, which is milliseconds, but previously we used udelay(), which only waited for 200 microseconds. Use msleep() instead so we wait the correct amount of time. See PCIe r5.0, sec 5.9. Link: https://lore.kernel.org/r/20191101204558.210235-2-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Use pci_WARN() to include device informationBjorn Helgaas2-17/+25
Add and use pci_WARN() wrappers so warnings include device information. Link: https://lore.kernel.org/r/20191017212851.54237-3-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Use PCI dev_printk() wrappers for consistencyBjorn Helgaas1-5/+6
Use the PCI dev_printk() wrappers for consistency with the rest of the PCI core. No functional change intended. Link: https://lore.kernel.org/r/20191017212851.54237-2-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Wrap long lines in documentationBjorn Helgaas1-14/+14
Documentation/power/pci.rst is wrapped to fit in 80 columns, but directory structure changes made a few lines longer. Wrap them so they all fit in 80 columns again. Link: https://lore.kernel.org/r/20191014230016.240912-7-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Note that PME can be generated from D0Bjorn Helgaas1-2/+2
Per PCIe r5.0 sec 7.5.2.1, PME may be generated from D0, so update Documentation/power/pci.rst to reflect that. Link: https://lore.kernel.org/r/20191016194450.68959-1-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Make power management op coding style consistentBjorn Helgaas1-40/+36
Some of the power management ops use this style: struct device_driver *drv = dev->driver; if (drv && drv->pm && drv->pm->prepare(dev)) drv->pm->prepare(dev); while others use this: const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL; if (pm && pm->runtime_resume) pm->runtime_resume(dev); Convert the first style to the second so they're all consistent. Remove local "error" variables when unnecessary. No functional change intended. Link: https://lore.kernel.org/r/20191014230016.240912-6-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Run resume fixups before disabling wakeup eventsBjorn Helgaas1-9/+7
pci_pm_resume() and pci_pm_restore() call pci_pm_default_resume(), which runs resume fixups before disabling wakeup events: static void pci_pm_default_resume(struct pci_dev *pci_dev) { pci_fixup_device(pci_fixup_resume, pci_dev); pci_enable_wake(pci_dev, PCI_D0, false); } pci_pm_runtime_resume() does both of these, but in the opposite order: pci_enable_wake(pci_dev, PCI_D0, false); pci_fixup_device(pci_fixup_resume, pci_dev); We should always use the same ordering unless there's a reason to do otherwise. Change pci_pm_runtime_resume() to call pci_pm_default_resume() instead of open-coding this, so the fixups are always done before disabling wakeup events. pci_pm_default_resume() is called from pci_pm_runtime_resume(), which is under #ifdef CONFIG_PM. If SUSPEND and HIBERNATION are disabled, PM_SLEEP is disabled also, so move pci_pm_default_resume() from #ifdef CONFIG_PM_SLEEP to #ifdef CONFIG_PM. Link: https://lore.kernel.org/r/20191014230016.240912-5-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Clear PCIe PME Status even for legacy power managementBjorn Helgaas1-2/+1
Previously, pci_pm_resume_noirq() cleared the PME Status bit in the Root Status register only if the device had no driver or the driver did not implement legacy power management. It should clear PME Status regardless of what sort of power management the driver supports, so do this before checking for legacy power management. This affects Root Ports and Root Complex Event Collectors, for which the usual driver is the PCIe portdrv, which implements new power management, so this change is just on principle, not to fix any actual defects. Fixes: a39bd851dccf ("PCI/PM: Clear PCIe PME Status bit in core, not PCIe port driver") Link: https://lore.kernel.org/r/20191014230016.240912-4-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Correct pci_pm_thaw_noirq() documentationBjorn Helgaas1-5/+5
According to the documentation, pci_pm_thaw_noirq() did not put the device into the full-power state and restore its standard configuration registers. This is incorrect, so update the documentation to match the code. Link: https://lore.kernel.org/r/20191014230016.240912-3-helgaas@kernel.org Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-21PCI/PM: Always return devices to D0 when thawingDexuan Cui1-6/+11
pci_pm_thaw_noirq() is supposed to return the device to D0 and restore its configuration registers, but previously it only did that for devices whose drivers implemented the new power management ops. Hibernation, e.g., via "echo disk > /sys/power/state", involves freezing devices, creating a hibernation image, thawing devices, writing the image, and powering off. The fact that thawing did not return devices with legacy power management to D0 caused errors, e.g., in this path: pci_pm_thaw_noirq if (pci_has_legacy_pm_support(pci_dev)) # true for Mellanox VF driver return pci_legacy_resume_early(dev) # ... legacy PM skips the rest pci_set_power_state(pci_dev, PCI_D0) pci_restore_state(pci_dev) pci_pm_thaw if (pci_has_legacy_pm_support(pci_dev)) pci_legacy_resume drv->resume mlx4_resume ... pci_enable_msix_range ... if (dev->current_state != PCI_D0) # <--- return -EINVAL; which caused these warnings: mlx4_core a6d1:00:02.0: INTx is not supported in multi-function mode, aborting PM: dpm_run_callback(): pci_pm_thaw+0x0/0xd7 returns -95 PM: Device a6d1:00:02.0 failed to thaw: error -95 Return devices to D0 and restore config registers for all devices, not just those whose drivers support new power management. [bhelgaas: also call pci_restore_state() before pci_legacy_resume_early(), update comment, add stable tag, commit log] Link: https://lore.kernel.org/r/KU1P153MB016637CAEAD346F0AA8E3801BFAD0@KU1P153MB0166.APCP153.PROD.OUTLOOK.COM Signed-off-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: stable@vger.kernel.org # v4.13+
2019-10-20Linux 5.4-rc4v5.4-rc4Linus Torvalds1-1/+1
2019-10-20Merge tag 'kbuild-fixes-v5.4-2' of ↵Linus Torvalds3-6/+9
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull more Kbuild fixes from Masahiro Yamada: - fix a bashism of setlocalversion - do not use the too new --sort option of tar * tag 'kbuild-fixes-v5.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kheaders: substituting --sort in archive creation scripts: setlocalversion: fix a bashism kbuild: update comment about KBUILD_ALLDIRS
2019-10-20Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds6-38/+84
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A small set of x86 fixes: - Prevent a NULL pointer dereference in the X2APIC code in case of a CPU hotplug failure. - Prevent boot failures on HP superdome machines by invalidating the level2 kernel pagetable entries outside of the kernel area as invalid so BIOS reserved space won't be touched unintentionally. Also ensure that memory holes are rounded up to the next PMD boundary correctly. - Enable X2APIC support on Hyper-V to prevent boot failures. - Set the paravirt name when running on Hyper-V for consistency - Move a function under the appropriate ifdef guard to prevent build warnings" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot/acpi: Move get_cmdline_acpi_rsdp() under #ifdef guard x86/hyperv: Set pv_info.name to "Hyper-V" x86/apic/x2apic: Fix a NULL pointer deref when handling a dying cpu x86/hyperv: Make vapic support x2apic mode x86/boot/64: Round memory hole size up to next PMD page x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area
2019-10-20Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds5-17/+43
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Thomas Gleixner: "A small set of irq chip driver fixes and updates: - Update the SIFIVE PLIC interrupt driver to use the fasteoi handler to address the shortcomings of the existing flow handling which was prone to lose interrupts - Use the proper limit for GIC interrupt line numbers - Add retrigger support for the recently merged Anapurna Labs Fabric interrupt controller to make it complete - Enable the ATMEL AIC5 interrupt controller driver on the new SAM9X60 SoC" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/sifive-plic: Switch to fasteoi flow irqchip/gic-v3: Fix GIC_LINE_NR accessor irqchip/atmel-aic5: Add support for sam9x60 irqchip irqchip/al-fic: Add support for irq retrigger
2019-10-20Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull hrtimer fixlet from Thomas Gleixner: "A single commit annotating the lockcless access to timer->base with READ_ONCE() and adding the WRITE_ONCE() counterparts for completeness" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: hrtimer: Annotate lockless access to timer->base
2019-10-20Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds1-4/+6
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull stop-machine fix from Thomas Gleixner: "A single fix, amending stop machine with WRITE/READ_ONCE() to address the fallout of KCSAN" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: stop_machine: Avoid potential race behaviour
2019-10-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds179-892/+1486
Pull networking fixes from David Miller: "I was battling a cold after some recent trips, so quite a bit piled up meanwhile, sorry about that. Highlights: 1) Fix fd leak in various bpf selftests, from Brian Vazquez. 2) Fix crash in xsk when device doesn't support some methods, from Magnus Karlsson. 3) Fix various leaks and use-after-free in rxrpc, from David Howells. 4) Fix several SKB leaks due to confusion of who owns an SKB and who should release it in the llc code. From Eric Biggers. 5) Kill a bunc of KCSAN warnings in TCP, from Eric Dumazet. 6) Jumbo packets don't work after resume on r8169, as the BIOS resets the chip into non-jumbo mode during suspend. From Heiner Kallweit. 7) Corrupt L2 header during MPLS push, from Davide Caratti. 8) Prevent possible infinite loop in tc_ctl_action, from Eric Dumazet. 9) Get register bits right in bcmgenet driver, based upon chip version. From Florian Fainelli. 10) Fix mutex problems in microchip DSA driver, from Marek Vasut. 11) Cure race between route lookup and invalidation in ipv4, from Wei Wang. 12) Fix performance regression due to false sharing in 'net' structure, from Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (145 commits) net: reorder 'struct net' fields to avoid false sharing net: dsa: fix switch tree list net: ethernet: dwmac-sun8i: show message only when switching to promisc net: aquantia: add an error handling in aq_nic_set_multicast_list net: netem: correct the parent's backlog when corrupted packet was dropped net: netem: fix error path for corrupted GSO frames macb: propagate errors when getting optional clocks xen/netback: fix error path of xenvif_connect_data() net: hns3: fix mis-counting IRQ vector numbers issue net: usb: lan78xx: Connect PHY before registering MAC vsock/virtio: discard packets if credit is not respected vsock/virtio: send a credit update when buffer size is changed mlxsw: spectrum_trap: Push Ethernet header before reporting trap net: ensure correct skb->tstamp in various fragmenters net: bcmgenet: reset 40nm EPHY on energy detect net: bcmgenet: soft reset 40nm EPHYs before MAC init net: phy: bcm7xxx: define soft_reset for 40nm EPHY net: bcmgenet: don't set phydev->link from MAC net: Update address for MediaTek ethernet driver in MAINTAINERS ipv4: fix race condition between route lookup and invalidation ...
2019-10-19net: reorder 'struct net' fields to avoid false sharingEric Dumazet1-8/+17
Intel test robot reported a ~7% regression on TCP_CRR tests that they bisected to the cited commit. Indeed, every time a new TCP socket is created or deleted, the atomic counter net->count is touched (via get_net(net) and put_net(net) calls) So cpus might have to reload a contended cache line in net_hash_mix(net) calls. We need to reorder 'struct net' fields to move @hash_mix in a read mostly cache line. We move in the first cache line fields that can be dirtied often. We probably will have to address in a followup patch the __randomize_layout that was added in linux-4.13, since this might break our placement choices. Fixes: 355b98553789 ("netns: provide pure entropy for net_hash_mix()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19net: dsa: fix switch tree listVivien Didelot1-1/+1
If there are multiple switch trees on the device, only the last one will be listed, because the arguments of list_add_tail are swapped. Fixes: 83c0afaec7b7 ("net: dsa: Add new binding implementation") Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19net: ethernet: dwmac-sun8i: show message only when switching to promiscMans Rullgard1-1/+2
Printing the info message every time more than the max number of mac addresses are requested generates unnecessary log spam. Showing it only when the hw is not already in promiscous mode is equally informative without being annoying. Signed-off-by: Mans Rullgard <mans@mansr.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19net: aquantia: add an error handling in aq_nic_set_multicast_listChenwandun1-0/+2
add an error handling in aq_nic_set_multicast_list, it may not work when hw_multicast_list_set error; and at the same time it will remove gcc Wunused-but-set-variable warning. Signed-off-by: Chenwandun <chenwandun@huawei.com> Reviewed-by: Igor Russkikh <igor.russkikh@aquantia.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19Merge branch 'netem-fix-further-issues-with-packet-corruption'David S. Miller1-3/+8
Jakub Kicinski says: ==================== net: netem: fix further issues with packet corruption This set is fixing two more issues with the netem packet corruption. First patch (which was previously posted) avoids NULL pointer dereference if the first frame gets freed due to allocation or checksum failure. v2 improves the clarity of the code a little as requested by Cong. Second patch ensures we don't return SUCCESS if the frame was in fact dropped. Thanks to this commit message for patch 1 no longer needs the "this will still break with a single-frame failure" disclaimer. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19net: netem: correct the parent's backlog when corrupted packet was droppedJakub Kicinski1-0/+2
If packet corruption failed we jump to finish_segs and return NET_XMIT_SUCCESS. Seeing success will make the parent qdisc increment its backlog, that's incorrect - we need to return NET_XMIT_DROP. Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19net: netem: fix error path for corrupted GSO framesJakub Kicinski1-3/+6
To corrupt a GSO frame we first perform segmentation. We then proceed using the first segment instead of the full GSO skb and requeue the rest of the segments as separate packets. If there are any issues with processing the first segment we still want to process the rest, therefore we jump to the finish_segs label. Commit 177b8007463c ("net: netem: fix backlog accounting for corrupted GSO frames") started using the pointer to the first segment in the "rest of segments processing", but as mentioned above the first segment may had already been freed at this point. Backlog corrections for parent qdiscs have to be adjusted. Fixes: 177b8007463c ("net: netem: fix backlog accounting for corrupted GSO frames") Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19macb: propagate errors when getting optional clocksMichael Tretter1-6/+6
The tx_clk, rx_clk, and tsu_clk are optional. Currently the macb driver marks clock as not available if it receives an error when trying to get a clock. This is wrong, because a clock controller might return -EPROBE_DEFER if a clock is not available, but will eventually become available. In these cases, the driver would probe successfully but will never be able to adjust the clocks, because the clocks were not available during probe, but became available later. For example, the clock controller for the ZynqMP is implemented in the PMU firmware and the clocks are only available after the firmware driver has been probed. Use devm_clk_get_optional() in instead of devm_clk_get() to get the optional clock and propagate all errors to the calling function. Signed-off-by: Michael Tretter <m.tretter@pengutronix.de> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Tested-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19xen/netback: fix error path of xenvif_connect_data()Juergen Gross1-1/+0
xenvif_connect_data() calls module_put() in case of error. This is wrong as there is no related module_get(). Remove the superfluous module_put(). Fixes: 279f438e36c0a7 ("xen-netback: Don't destroy the netdev until the vif is shut down") Cc: <stable@vger.kernel.org> # 3.12 Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19net: hns3: fix mis-counting IRQ vector numbers issueYonglong Liu6-6/+58
Currently, the num_msi_left means the vector numbers of NIC, but if the PF supported RoCE, it contains the vector numbers of NIC and RoCE(Not expected). This may cause interrupts lost in some case, because of the NIC module used the vector resources which belongs to RoCE. This patch adds a new variable num_nic_msi to store the vector numbers of NIC, and adjust the default TQP numbers and rss_size according to the value of num_nic_msi. Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19Merge branch 'akpm' (patches from Andrew)Linus Torvalds27-139/+165
Merge misc fixes from Andrew Morton: "Rather a lot of fixes, almost all affecting mm/" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (26 commits) scripts/gdb: fix debugging modules on s390 kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe register mm/thp: allow dropping THP from page cache mm/vmscan.c: support removing arbitrary sized pages from mapping mm/thp: fix node page state in split_huge_page_to_list() proc/meminfo: fix output alignment mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition mm: include <linux/huge_mm.h> for is_vma_temporary_stack zram: fix race between backing_dev_show and backing_dev_store mm/memcontrol: update lruvec counters in mem_cgroup_move_account ocfs2: fix panic due to ocfs2_wq is null hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic() mm: memblock: do not enforce current limit for memblock_phys* family mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size mm/gup: fix a misnamed "write" argument, and a related bug mm/gup_benchmark: add a missing "w" to getopt string ocfs2: fix error handling in ocfs2_setattr() mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release mm/memunmap: don't access uninitialized memmap in memunmap_pages() ...
2019-10-19scripts/gdb: fix debugging modules on s390Ilya Leoshkevich1-1/+7
Currently lx-symbols assumes that module text is always located at module->core_layout->base, but s390 uses the following layout: +------+ <- module->core_layout->base | GOT | +------+ <- module->core_layout->base + module->arch->plt_offset | PLT | +------+ <- module->core_layout->base + module->arch->plt_offset + | TEXT | module->arch->plt_size +------+ Therefore, when trying to debug modules on s390, all the symbol addresses are skewed by plt_offset + plt_size. Fix by adding plt_offset + plt_size to module_addr in load_module_symbols(). Link: http://lkml.kernel.org/r/20191017085917.81791-1-iii@linux.ibm.com Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19kernel/events/uprobes.c: only do FOLL_SPLIT_PMD for uprobe registerSong Liu1-2/+11
Attaching uprobe to text section in THP splits the PMD mapped page table into PTE mapped entries. On uprobe detach, we would like to regroup PMD mapped page table entry to regain performance benefit of THP. However, the regroup is broken For perf_event based trace_uprobe. This is because perf_event based trace_uprobe calls uprobe_unregister twice on close: first in TRACE_REG_PERF_CLOSE, then in TRACE_REG_PERF_UNREGISTER. The second call will split the PMD mapped page table entry, which is not the desired behavior. Fix this by only use FOLL_SPLIT_PMD for uprobe register case. Add a WARN() to confirm uprobe unregister never work on huge pages, and abort the operation when this WARN() triggers. Link: http://lkml.kernel.org/r/20191017164223.2762148-6-songliubraving@fb.com Fixes: 5a52c9df62b4 ("uprobe: use FOLL_SPLIT_PMD instead of FOLL_SPLIT") Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19mm/thp: allow dropping THP from page cacheKirill A. Shutemov1-0/+12
Once a THP is added to the page cache, it cannot be dropped via /proc/sys/vm/drop_caches. Fix this issue with proper handling in invalidate_mapping_pages(). Link: http://lkml.kernel.org/r/20191017164223.2762148-5-songliubraving@fb.com Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Song Liu <songliubraving@fb.com> Tested-by: Song Liu <songliubraving@fb.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19mm/vmscan.c: support removing arbitrary sized pages from mappingWilliam Kucharski1-4/+1
__remove_mapping() assumes that pages can only be either base pages or HPAGE_PMD_SIZE. Ask the page what size it is. Link: http://lkml.kernel.org/r/20191017164223.2762148-4-songliubraving@fb.com Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Signed-off-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19mm/thp: fix node page state in split_huge_page_to_list()Kirill A. Shutemov1-2/+7
Make sure split_huge_page_to_list() handles the state of shmem THP and file THP properly. Link: http://lkml.kernel.org/r/20191017164223.2762148-3-songliubraving@fb.com Fixes: 60fbf0ab5da1 ("mm,thp: stats for file backed THP") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Song Liu <songliubraving@fb.com> Tested-by: Song Liu <songliubraving@fb.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19proc/meminfo: fix output alignmentKirill A. Shutemov1-2/+2
Patch series "Fixes for THP in page cache", v2. This patch (of 5): Add extra space for FileHugePages and FilePmdMapped, so the output is aligned with other rows. Link: http://lkml.kernel.org/r/20191017164223.2762148-2-songliubraving@fb.com Fixes: 60fbf0ab5da1 ("mm,thp: stats for file backed THP") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Song Liu <songliubraving@fb.com> Tested-by: Song Liu <songliubraving@fb.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-19mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batchBen Dooks (Codethink)1-0/+1
mm_init.c needs to include <linux/mman.h> for the definition of vm_committed_as_batch. Fixes the following sparse warning: mm/mm_init.c:141:5: warning: symbol 'vm_committed_as_batch' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20191016091509.26708-1-ben.dooks@codethink.co.uk Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>