Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
- Asynchronous address range scrub:
Given the capacities of next generation persistent memory devices a
scrub operation to find all poison may take 10s of seconds. We
want this scrub work to be done asynchronously with the rest of
system initialization, so we move it out of line from the NFIT
probing, i.e. acpi_nfit_add().
- Clear poison:
ACPI 6.1 introduces the ability to send "clear error" commands to
the ACPI0012:00 device representing the root of an "nvdimm bus".
Similar to relocating a bad block on a disk, this support clears
media errors in response to a write.
- Persistent memory resource tracking:
A persistent memory range may be designated as simply "reserved" by
platform firmware in the efi/e820 memory map. Later when the NFIT
driver loads it discovers that the range is "Persistent Memory".
The NFIT bus driver inserts a resource to advertise that
"persistent" attribute in the system resource tree for /proc/iomem
and kernel-internal usages.
- Miscellaneous cleanups and fixes:
Workaround section misaligned pmem ranges when allocating a struct
page memmap, fix handling of the read-only case in the ioctl path,
and clean up block device major number allocation.
* tag 'libnvdimm-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (26 commits)
libnvdimm, pmem: clear poison on write
libnvdimm, pmem: fix kmap_atomic() leak in error path
nvdimm/btt: don't allocate unused major device number
nvdimm/blk: don't allocate unused major device number
pmem: don't allocate unused major device number
ACPI: Change NFIT driver to insert new resource
resource: Export insert_resource and remove_resource
resource: Add remove_resource interface
resource: Change __request_region to inherit from immediate parent
libnvdimm, pmem: fix ia64 build, use PHYS_PFN
nfit, libnvdimm: clear poison command support
libnvdimm, pfn: 'resource'-address and 'size' attributes for pfn devices
libnvdimm, pmem: adjust for section collisions with 'System RAM'
libnvdimm, pmem: fix 'pfn' support for section-misaligned namespaces
libnvdimm: Fix security issue with DSM IOCTL.
libnvdimm: Clean-up access mode check.
tools/testing/nvdimm: expand ars unit testing
nfit: disable userspace initiated ars during scrub
nfit: scrub and register regions in a workqueue
nfit, libnvdimm: async region scrub workqueue
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- Most attention this cycle went to optimizing blk-mq request-based DM
(dm-mq) that is used exclussively by DM multipath:
- A stable fix for dm-mq that eliminates excessive context
switching offers the biggest performance improvement (for both
IOPs and throughput).
- But more work is needed, during the next cycle, to reduce
spinlock contention in DM multipath on large NUMA systems.
- A stable fix for a NULL pointer seen when DM stats is enabled on a DM
multipath device that must requeue an IO due to path failure.
- A stable fix for DM snapshot to disallow the COW and origin devices
from being identical. This amounts to graceful failure in the face
of userspace error because these devices shouldn't ever be identical.
- Stable fixes for DM cache and DM thin provisioning to address crashes
seen if/when their respective metadata device experiences failures
that cause the transition to 'fail_io' mode.
- The DM cache 'mq' policy is now an alias for the 'smq' policy. The
'smq' policy proved to be consistently better than 'mq'. As such
'mq', with all its complex user-facing tunables, has been eliminated.
- Improve DM thin provisioning to consistently return -ENOSPC once the
thin-pool's data volume is out of space.
- Improve DM core to properly handle error propagation if
bio_integrity_clone() fails in clone_bio().
- Other small cleanups and improvements to DM core.
* tag 'dm-4.6-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (41 commits)
dm: fix rq_end_stats() NULL pointer in dm_requeue_original_request()
dm thin: consistently return -ENOSPC if pool has run out of data space
dm cache: bump the target version
dm cache: make sure every metadata function checks fail_io
dm: add missing newline between DM_DEBUG_BLOCK_STACK_TRACING and DM_BUFIO
dm cache policy smq: clarify that mq registration failure was for 'mq'
dm: return error if bio_integrity_clone() fails in clone_bio()
dm thin metadata: don't issue prefetches if a transaction abort has failed
dm snapshot: disallow the COW and origin devices from being identical
dm cache: make the 'mq' policy an alias for 'smq'
dm: drop unnecessary assignment of md->queue
dm: reorder 'struct mapped_device' members to fix alignment and holes
dm: remove dummy definition of 'struct dm_table'
dm: add 'dm_numa_node' module parameter
dm thin metadata: remove needless newline from subtree_dec() DMERR message
dm mpath: cleanup reinstate_path() et al based on code review
dm mpath: remove __pgpath_busy forward declaration, rename to pgpath_busy
dm mpath: switch from 'unsigned' to 'bool' for flags where appropriate
dm round robin: use percpu 'repeat_count' and 'current_path'
dm path selector: remove 'repeat_count' return from .select_path hook
...
|
|
Pull SCSI updates from James Bottomley:
"This pull includes driver updates from the usual suspects (stex, hpsa,
ncr5380, scsi_dh, qla2xxx, be2iscsi, hisi_sas, cxlflash, aacraid,
mp3sas, megaraid_sas, ibmvscsi, ufs) plus an assortment of
miscellaneous fixes.
The major user visible change of this pull is that we've moved from
monotonically increasing host number to an ida allocated one (meaning
the numbers get re-used) because someone managed to wrap the count in
an iscsi system. We don't believe there will be any adverse
consequences of this"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (230 commits)
MAINTAINERS: use new email address for James Bottomley
mpt3sas: Remove unnecessary synchronize_irq() before free_irq()
sg: fix dxferp in from_to case
cxlflash: Increase cmd_per_lun for better throughput
cxlflash: Fix to avoid unnecessary scan with internal LUNs
cxlflash: Reorder user context initialization
cxlflash: Simplify attach path error cleanup
cxlflash: Split out context initialization
cxlflash: Unmap problem state area before detaching master context
cxlflash: Simplify PCI registration
scsi: storvsc: fix SRB_STATUS_ABORTED handling
be2iscsi: set the boot_kset pointer to NULL in case of failure
sd: Fix discard granularity when LBPRZ=1
be2iscsi: Remove unnecessary synchronize_irq() before free_irq()
scsi_sysfs: call 'device_add' after attaching device handler
scsi_dh_emc: update 'access_state' field
scsi_dh_rdac: update 'access_state' field
scsi_dh_alua: update 'access_state' field
scsi_dh_alua: use common definitions for ALUA state
scsi: Add 'access_state' and 'preferred_path' attribute
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft
Pull iscsi_ibft update from Konrad Rzeszutek Wilk:
"A simple patch that had been rattling around in SuSE repo"
* 'stable/for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft:
iscsi_ibft: Add prefix-len attr and display netmask
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"PCI changes for v4.6:
Enumeration:
- Disable IO/MEM decoding for devices with non-compliant BARs (Bjorn Helgaas)
- Mark Broadwell-EP Home Agent & PCU as having non-compliant BARs (Bjorn Helgaas
Resource management:
- Mark shadow copy of VGA ROM as IORESOURCE_PCI_FIXED (Bjorn Helgaas)
- Don't assign or reassign immutable resources (Bjorn Helgaas)
- Don't enable/disable ROM BAR if we're using a RAM shadow copy (Bjorn Helgaas)
- Set ROM shadow location in arch code, not in PCI core (Bjorn Helgaas)
- Remove arch-specific IORESOURCE_ROM_SHADOW size from sysfs (Bjorn Helgaas)
- ia64: Use ioremap() instead of open-coded equivalent (Bjorn Helgaas)
- ia64: Keep CPU physical (not virtual) addresses in shadow ROM resource (Bjorn Helgaas)
- MIPS: Keep CPU physical (not virtual) addresses in shadow ROM resource (Bjorn Helgaas)
- Remove unused IORESOURCE_ROM_COPY and IORESOURCE_ROM_BIOS_COPY (Bjorn Helgaas)
- Don't leak memory if sysfs_create_bin_file() fails (Bjorn Helgaas)
- rcar: Remove PCI_PROBE_ONLY handling (Lorenzo Pieralisi)
- designware: Remove PCI_PROBE_ONLY handling (Lorenzo Pieralisi)
Virtualization:
- Wait for up to 1000ms after FLR reset (Alex Williamson)
- Support SR-IOV on any function type (Kelly Zytaruk)
- Add ACS quirk for all Cavium devices (Manish Jaggi)
AER:
- Rename pci_ops_aer to aer_inj_pci_ops (Bjorn Helgaas)
- Restore pci_ops pointer while calling original pci_ops (David Daney)
- Fix aer_inject error codes (Jean Delvare)
- Use dev_warn() in aer_inject (Jean Delvare)
- Log actual error causes in aer_inject (Jean Delvare)
- Log aer_inject error injections (Jean Delvare)
VPD:
- Prevent VPD access for buggy devices (Babu Moger)
- Move pci_read_vpd() and pci_write_vpd() close to other VPD code (Bjorn Helgaas)
- Move pci_vpd_release() from header file to pci/access.c (Bjorn Helgaas)
- Remove struct pci_vpd_ops.release function pointer (Bjorn Helgaas)
- Rename VPD symbols to remove unnecessary "pci22" (Bjorn Helgaas)
- Fold struct pci_vpd_pci22 into struct pci_vpd (Bjorn Helgaas)
- Sleep rather than busy-wait for VPD access completion (Bjorn Helgaas)
- Update VPD definitions (Hannes Reinecke)
- Allow access to VPD attributes with size 0 (Hannes Reinecke)
- Determine actual VPD size on first access (Hannes Reinecke)
Generic host bridge driver:
- Move structure definitions to separate header file (David Daney)
- Add pci_host_common_probe(), based on gen_pci_probe() (David Daney)
- Expose pci_host_common_probe() for use by other drivers (David Daney)
Altera host bridge driver:
- Fix altera_pcie_link_is_up() (Ley Foon Tan)
Cavium ThunderX host bridge driver:
- Add PCIe host driver for ThunderX processors (David Daney)
- Add driver for ThunderX-pass{1,2} on-chip devices (David Daney)
Freescale i.MX6 host bridge driver:
- Add DT bindings to configure PHY Tx driver settings (Justin Waters)
- Move imx6_pcie_reset_phy() near other PHY handling functions (Lucas Stach)
- Move PHY reset into imx6_pcie_establish_link() (Lucas Stach)
- Remove broken Gen2 workaround (Lucas Stach)
- Move link up check into imx6_pcie_wait_for_link() (Lucas Stach)
Freescale Layerscape host bridge driver:
- Add "fsl,ls2085a-pcie" compatible ID (Yang Shi)
Intel VMD host bridge driver:
- Attach VMD resources to parent domain's resource tree (Jon Derrick)
- Set bus resource start to 0 (Keith Busch)
Microsoft Hyper-V host bridge driver:
- Add fwnode_handle to x86 pci_sysdata (Jake Oshins)
- Look up IRQ domain by fwnode_handle (Jake Oshins)
- Add paravirtual PCI front-end for Microsoft Hyper-V VMs (Jake Oshins)
NVIDIA Tegra host bridge driver:
- Add pci_ops.{add,remove}_bus() callbacks (Thierry Reding)
- Implement ->{add,remove}_bus() callbacks (Thierry Reding)
- Remove unused struct tegra_pcie.num_ports field (Thierry Reding)
- Track bus -> CPU mapping (Thierry Reding)
- Remove misleading PHYS_OFFSET (Thierry Reding)
Renesas R-Car host bridge driver:
- Depend on ARCH_RENESAS, not ARCH_SHMOBILE (Simon Horman)
Synopsys DesignWare host bridge driver:
- ARC: Add PCI support (Joao Pinto)
- Add generic dw_pcie_wait_for_link() (Joao Pinto)
- Add default link up check if sub-driver doesn't override (Joao Pinto)
- Add driver for prototyping kits based on ARC SDP (Joao Pinto)
TI Keystone host bridge driver:
- Defer probing if devm_phy_get() returns -EPROBE_DEFER (Shawn Lin)
Xilinx AXI host bridge driver:
- Use of_pci_get_host_bridge_resources() to parse DT (Bharat Kumar Gogada)
- Remove dependency on ARM-specific struct hw_pci (Bharat Kumar Gogada)
- Don't call pci_fixup_irqs() on Microblaze (Bharat Kumar Gogada)
- Update Zynq binding with Microblaze node (Bharat Kumar Gogada)
- microblaze: Support generic Xilinx AXI PCIe Host Bridge IP driver (Bharat Kumar Gogada)
Xilinx NWL host bridge driver:
- Add support for Xilinx NWL PCIe Host Controller (Bharat Kumar Gogada)
Miscellaneous:
- Check device_attach() return value always (Bjorn Helgaas)
- Move pci_set_flags() from asm-generic/pci-bridge.h to linux/pci.h (Bjorn Helgaas)
- Remove includes of empty asm-generic/pci-bridge.h (Bjorn Helgaas)
- ARM64: Remove generated include of asm-generic/pci-bridge.h (Bjorn Helgaas)
- Remove empty asm-generic/pci-bridge.h (Bjorn Helgaas)
- Remove includes of asm/pci-bridge.h (Bjorn Helgaas)
- Consolidate PCI DMA constants and interfaces in linux/pci-dma-compat.h (Bjorn Helgaas)
- unicore32: Remove unused HAVE_ARCH_PCI_SET_DMA_MASK definition (Bjorn Helgaas)
- Cleanup pci/pcie/Kconfig whitespace (Andreas Ziegler)
- Include pci/hotplug Kconfig directly from pci/Kconfig (Bjorn Helgaas)
- Include pci/pcie/Kconfig directly from pci/Kconfig (Bogicevic Sasa)
- frv: Remove stray pci_{alloc,free}_consistent() declaration (Christoph Hellwig)
- Move pci_dma_* helpers to common code (Christoph Hellwig)
- Add PCI_CLASS_SERIAL_USB_DEVICE definition (Heikki Krogerus)
- Add QEMU top-level IDs for (sub)vendor & device (Robin H. Johnson)
- Fix broken URL for Dell biosdevname (Naga Venkata Sai Indubhaskar Jupudi)"
* tag 'pci-v4.6-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (94 commits)
PCI: Add PCI_CLASS_SERIAL_USB_DEVICE definition
PCI: designware: Add driver for prototyping kits based on ARC SDP
PCI: designware: Add default link up check if sub-driver doesn't override
PCI: designware: Add generic dw_pcie_wait_for_link()
PCI: Cleanup pci/pcie/Kconfig whitespace
PCI: Simplify pci_create_attr() control flow
PCI: Don't leak memory if sysfs_create_bin_file() fails
PCI: Simplify sysfs ROM cleanup
PCI: Remove unused IORESOURCE_ROM_COPY and IORESOURCE_ROM_BIOS_COPY
MIPS: Loongson 3: Keep CPU physical (not virtual) addresses in shadow ROM resource
MIPS: Loongson 3: Use temporary struct resource * to avoid repetition
ia64/PCI: Keep CPU physical (not virtual) addresses in shadow ROM resource
ia64/PCI: Use ioremap() instead of open-coded equivalent
ia64/PCI: Use temporary struct resource * to avoid repetition
PCI: Clean up pci_map_rom() whitespace
PCI: Remove arch-specific IORESOURCE_ROM_SHADOW size from sysfs
PCI: thunder: Add driver for ThunderX-pass{1,2} on-chip devices
PCI: thunder: Add PCIe host driver for ThunderX processors
PCI: generic: Expose pci_host_common_probe() for use by other drivers
PCI: generic: Add pci_host_common_probe(), based on gen_pci_probe()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI updates from Rafael Wysocki:
"This time the majority of changes go into cpufreq and they are
significant.
First off, the way CPU frequency updates are triggered is different
now. Instead of having to set up and manage a deferrable timer for
each CPU in the system to evaluate and possibly change its frequency
periodically, cpufreq governors set up callbacks to be invoked by the
scheduler on a regular basis (basically on utilization updates). The
"old" governors, "ondemand" and "conservative", still do all of their
work in process context (although that is triggered by the scheduler
now), but intel_pstate does it all in the callback invoked by the
scheduler with no need for any additional asynchronous processing.
Of course, this eliminates the overhead related to the management of
all those timers, but also it allows the cpufreq governor code to be
simplified quite a bit. On top of that, the common code and data
structures used by the "ondemand" and "conservative" governors are
cleaned up and made more straightforward and some long-standing and
quite annoying problems are addressed. In particular, the handling of
governor sysfs attributes is modified and the related locking becomes
more fine grained which allows some concurrency problems to be avoided
(particularly deadlocks with the core cpufreq code).
In principle, the new mechanism for triggering frequency updates
allows utilization information to be passed from the scheduler to
cpufreq. Although the current code doesn't make use of it, in the
works is a new cpufreq governor that will make decisions based on the
scheduler's utilization data. That should allow the scheduler and
cpufreq to work more closely together in the long run.
In addition to the core and governor changes, cpufreq drivers are
updated too. Fixes and optimizations go into intel_pstate, the
cpufreq-dt driver is updated on top of some modification in the
Operating Performance Points (OPP) framework and there are fixes and
other updates in the powernv cpufreq driver.
Apart from the cpufreq updates there is some new ACPICA material,
including a fix for a problem introduced by previous ACPICA updates,
and some less significant changes in the ACPI code, like CPPC code
optimizations, ACPI processor driver cleanups and support for loading
ACPI tables from initrd.
Also updated are the generic power domains framework, the Intel RAPL
power capping driver and the turbostat utility and we have a bunch of
traditional assorted fixes and cleanups.
Specifics:
- Redesign of cpufreq governors and the intel_pstate driver to make
them use callbacks invoked by the scheduler to trigger CPU
frequency evaluation instead of using per-CPU deferrable timers for
that purpose (Rafael Wysocki).
- Reorganization and cleanup of cpufreq governor code to make it more
straightforward and fix some concurrency problems in it (Rafael
Wysocki, Viresh Kumar).
- Cleanup and improvements of locking in the cpufreq core (Viresh
Kumar).
- Assorted cleanups in the cpufreq core (Rafael Wysocki, Viresh
Kumar, Eric Biggers).
- intel_pstate driver updates including fixes, optimizations and a
modification to make it enable enable hardware-coordinated P-state
selection (HWP) by default if supported by the processor (Philippe
Longepe, Srinivas Pandruvada, Rafael Wysocki, Viresh Kumar, Felipe
Franciosi).
- Operating Performance Points (OPP) framework updates to improve its
handling of voltage regulators and device clocks and updates of the
cpufreq-dt driver on top of that (Viresh Kumar, Jon Hunter).
- Updates of the powernv cpufreq driver to fix initialization and
cleanup problems in it and correct its worker thread handling with
respect to CPU offline, new powernv_throttle tracepoint (Shilpasri
Bhat).
- ACPI cpufreq driver optimization and cleanup (Rafael Wysocki).
- ACPICA updates including one fix for a regression introduced by
previos changes in the ACPICA code (Bob Moore, Lv Zheng, David Box,
Colin Ian King).
- Support for installing ACPI tables from initrd (Lv Zheng).
- Optimizations of the ACPI CPPC code (Prashanth Prakash, Ashwin
Chaugule).
- Support for _HID(ACPI0010) devices (ACPI processor containers) and
ACPI processor driver cleanups (Sudeep Holla).
- Support for ACPI-based enumeration of the AMBA bus (Graeme Gregory,
Aleksey Makarov).
- Modification of the ACPI PCI IRQ management code to make it treat
255 in the Interrupt Line register as "not connected" on x86 (as
per the specification) and avoid attempts to use that value as a
valid interrupt vector (Chen Fan).
- ACPI APEI fixes related to resource leaks (Josh Hunt).
- Removal of modularity from a few ACPI drivers (BGRT, GHES,
intel_pmic_crc) that cannot be built as modules in practice (Paul
Gortmaker).
- PNP framework update to make it treat ACPI_RESOURCE_TYPE_SERIAL_BUS
as a valid resource type (Harb Abdulhamid).
- New device ID (future AMD I2C controller) in the ACPI driver for
AMD SoCs (APD) and in the designware I2C driver (Xiangliang Yu).
- Assorted ACPI cleanups (Colin Ian King, Kaiyen Chang, Oleg Drokin).
- cpuidle menu governor optimization to avoid a square root
computation in it (Rasmus Villemoes).
- Fix for potential use-after-free in the generic device properties
framework (Heikki Krogerus).
- Updates of the generic power domains (genpd) framework including
support for multiple power states of a domain, fixes and debugfs
output improvements (Axel Haslam, Jon Hunter, Laurent Pinchart,
Geert Uytterhoeven).
- Intel RAPL power capping driver updates to reduce IPI overhead in
it (Jacob Pan).
- System suspend/hibernation code cleanups (Eric Biggers, Saurabh
Sengar).
- Year 2038 fix for the process freezer (Abhilash Jindal).
- turbostat utility updates including new features (decoding of more
registers and CPUID fields, sub-second intervals support, GFX MHz
and RC6 printout, --out command line option), fixes (syscall jitter
detection and workaround, reductioin of the number of syscalls
made, fixes related to Xeon x200 processors, compiler warning
fixes) and cleanups (Len Brown, Hubert Chrzaniuk, Chen Yu)"
* tag 'pm+acpi-4.6-rc1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (182 commits)
tools/power turbostat: bugfix: TDP MSRs print bits fixing
tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump
tools/power turbostat: call __cpuid() instead of __get_cpuid()
tools/power turbostat: indicate SMX and SGX support
tools/power turbostat: detect and work around syscall jitter
tools/power turbostat: show GFX%rc6
tools/power turbostat: show GFXMHz
tools/power turbostat: show IRQs per CPU
tools/power turbostat: make fewer systems calls
tools/power turbostat: fix compiler warnings
tools/power turbostat: add --out option for saving output in a file
tools/power turbostat: re-name "%Busy" field to "Busy%"
tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding
tools/power turbostat: Intel Xeon x200: fix erroneous bclk value
tools/power turbostat: allow sub-sec intervals
ACPI / APEI: ERST: Fixed leaked resources in erst_init
ACPI / APEI: Fix leaked resources
intel_pstate: Do not skip samples partially
intel_pstate: Remove freq calculation from intel_pstate_calc_busy()
intel_pstate: Move intel_pstate_calc_busy() into get_target_pstate_use_performance()
...
|
|
Merge first patch-bomb from Andrew Morton:
- some misc things
- ofs2 updates
- about half of MM
- checkpatch updates
- autofs4 update
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
autofs4: fix string.h include in auto_dev-ioctl.h
autofs4: use pr_xxx() macros directly for logging
autofs4: change log print macros to not insert newline
autofs4: make autofs log prints consistent
autofs4: fix some white space errors
autofs4: fix invalid ioctl return in autofs4_root_ioctl_unlocked()
autofs4: fix coding style line length in autofs4_wait()
autofs4: fix coding style problem in autofs4_get_set_timeout()
autofs4: coding style fixes
autofs: show pipe inode in mount options
kallsyms: add support for relative offsets in kallsyms address table
kallsyms: don't overload absolute symbol type for percpu symbols
x86: kallsyms: disable absolute percpu symbols on !SMP
checkpatch: fix another left brace warning
checkpatch: improve UNSPECIFIED_INT test for bare signed/unsigned uses
checkpatch: warn on bare unsigned or signed declarations without int
checkpatch: exclude asm volatile from complex macro check
mm: memcontrol: drop unnecessary lru locking from mem_cgroup_migrate()
mm: migrate: consolidate mem_cgroup_migrate() calls
mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous
...
|
|
Pull KVM updates from Paolo Bonzini:
"One of the largest releases for KVM... Hardly any generic
changes, but lots of architecture-specific updates.
ARM:
- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- various optimizations to the vgic save/restore code.
PPC:
- enabled KVM-VFIO integration ("VFIO device")
- optimizations to speed up IPIs between vcpus
- in-kernel handling of IOMMU hypercalls
- support for dynamic DMA windows (DDW).
s390:
- provide the floating point registers via sync regs;
- separated instruction vs. data accesses
- dirty log improvements for huge guests
- bugfixes and documentation improvements.
x86:
- Hyper-V VMBus hypercall userspace exit
- alternative implementation of lowest-priority interrupts using
vector hashing (for better VT-d posted interrupt support)
- fixed guest debugging with nested virtualizations
- improved interrupt tracking in the in-kernel IOAPIC
- generic infrastructure for tracking writes to guest
memory - currently its only use is to speedup the legacy shadow
paging (pre-EPT) case, but in the future it will be used for
virtual GPUs as well
- much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (217 commits)
KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
KVM: x86: disable MPX if host did not enable MPX XSAVE features
arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
arm64: KVM: vgic-v3: Reset LRs at boot time
arm64: KVM: vgic-v3: Do not save an LR known to be empty
arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
arm64: KVM: vgic-v3: Avoid accessing ICH registers
KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit
KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit
KVM: arm/arm64: vgic-v2: Reset LRs at boot time
KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty
KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function
KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required
KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers
KVM: s390: allocate only one DMA page per VM
KVM: s390: enable STFLE interpretation only if enabled for the guest
KVM: s390: wake up when the VCPU cpu timer expires
KVM: s390: step the VCPU timer while in enabled wait
KVM: s390: protect VCPU cpu timer with a seqcount
KVM: s390: step VCPU cpu timer during kvm_run ioctl
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds
Pull LED updates from Jacek Anaszewski:
"LED core improvements:
- Fix misleading comment after workqueue removal from drivers
- Avoid error message when a USB LED device is unplugged
- Add helpers for calling brightness_set(_blocking)
LED triggers:
- Simplify led_trigger_store by using sysfs_streq()
LED class drivers improvements:
- Improve wording and formatting in a comment: lp3944
- Fix return value check in create_gpio_led(): leds-gpio
- Use GPIOF_OUT_INIT_LOW instead of hardcoded zero: leds-gpio
- Use devm_led_classdev_register(): leds-lm3533, leds-lm3533,
leds-lp8788, leds-wm831x-status, leds-s3c24xx, leds-s3c24xx,
leds-max8997.
New LED class driver:
- Add driver for the ISSI IS31FL32xx family of LED controllers.
Device Tree documentation:
- of: Add vendor prefixes for Integrated Silicon Solutions Inc.
(issi) and Si-En Technology (si-en).
- DT: Add common bindings for Si-En Technology SN3216/18 and
IS31FL32xx family of LED controllers, since they seem to be the
same hardware, just rebranded"
* tag 'leds_for_4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds:
leds: triggers: simplify led_trigger_store
leds: max8997: Use devm_led_classdev_register
leds: da903x: Use devm_led_classdev_register
leds: s3c24xx: Use devm_led_classdev_register
leds: wm831x-status: Use devm_led_classdev_register
leds: lp8788: Use devm_led_classdev_register
leds: 88pm860x: Use devm_led_classdev_register
leds: Add SN3218 and SN3216 support to the IS31FL32XX driver
of: Add vendor prefix for Si-En Technology
leds: Add driver for the ISSI IS31FL32xx family of LED controllers
DT: leds: Add binding for the ISSI IS31FL32xx family of LED controllers
DT: Add vendor prefix for Integrated Silicon Solutions Inc.
leds: lm3533: Use devm_led_classdev_register
leds: gpio: Use GPIOF_OUT_INIT_LOW instead of hardcoded zero
leds: core: add helpers for calling brightness_set(_blocking)
leds: leds-gpio: Fix return value check in create_gpio_led()
leds: lp3944: improve wording and formatting in a comment
leds: core: avoid error message when a USB LED device is unplugged
leds: core: fix misleading comment after workqueue removal from drivers
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"Core:
- New sysfs interface to set and read clock offset
- Drivers can now be both I2C and SPI (see pcf2127 and ds3232)
New drivers:
- Alphascale ASM9260
- Epson RX6110SA
- Maxim max20024 and max77620 (in max77686)
- Microchip PIC32
- NXP pcf2129 (in pcf2127)
Subsystem wide cleanups:
- remove IRQF_EARLY_RESUME when unecessary
Drivers:
- ds1307: clock output, temperature sensor and wakeup-source support
- ds1685: actually spin forever in poweroff error path
- ds3232: many cleanups
- ds3234: merged in ds3232
- hym8563: fix invalid year calculation
- max77686: many cleanups
- max77802 merged in max77686
- pcf2123: cleanups and offset support
- pcf85063: cleanups
- pcf8523: propely handle oscillator stop bit
- rv3029: many cleanups, trickle charger and temperature sensor support
- rv8803: convert spin_lock to mutex_lock
- rx8025: many fixes
- vr41xx: restore alarm_irq_enable"
* tag 'rtc-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (86 commits)
rtc: pcf2127: add pcf2129 device id
rtc: pcf2127: add support for spi interface
rtc: pcf2127: convert to use regmap
rtc: rv3029: Add thermometer hwmon support
rtc: rv3029: Add update_bits helper for eeprom access
rtc: ds1685: actually spin forever in poweroff error path
rtc: hym8563: fix invalid year calculation
rtc: ds3232: use rtc->ops_lock to protect alarm operations
rtc: ds3232: fix issue when irq is shared several devices
rtc: ds3232: remove unused UIE code
rtc: ds3232: add register access error checks
rtc: ds3232: fix read on /dev/rtc after RTC_AIE_ON
rtc: merge ds3232 and ds3234
rtc: ds3232: convert to use regmap
rtc: pxa: fix Kconfig indentation
rtc: rv3029: Add device tree property for trickle charger
rtc: rv3029: Add functions for EEPROM access
rtc: rv3029: Add i2c register update-bits helper
rtc: rv3029: Add missing register definitions
rtc: rv3029: Add "rv3029" I2C device id
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon updates from Guenter Roeck:
- New drivers for NSA320 and LTC2990
- Added support for ADM1278 to adm1275 driver
- Added support for ncpXXxh103 to ntc_thermistor driver
- Renamed vexpress hwmon implementation
- Minor cleanups and improvements
* tag 'hwmon-for-linus-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: Create an NSA320 hardware monitoring driver
hwmon: Define binding for the nsa320-hwmon driver
hwmon: (adm1275) Add support for ADM1278
hwmon: (ntc_thermistor) Add support for ncpXXxh103
Doc: hwmon: Fix typo "montoring" in hwmon
ARM: dts: vfxxx: Add iio_hwmon node for ADC temperature channel
ARM: dts: Change iio_hwmon nodes to use hypen in node names
hwmon: (iio_hwmon) Allow the driver to accept hypen in device tree node names
hwmon: Add LTC2990 sensor driver
hwmon: (vexpress) rename vexpress hwmon implementation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator updates from Mark Brown:
"This has been an extremely quiet release for the regulator API, aside
from bugfixes and small enhancements the only thing that really stands
out are the new drivers for Action Semiconductors ACT8945A, HiSilicon
HI665x, and the Maxim MAX20024 and MAX77620"
* tag 'regulator-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator: (46 commits)
regulator: pwm: Add support to have multiple instance of pwm regulator
regulator: pwm: Fix calculation of voltage-to-duty cycle
regulator: of: Use of_property_read_u32() for reading min/max
regulator: pv88060: fix incorrect clear of event register
regulator: pv88090: fix incorrect clear of event register
regulator: max77620: Add support to configure active-discharge
regulator: core: Add support for active-discharge configuration
regulator: helper: Add helper to configure active-discharge using regmap
regulator: core: Add support for active-discharge configuration
regulator: DT: Add DT property for active-discharge configuration
regulator: act8865: Specify fixed voltage of 3.3V for ACT8600's REG9
regulator: act8865: Rename platform_data field to init_data
regulator: act8865: Remove "static" from local variable
ASoC: cs4271: add regulator consumer support
regulator: max77620: Remove duplicate module alias
regulator: max77620: Eliminate duplicate code
regulator: max77620: Remove unused fields
regulator: core: fix crash in error path of regulator_register
regulator: core: Request GPIO before creating sysfs entries
regulator: gpio: don't print error on EPROBE_DEFER
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap updates from Mark Brown:
"This has been a very busy release for regmap, not just in cleaning up
the mess we got ourselves into with the endianness handling but also
in other areas too:
- Fixes for the endianness handling so that we now explicitly default
to little endian (the code used to do this by accident). This
fixes handling of explictly specified endianness on big endian
systems.
- Optimisation of the implementation of register striding.
- A refectoring of the _update_bits() code to reduce duplication.
- Fixes and enhancements for the interrupt implementation which make
it easier to use in a wider range of systems"
* tag 'regmap-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (28 commits)
regmap: irq: add devm apis for regmap_{add,del}_irq_chip
regmap: replace regmap_write_bits()
regmap: irq: Enable irq retriggering for nested irqs
regmap: add regmap_fields_force_xxx() macros
regmap: add regmap_field_force_xxx() macros
regmap: merge regmap_fields_update_bits() into macro
regmap: merge regmap_fields_write() into macro
regmap: add regmap_fields_update_bits_base()
regmap: merge regmap_field_update_bits() into macro
regmap: merge regmap_field_write() into macro
regmap: add regmap_field_update_bits_base()
regmap: merge regmap_update_bits_check_async() into macro
regmap: merge regmap_update_bits_check() into macro
regmap: merge regmap_update_bits_async() into macro
regmap: merge regmap_update_bits() into macro
regmap: add regmap_update_bits_base()
regcache: flat: Introduce register strider order
regcache: Introduce the index parsing API by stride order
regmap: core: Introduce register stride order
regmap: irq: add devm apis for regmap_{add,del}_irq_chip
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi updates from Mark Brown:
"Not the biggest set of changes for SPI but a bit of a pickup in
activity on the core:
- Support for memory mapped read from flash devices via a SPI
controller.
- The beginnings of a message rewriting framework in the core which
should in time allow us to support transforming messages to work
around the limits of controllers or optimise the performance for
controllers transparently to calling drivers.
- Updates to the PXA2xx, the main functional change being to improve
the ACPI support.
- A new driver for the Analog Devices AXI SPI engine"
* tag 'spi-v4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi: (66 commits)
spi: Add gfp parameter to kernel-doc to fix build warning
spi: Fix htmldocs build error due struct spi_replaced_transfers
spi: rockchip: covert rsd_nsecs to u32 type
spi: rockchip: header file cleanup
spi: xilinx: Add devicetree binding for spi-xilinx
spi: respect the maximum segment size of DMA device
spi: rockchip: check requesting dma channel with EPROBE_DEFER
spi: rockchip: migrate to dmaengine_terminate_async
spi: rockchip: check return value of dmaengine_prep_slave_sg
spi: core: Fix deadlock when sending messages
spi/rockchip: fix endian mode for 16-bit transfers
spi/rockchip: Make sure spi clk is on in rockchip_spi_set_cs
spi: pxa2xx: Use newer more explicit DMAengine terminate API
spi: pxa2xx: Add support for Intel Broxton B-Step
spi: lp-8841: return correct error code from probe
spi: imx: drop bogus tests for rx/tx bufs in DMA transfer
spi: imx: set MX51_ECSPI_CTRL_SMC bit in setup function
spi: imx: make some register defines simpler
spi: imx: remove unnecessary bit clearing in mx51_ecspi_config
spi: imx: add support for all SPI word width for DMA
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control updates from Linus Walleij:
"An almost purely driver related set of changes with no major changes
to the framework, only one patch adding an unlocked version of the
pinctrl_find_gpio_range_from_pin() library call.
New drivers:
- ST Microelectronics STM32 MCU support: this is a non-MMU low-end
platform for IoT things (etc).
- Microchip PIC32 MCU support: same story as for STM32.
New subdrivers:
- Allwinner SunXi H3 R_PIO controller support.
- Qualcomm IPQ4019 support.
- MediaTek MT2701 and MT7623.
- Allwinner A64
Non-critical fixes:
- gpio_disable_free() for the Vybrid.
- pinctrl single: use a separate lockdep class.
Misc:
- Substantial cleanups and rewrites for the Super-H PFC driver and
subdrivers.
- Various fixes and cleanups, especially Paul Gortmakers work to make
nonmodular drivers nonmodular"
* tag 'pinctrl-v4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl: (75 commits)
pinctrl: single: Use a separate lockdep class
drivers: pinctrl: add driver for Allwinner A64 SoC
pinctrl: Broadcom Northstar2 pinctrl device tree bindings
pinctrl: amlogic: Make driver independent from two-domain configuration
pinctrl: amlogic: Separate some pin functions for Meson8 / Meson8b
pinctrl: at91: use __maybe_unused to hide pm functions
pinctrl: sh-pfc: core: don't open code of_device_get_match_data()
pinctrl: uniphier: rename CONFIG options and file names
pinctrl: sunxi: make A80 explicitly non-modular
pinctrl: stm32: make explicitly non-modular
pinctrl: sh-pfc: make explicitly non-modular
pinctrl: meson: make explicitly non-modular
pinctrl: pinctrl-mt6397 driver explicitly non-modular
pinctrl: sunxi: does not need module.h
pinctrl: pxa2xx: export symbols
pinctrl: sunxi: Change mux setting on PI irq pins
pinctrl: sunxi: Remove non existing irq's
pinctrl: imx: attach iomuxc device to gpr syscon
pinctrl-bcm2835: Fix cut-and-paste error in "pull" parsing
pinctrl: lpc1850-scu: document nxp,gpio-pin-interrupt
...
|
|
Since including linux/string.h will now do the right thing remove the
conditional check.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix some white space format errors.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Try and make the coding style completely consistent throughtout the
autofs module and inline with kernel coding style recommendations.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is a performance drop report due to hugepage allocation and in
there half of cpu time are spent on pageblock_pfn_to_page() in
compaction [1].
In that workload, compaction is triggered to make hugepage but most of
pageblocks are un-available for compaction due to pageblock type and
skip bit so compaction usually fails. Most costly operations in this
case is to find valid pageblock while scanning whole zone range. To
check if pageblock is valid to compact, valid pfn within pageblock is
required and we can obtain it by calling pageblock_pfn_to_page(). This
function checks whether pageblock is in a single zone and return valid
pfn if possible. Problem is that we need to check it every time before
scanning pageblock even if we re-visit it and this turns out to be very
expensive in this workload.
Although we have no way to skip this pageblock check in the system where
hole exists at arbitrary position, we can use cached value for zone
continuity and just do pfn_to_page() in the system where hole doesn't
exist. This optimization considerably speeds up in above workload.
Before vs After
Max: 1096 MB/s vs 1325 MB/s
Min: 635 MB/s 1015 MB/s
Avg: 899 MB/s 1194 MB/s
Avg is improved by roughly 30% [2].
[1]: http://www.spinics.net/lists/linux-mm/msg97378.html
[2]: https://lkml.org/lkml/2015/12/9/23
[akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Aaron Lu <aaron.lu@intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There are several users that nest lock_page_memcg() inside lock_page()
to prevent page->mem_cgroup from changing. But the page lock prevents
pages from moving between cgroups, so that is unnecessary overhead.
Remove lock_page_memcg() in contexts with locked contexts and fix the
debug code in the page stat functions to be okay with the page lock.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Now that migration doesn't clear page->mem_cgroup of live pages anymore,
it's safe to make lock_page_memcg() and the memcg stat functions take
pages, and spare the callers from memcg objects.
[akpm@linux-foundation.org: fix warnings]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Changing a page's memcg association complicates dealing with the page,
so we want to limit this as much as possible. Page migration e.g. does
not have to do that. Just like page cache replacement, it can forcibly
charge a replacement page, and then uncharge the old page when it gets
freed. Temporarily overcharging the cgroup by a single page is not an
issue in practice, and charging is so cheap nowadays that this is much
preferrable to the headache of messing with live pages.
The only place that still changes the page->mem_cgroup binding of live
pages is when pages move along with a task to another cgroup. But that
path isolates the page from the LRU, takes the page lock, and the move
lock (lock_page_memcg()). That means page->mem_cgroup is always stable
in callers that have the page isolated from the LRU or locked. Lighter
unlocked paths, like writeback accounting, can use lock_page_memcg().
[akpm@linux-foundation.org: fix build]
[vdavydov@virtuozzo.com: fix lockdep splat]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Cache thrash detection (see a528910e12ec "mm: thrash detection-based
file cache sizing" for details) currently only works on the system
level, not inside cgroups. Worse, as the refaults are compared to the
global number of active cache, cgroups might wrongfully get all their
refaults activated when their pages are hotter than those of others.
Move the refault machinery from the zone to the lruvec, and then tag
eviction entries with the memcg ID. This makes the thrash detection
work correctly inside cgroups.
[sergey.senozhatsky@gmail.com: do not return from workingset_activation() with locked rcu and page]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
These patches tag the page cache radix tree eviction entries with the
memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
work properly and be as adaptive to new cache workingsets as global
reclaim already is.
This should have been part of the original thrash detection patch
series, but was deferred due to the complexity of those patches.
This patch (of 5):
So far the only sites that needed to exclude charge migration to
stabilize page->mem_cgroup have been per-cgroup page statistics, hence
the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection
will add another site that needs to ensure page->mem_cgroup lifetime.
Rename these locking functions to the more generic lock_page_memcg() and
unlock_page_memcg(). Since charge migration is a cgroup1 feature only,
we might be able to delete it at some point, and these now easy to
identify locking sites along with it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, all newly added memory blocks remain in 'offline' state
unless someone onlines them, some linux distributions carry special udev
rules like:
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
to make this happen automatically. This is not a great solution for
virtual machines where memory hotplug is being used to address high
memory pressure situations as such onlining is slow and a userspace
process doing this (udev) has a chance of being killed by the OOM killer
as it will probably require to allocate some memory.
Introduce default policy for the newly added memory blocks in
/sys/devices/system/memory/auto_online_blocks file with two possible
values: "offline" which preserves the current behavior and "online"
which causes all newly added memory blocks to go online as soon as
they're added. The default is "offline".
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
By default, page poisoning uses a poison value (0xaa) on free. If this
is changed to 0, the page is not only sanitized but zeroing on alloc
with __GFP_ZERO can be skipped as well. The tradeoff is that detecting
corruption from the poisoning is harder to detect. This feature also
cannot be used with hibernation since pages are not guaranteed to be
zeroed after hibernation.
Credit to Grsecurity/PaX team for inspiring this work
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Acked-by: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Page poisoning is currently set up as a feature if architectures don't
have architecture debug page_alloc to allow unmapping of pages. It has
uses apart from that though. Clearing of the pages on free provides an
increase in security as it helps to limit the risk of information leaks.
Allow page poisoning to be enabled as a separate option independent of
kernel_map pages since the two features do separate work. Because of
how hiberanation is implemented, the checks on alloc cannot occur if
hibernation is enabled. The runtime alloc checks can also be enabled
with an option when !HIBERNATION.
Credit to Grsecurity/PaX team for inspiring this work
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since bad_page() is the only user of the badflags parameter of
dump_page_badflags(), we can move the code to bad_page() and simplify a
bit.
The dump_page_badflags() function is renamed to __dump_page() and can
still be called separately from dump_page() for temporary debug prints
where page_owner info is not desired.
The only user-visible change is that page->mem_cgroup is printed before
the bad flags.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The page_owner mechanism is useful for dealing with memory leaks. By
reading /sys/kernel/debug/page_owner one can determine the stack traces
leading to allocations of all pages, and find e.g. a buggy driver.
This information might be also potentially useful for debugging, such as
the VM_BUG_ON_PAGE() calls to dump_page(). So let's print the stored
info from dump_page().
Example output:
page:ffffea000292f1c0 count:1 mapcount:0 mapping:ffff8800b2f6cc18 index:0x91d
flags: 0x1fffff8001002c(referenced|uptodate|lru|mappedtodisk)
page dumped because: VM_BUG_ON_PAGE(1)
page->mem_cgroup:ffff8801392c5000
page allocated via order 0, migratetype Movable, gfp_mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
[<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
[<ffffffff811b40c8>] alloc_pages_current+0x88/0x120
[<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120
[<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240
[<ffffffff8116bd05>] ondemand_readahead+0x135/0x260
[<ffffffff8116be9c>] page_cache_async_readahead+0x6c/0x70
[<ffffffff811604c2>] generic_file_read_iter+0x3f2/0x760
[<ffffffff811e0dc7>] __vfs_read+0xa7/0xd0
page has been migrated, last migrate reason: compaction
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
During migration, page_owner info is now copied with the rest of the
page, so the stacktrace leading to free page allocation during migration
is overwritten. For debugging purposes, it might be however useful to
know that the page has been migrated since its initial allocation. This
might happen many times during the lifetime for different reasons and
fully tracking this, especially with stacktraces would incur extra
memory costs. As a compromise, store and print the migrate_reason of
the last migration that occurred to the page. This is enough to
distinguish compaction, numa balancing etc.
Example page_owner entry after the patch:
Page allocated via order 0, mask 0x24200ca(GFP_HIGHUSER_MOVABLE)
PFN 628753 type Movable Block 1228 type Movable Flags 0x1fffff80040030(dirty|lru|swapbacked)
[<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
[<ffffffff811b6325>] alloc_pages_vma+0xb5/0x250
[<ffffffff81177491>] shmem_alloc_page+0x61/0x90
[<ffffffff8117a438>] shmem_getpage_gfp+0x678/0x960
[<ffffffff8117c2b9>] shmem_fallocate+0x329/0x440
[<ffffffff811de600>] vfs_fallocate+0x140/0x230
[<ffffffff811df434>] SyS_fallocate+0x44/0x70
[<ffffffff8158cc2e>] entry_SYSCALL_64_fastpath+0x12/0x71
Page has been migrated, last migrate reason: compaction
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The page_owner mechanism stores gfp_flags of an allocation and stack
trace that lead to it. During page migration, the original information
is practically replaced by the allocation of free page as the migration
target. Arguably this is less useful and might lead to all the
page_owner info for migratable pages gradually converge towards
compaction or numa balancing migrations. It has also lead to
inaccuracies such as one fixed by commit e2cfc91120fa ("mm/page_owner:
set correct gfp_mask on page_owner").
This patch thus introduces copying the page_owner info during migration.
However, since the fact that the page has been migrated from its
original place might be useful for debugging, the next patch will
introduce a way to track that information as well.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
CONFIG_PAGE_OWNER attempts to impose negligible runtime overhead when
enabled during compilation, but not actually enabled during runtime by
boot param page_owner=on. This overhead can be further reduced using
the static key mechanism, which this patch does.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The information in /sys/kernel/debug/page_owner includes the migratetype
of the pageblock the page belongs to. This is also checked against the
page's migratetype (as declared by gfp_flags during its allocation), and
the page is reported as Fallback if its migratetype differs from the
pageblock's one. t This is somewhat misleading because in fact fallback
allocation is not the only reason why these two can differ. It also
doesn't direcly provide the page's migratetype, although it's possible
to derive that from the gfp_flags.
It's arguably better to print both page and pageblock's migratetype and
leave the interpretation to the consumer than to suggest fallback
allocation as the only possible reason. While at it, we can print the
migratetypes as string the same way as /proc/pagetypeinfo does, as some
of the numeric values depend on kernel configuration. For that, this
patch moves the migratetype_names array from #ifdef CONFIG_PROC_FS part
of mm/vmstat.c to mm/page_alloc.c and exports it.
With the new format strings for flags, we can now also provide symbolic
page and gfp flags in the /sys/kernel/debug/page_owner file. This
replaces the positional printing of page flags as single letters, which
might have looked nicer, but was limited to a subset of flags, and
required the user to remember the letters.
Example page_owner entry after the patch:
Page allocated via order 0, mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
PFN 520 type Movable Block 1 type Movable Flags 0xfffff8001006c(referenced|uptodate|lru|active|mappedtodisk)
[<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
[<ffffffff811b4058>] alloc_pages_current+0x88/0x120
[<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120
[<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240
[<ffffffff8116bd05>] ondemand_readahead+0x135/0x260
[<ffffffff8116bfb1>] page_cache_sync_readahead+0x31/0x50
[<ffffffff81160523>] generic_file_read_iter+0x453/0x760
[<ffffffff811e0d57>] __vfs_read+0xa7/0xd0
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In tracepoints, it's possible to print gfp flags in a human-friendly
format through a macro show_gfp_flags(), which defines a translation
array and passes is to __print_flags(). Since the following patch will
introduce support for gfp flags printing in printk(), it would be nice
to reuse the array. This is not straightforward, since __print_flags()
can't simply reference an array defined in a .c file such as mm/debug.c
- it has to be a macro to allow the macro magic to communicate the
format to userspace tools such as trace-cmd.
The solution is to create a macro __def_gfpflag_names which is used both
in show_gfp_flags(), and to define the gfpflag_names[] array in
mm/debug.c.
On the other hand, mm/debug.c also defines translation tables for page
flags and vma flags, and desire was expressed (but not implemented in
this series) to use these also from tracepoints. Thus, this patch also
renames the events/gfpflags.h file to events/mmflags.h and moves the
table definitions there, using the same macro approach as for gfpflags.
This allows translating all three kinds of mm-specific flags both in
tracepoints and printk.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When updating tracing's show_gfp_flags() I have noticed that perf's
gfp_compact_table is also outdated. Fill in the missing flags and place
a note in gfp.h to increase chance that future updates are synced.
Convert the __GFP_X flags from "GFP_X" to "__GFP_X" strings in line with
the previous patch.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The show_gfp_flags() macro provides human-friendly printing of gfp flags
in tracepoints. However, it is somewhat out of date and missing several
flags. This patches fills in the missing flags, and distinguishes
properly between GFP_ATOMIC and __GFP_ATOMIC which were both translated
to "GFP_ATOMIC". More generally, all __GFP_X flags which were
previously printed as GFP_X, are now printed as __GFP_X, since ommiting
the underscores results in output that doesn't actually match the source
code, and can only lead to confusion. Where both variants are defined
equal (e.g. _DMA and _DMA32), the variant without underscores are
preferred.
Also add a note in gfp.h so hopefully future changes will be synced
better.
__GFP_MOVABLE is defined twice in include/linux/gfp.h with different
comments. Leave just the newer one, which was intended to replace the
old one.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The following patch will need to declare array of struct
trace_print_flags in a header. To prevent this header from pulling in
all of RCU through trace_events.h, move the struct
trace_print_flags{_64} definitions to the new lightweight
tracepoint-defs.h header.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
SLUB already has a redzone debugging feature. But it is only positioned
at the end of object (aka right redzone) so it cannot catch left oob.
Although current object's right redzone acts as left redzone of next
object, first object in a slab cannot take advantage of this effect.
This patch explicitly adds a left red zone to each object to detect left
oob more precisely.
Background:
Someone complained to me that left OOB doesn't catch even if KASAN is
enabled which does page allocation debugging. That page is out of our
control so it would be allocated when left OOB happens and, in this
case, we can't find OOB. Moreover, SLUB debugging feature can be
enabled without page allocator debugging and, in this case, we will miss
that OOB.
Before trying to implement, I expected that changes would be too
complex, but, it doesn't look that complex to me now. Almost changes
are applied to debug specific functions so I feel okay.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
SLAB_DEBUG_FREE allows expensive consistency checks at free to be turned
on or off. Expand its use to be able to turn off all consistency
checks. This gives a nice speed up if you only want features such as
poisoning or tracing.
Credit to Mathias Krause for the original work which inspired this
series
Signed-off-by: Laura Abbott <labbott@fedoraproject.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
DEBUG_SLAB_LEAK is a debug option. It's current implementation requires
status buffer so we need more memory to use it. And, it cause
kmem_cache initialization step more complex.
To remove this extra memory usage and to simplify initialization step,
this patch implement this feature with another way.
When user requests to get slab object owner information, it marks that
getting information is started. And then, all free objects in caches
are flushed to corresponding slab page. Now, we can distinguish all
freed object so we can know all allocated objects, too. After
collecting slab object owner information on allocated objects, mark is
checked that there is no free during the processing. If true, we can be
sure that our information is correct so information is returned to user.
Although this way is rather complex, it has two important benefits
mentioned above. So, I think it is worth changing.
There is one drawback that it takes more time to get slab object owner
information but it is just a debug option so it doesn't matter at all.
To help review, this patch implements new way only. Following patch
will remove useless code.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, open code for checking DEBUG_PAGEALLOC cache is spread to
some sites. It makes code unreadable and hard to change.
This patch cleans up this code. The following patch will change the
criteria for DEBUG_PAGEALLOC cache so this clean-up will help it, too.
[akpm@linux-foundation.org: fix build with CONFIG_DEBUG_PAGEALLOC=n]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix up trivial spelling errors, noticed while reading the code.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch introduce a new API call kfree_bulk() for bulk freeing memory
objects not bound to a single kmem_cache.
Christoph pointed out that it is possible to implement freeing of
objects, without knowing the kmem_cache pointer as that information is
available from the object's page->slab_cache. Proposing to remove the
kmem_cache argument from the bulk free API.
Jesper demonstrated that these extra steps per object comes at a
performance cost. It is only in the case CONFIG_MEMCG_KMEM is compiled
in and activated runtime that these steps are done anyhow. The extra
cost is most visible for SLAB allocator, because the SLUB allocator does
the page lookup (virt_to_head_page()) anyhow.
Thus, the conclusion was to keep the kmem_cache free bulk API with a
kmem_cache pointer, but we can still implement a kfree_bulk() API fairly
easily. Simply by handling if kmem_cache_free_bulk() gets called with a
kmem_cache NULL pointer.
This does increase the code size a bit, but implementing a separate
kfree_bulk() call would likely increase code size even more.
Below benchmarks cost of alloc+free (obj size 256 bytes) on CPU i7-4790K
@ 4.00GHz, no PREEMPT and CONFIG_MEMCG_KMEM=y.
Code size increase for SLAB:
add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
function old new delta
kmem_cache_free_bulk 660 734 +74
SLAB fastpath: 87 cycles(tsc) 21.814
sz - fallback - kmem_cache_free_bulk - kfree_bulk
1 - 103 cycles 25.878 ns - 41 cycles 10.498 ns - 81 cycles 20.312 ns
2 - 94 cycles 23.673 ns - 26 cycles 6.682 ns - 42 cycles 10.649 ns
3 - 92 cycles 23.181 ns - 21 cycles 5.325 ns - 39 cycles 9.950 ns
4 - 90 cycles 22.727 ns - 18 cycles 4.673 ns - 26 cycles 6.693 ns
8 - 89 cycles 22.270 ns - 14 cycles 3.664 ns - 23 cycles 5.835 ns
16 - 88 cycles 22.038 ns - 14 cycles 3.503 ns - 22 cycles 5.543 ns
30 - 89 cycles 22.284 ns - 13 cycles 3.310 ns - 20 cycles 5.197 ns
32 - 88 cycles 22.249 ns - 13 cycles 3.420 ns - 20 cycles 5.166 ns
34 - 88 cycles 22.224 ns - 14 cycles 3.643 ns - 20 cycles 5.170 ns
48 - 88 cycles 22.088 ns - 14 cycles 3.507 ns - 20 cycles 5.203 ns
64 - 88 cycles 22.063 ns - 13 cycles 3.428 ns - 20 cycles 5.152 ns
128 - 89 cycles 22.483 ns - 15 cycles 3.891 ns - 23 cycles 5.885 ns
158 - 89 cycles 22.381 ns - 15 cycles 3.779 ns - 22 cycles 5.548 ns
250 - 91 cycles 22.798 ns - 16 cycles 4.152 ns - 23 cycles 5.967 ns
SLAB when enabling MEMCG_KMEM runtime:
- kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0)
1 - 148 cycles 37.220 ns - 66 cycles 16.622 ns - 66 cycles 16.583 ns
2 - 141 cycles 35.510 ns - 51 cycles 12.820 ns - 58 cycles 14.625 ns
3 - 140 cycles 35.017 ns - 37 cycles 9.326 ns - 33 cycles 8.474 ns
4 - 137 cycles 34.507 ns - 31 cycles 7.888 ns - 33 cycles 8.300 ns
8 - 140 cycles 35.069 ns - 25 cycles 6.461 ns - 25 cycles 6.436 ns
16 - 138 cycles 34.542 ns - 23 cycles 5.945 ns - 22 cycles 5.670 ns
30 - 136 cycles 34.227 ns - 22 cycles 5.502 ns - 22 cycles 5.587 ns
32 - 136 cycles 34.253 ns - 21 cycles 5.475 ns - 21 cycles 5.324 ns
34 - 136 cycles 34.254 ns - 21 cycles 5.448 ns - 20 cycles 5.194 ns
48 - 136 cycles 34.075 ns - 21 cycles 5.458 ns - 21 cycles 5.367 ns
64 - 135 cycles 33.994 ns - 21 cycles 5.350 ns - 21 cycles 5.259 ns
128 - 137 cycles 34.446 ns - 23 cycles 5.816 ns - 22 cycles 5.688 ns
158 - 137 cycles 34.379 ns - 22 cycles 5.727 ns - 22 cycles 5.602 ns
250 - 138 cycles 34.755 ns - 24 cycles 6.093 ns - 23 cycles 5.986 ns
Code size increase for SLUB:
function old new delta
kmem_cache_free_bulk 717 799 +82
SLUB benchmark:
SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0)
sz - fallback - kmem_cache_free_bulk - kfree_bulk
1 - 61 cycles 15.486 ns - 53 cycles 13.364 ns - 57 cycles 14.464 ns
2 - 54 cycles 13.703 ns - 32 cycles 8.110 ns - 33 cycles 8.482 ns
3 - 53 cycles 13.272 ns - 25 cycles 6.362 ns - 27 cycles 6.947 ns
4 - 51 cycles 12.994 ns - 24 cycles 6.087 ns - 24 cycles 6.078 ns
8 - 50 cycles 12.576 ns - 21 cycles 5.354 ns - 22 cycles 5.513 ns
16 - 49 cycles 12.368 ns - 20 cycles 5.054 ns - 20 cycles 5.042 ns
30 - 49 cycles 12.273 ns - 18 cycles 4.748 ns - 19 cycles 4.758 ns
32 - 49 cycles 12.401 ns - 19 cycles 4.821 ns - 19 cycles 4.810 ns
34 - 98 cycles 24.519 ns - 24 cycles 6.154 ns - 24 cycles 6.157 ns
48 - 83 cycles 20.833 ns - 21 cycles 5.446 ns - 21 cycles 5.429 ns
64 - 75 cycles 18.891 ns - 20 cycles 5.247 ns - 20 cycles 5.238 ns
128 - 93 cycles 23.271 ns - 27 cycles 6.856 ns - 27 cycles 6.823 ns
158 - 102 cycles 25.581 ns - 30 cycles 7.714 ns - 30 cycles 7.695 ns
250 - 107 cycles 26.917 ns - 38 cycles 9.514 ns - 38 cycles 9.506 ns
SLUB when enabling MEMCG_KMEM runtime:
- kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0)
1 - 85 cycles 21.484 ns - 78 cycles 19.569 ns - 75 cycles 18.938 ns
2 - 81 cycles 20.363 ns - 45 cycles 11.258 ns - 44 cycles 11.076 ns
3 - 78 cycles 19.709 ns - 33 cycles 8.354 ns - 32 cycles 8.044 ns
4 - 77 cycles 19.430 ns - 28 cycles 7.216 ns - 28 cycles 7.003 ns
8 - 101 cycles 25.288 ns - 23 cycles 5.849 ns - 23 cycles 5.787 ns
16 - 76 cycles 19.148 ns - 20 cycles 5.162 ns - 20 cycles 5.081 ns
30 - 76 cycles 19.067 ns - 19 cycles 4.868 ns - 19 cycles 4.821 ns
32 - 76 cycles 19.052 ns - 19 cycles 4.857 ns - 19 cycles 4.815 ns
34 - 121 cycles 30.291 ns - 25 cycles 6.333 ns - 25 cycles 6.268 ns
48 - 108 cycles 27.111 ns - 21 cycles 5.498 ns - 21 cycles 5.458 ns
64 - 100 cycles 25.164 ns - 20 cycles 5.242 ns - 20 cycles 5.229 ns
128 - 155 cycles 38.976 ns - 27 cycles 6.886 ns - 27 cycles 6.892 ns
158 - 132 cycles 33.034 ns - 30 cycles 7.711 ns - 30 cycles 7.728 ns
250 - 130 cycles 32.612 ns - 38 cycles 9.560 ns - 38 cycles 9.549 ns
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Remove the SLAB specific function slab_should_failslab(), by moving the
check against fault-injection for the bootstrap slab, into the shared
function should_failslab() (used by both SLAB and SLUB).
This is a step towards sharing alloc_hook's between SLUB and SLAB.
This bootstrap slab "kmem_cache" is used for allocating struct
kmem_cache objects to the allocator itself.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull cpu hotplug updates from Thomas Gleixner:
"This is the first part of the ongoing cpu hotplug rework:
- Initial implementation of the state machine
- Runs all online and prepare down callbacks on the plugged cpu and
not on some random processor
- Replaces busy loop waiting with completions
- Adds tracepoints so the states can be followed"
More detailed commentary on this work from an earlier email:
"What's wrong with the current cpu hotplug infrastructure?
- Asymmetry
The hotplug notifier mechanism is asymmetric versus the bringup and
teardown. This is mostly caused by the notifier mechanism.
- Largely undocumented dependencies
While some notifiers use explicitely defined notifier priorities,
we have quite some notifiers which use numerical priorities to
express dependencies without any documentation why.
- Control processor driven
Most of the bringup/teardown of a cpu is driven by a control
processor. While it is understandable, that preperatory steps,
like idle thread creation, memory allocation for and initialization
of essential facilities needs to be done before a cpu can boot,
there is no reason why everything else must run on a control
processor. Before this patch series, bringup looks like this:
Control CPU Booting CPU
do preparatory steps
kick cpu into life
do low level init
sync with booting cpu sync with control cpu
bring the rest up
- All or nothing approach
There is no way to do partial bringups. That's something which is
really desired because we waste e.g. at boot substantial amount of
time just busy waiting that the cpu comes to life. That's stupid
as we could very well do preparatory steps and the initial IPI for
other cpus and then go back and do the necessary low level
synchronization with the freshly booted cpu.
- Minimal debuggability
Due to the notifier based design, it's impossible to switch between
two stages of the bringup/teardown back and forth in order to test
the correctness. So in many hotplug notifiers the cancel
mechanisms are either not existant or completely untested.
- Notifier [un]registering is tedious
To [un]register notifiers we need to protect against hotplug at
every callsite. There is no mechanism that bringup/teardown
callbacks are issued on the online cpus, so every caller needs to
do it itself. That also includes error rollback.
What's the new design?
The base of the new design is a symmetric state machine, where both
the control processor and the booting/dying cpu execute a well
defined set of states. Each state is symmetric in the end, except
for some well defined exceptions, and the bringup/teardown can be
stopped and reversed at almost all states.
So the bringup of a cpu will look like this in the future:
Control CPU Booting CPU
do preparatory steps
kick cpu into life
do low level init
sync with booting cpu sync with control cpu
bring itself up
The synchronization step does not require the control cpu to wait.
That mechanism can be done asynchronously via a worker or some
other mechanism.
The teardown can be made very similar, so that the dying cpu cleans
up and brings itself down. Cleanups which need to be done after
the cpu is gone, can be scheduled asynchronously as well.
There is a long way to this, as we need to refactor the notion when a
cpu is available. Today we set the cpu online right after it comes
out of the low level bringup, which is not really correct.
The proper mechanism is to set it to available, i.e. cpu local
threads, like softirqd, hotplug thread etc. can be scheduled on that
cpu, and once it finished all booting steps, it's set to online, so
general workloads can be scheduled on it. The reverse happens on
teardown. First thing to do is to forbid scheduling of general
workloads, then teardown all the per cpu resources and finally shut it
off completely.
This patch series implements the basic infrastructure for this at the
core level. This includes the following:
- Basic state machine implementation with well defined states, so
ordering and prioritization can be expressed.
- Interfaces to [un]register state callbacks
This invokes the bringup/teardown callback on all online cpus with
the proper protection in place and [un]installs the callbacks in
the state machine array.
For callbacks which have no particular ordering requirement we have
a dynamic state space, so that drivers don't have to register an
explicit hotplug state.
If a callback fails, the code automatically does a rollback to the
previous state.
- Sysfs interface to drive the state machine to a particular step.
This is only partially functional today. Full functionality and
therefor testability will be achieved once we converted all
existing hotplug notifiers over to the new scheme.
- Run all CPU_ONLINE/DOWN_PREPARE notifiers on the booting/dying
processor:
Control CPU Booting CPU
do preparatory steps
kick cpu into life
do low level init
sync with booting cpu sync with control cpu
wait for boot
bring itself up
Signal completion to control cpu
In a previous step of this work we've done a full tree mechanical
conversion of all hotplug notifiers to the new scheme. The balance
is a net removal of about 4000 lines of code.
This is not included in this series, as we decided to take a
different approach. Instead of mechanically converting everything
over, we will do a proper overhaul of the usage sites one by one so
they nicely fit into the symmetric callback scheme.
I decided to do that after I looked at the ugliness of some of the
converted sites and figured out that their hotplug mechanism is
completely buggered anyway. So there is no point to do a
mechanical conversion first as we need to go through the usage
sites one by one again in order to achieve a full symmetric and
testable behaviour"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
cpu/hotplug: Document states better
cpu/hotplug: Fix smpboot thread ordering
cpu/hotplug: Remove redundant state check
cpu/hotplug: Plug death reporting race
rcu: Make CPU_DYING_IDLE an explicit call
cpu/hotplug: Make wait for dead cpu completion based
cpu/hotplug: Let upcoming cpu bring itself fully up
arch/hotplug: Call into idle with a proper state
cpu/hotplug: Move online calls to hotplugged cpu
cpu/hotplug: Create hotplug threads
cpu/hotplug: Split out the state walk into functions
cpu/hotplug: Unpark smpboot threads from the state machine
cpu/hotplug: Move scheduler cpu_online notifier to hotplug core
cpu/hotplug: Implement setup/removal interface
cpu/hotplug: Make target state writeable
cpu/hotplug: Add sysfs state interface
cpu/hotplug: Hand in target state to _cpu_up/down
cpu/hotplug: Convert the hotplugged cpu work to a state machine
cpu/hotplug: Convert to a state machine for the control processor
cpu/hotplug: Add tracepoints
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"The 4.6 pile of irq updates contains:
- Support for IPI irqdomains to support proper integration of IPIs to
and from coprocessors. The first user of this new facility is
MIPS. The relevant MIPS patches come with the core to avoid merge
ordering issues and have been acked by Ralf.
- A new command line option to set the default interrupt affinity
mask at boot time.
- Support for some more new ARM and MIPS interrupt controllers:
tango, alpine-msix and bcm6345-l1
- Two small cleanups for x86/apic which we merged into irq/core to
avoid yet another branch in x86 with two tiny commits.
- The usual set of updates, cleanups in drivers/irqchip. Mostly in
the area of ARM-GIC, arada-37-xp and atmel chips. Nothing
outstanding here"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
irqchip/irq-alpine-msi: Release the correct domain on error
irqchip/mxs: Fix error check of of_io_request_and_map()
irqchip/sunxi-nmi: Fix error check of of_io_request_and_map()
genirq: Export IRQ functions for module use
irqchip/gic/realview: Support more RealView DCC variants
Documentation/bindings: Document the Alpine MSIX driver
irqchip: Add the Alpine MSIX interrupt controller
irqchip/gic-v3: Always return IRQ_SET_MASK_OK_DONE in gic_set_affinity
irqchip/gic-v3-its: Mark its_init() and its children as __init
irqchip/gic-v3: Remove gic_root_node variable from the ITS code
irqchip/gic-v3: ACPI: Add redistributor support via GICC structures
irqchip/gic-v3: Add ACPI support for GICv3/4 initialization
irqchip/gic-v3: Refactor gic_of_init() for GICv3 driver
x86/apic: Deinline _flat_send_IPI_mask, save ~150 bytes
x86/apic: Deinline __default_send_IPI_*, save ~200 bytes
dt-bindings: interrupt-controller: Add SoC-specific compatible string to Marvell ODMI
irqchip/mips-gic: Add new DT property to reserve IPIs
MIPS: Delete smp-gic.c
MIPS: Make smp CMP, CPS and MT use the new generic IPI functions
MIPS: Add generic SMP IPI support
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"The timer department delivers this time:
- Support for cross clock domain timestamps in the core code plus a
first user. That allows more precise timestamping for PTP and
later for audio and other peripherals.
The ptp/e1000e patches have been acked by the relevant maintainers
and are carried in the timer tree to avoid merge ordering issues.
- Support for unregistering the current clocksource watchdog. That
lifts a limitation for switching clocksources which has been there
from day 1
- The usual pile of fixes and updates to the core and the drivers.
Nothing outstanding and exciting"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
time/timekeeping: Work around false positive GCC warning
e1000e: Adds hardware supported cross timestamp on e1000e nic
ptp: Add PTP_SYS_OFFSET_PRECISE for driver crosstimestamping
x86/tsc: Always Running Timer (ART) correlated clocksource
hrtimer: Revert CLOCK_MONOTONIC_RAW support
time: Add history to cross timestamp interface supporting slower devices
time: Add driver cross timestamp interface for higher precision time synchronization
time: Remove duplicated code in ktime_get_raw_and_real()
time: Add timekeeping snapshot code capturing system time and counter
time: Add cycles to nanoseconds translation
jiffies: Use CLOCKSOURCE_MASK instead of constant
clocksource: Introduce clocksource_freq2mult()
clockevents/drivers/exynos_mct: Implement ->set_state_oneshot_stopped()
clockevents/drivers/arm_global_timer: Implement ->set_state_oneshot_stopped()
clockevents/drivers/arm_arch_timer: Implement ->set_state_oneshot_stopped()
clocksource/drivers/arm_global_timer: Register delay timer
clocksource/drivers/lpc32xx: Support timer-based ARM delay
clocksource/drivers/lpc32xx: Support periodic mode
clocksource/drivers/lpc32xx: Don't use the prescaler counter for clockevents
clocksource/drivers/rockchip: Add err handle for rk_timer_init
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
"The main changes in this cycle were:
- Miscellaneous fixes, cleanups, restructuring.
- RCU torture-test updates"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rcu: Export rcu_gp_is_normal()
rcu: Remove rcu_user_hooks_switch
rcu: Catch up rcu_report_qs_rdp() comment with reality
rcu: Document unique-name limitation for DEFINE_STATIC_SRCU()
rcu: Make rcu/tiny_plugin.h explicitly non-modular
irq: Privatize irq_common_data::state_use_accessors
RCU: Privatize rcu_node::lock
sparse: Add __private to privatize members of structs
rcu: Remove useless rcu_data_p when !PREEMPT_RCU
rcutorture: Correct no-expedite console messages
rcu: Set rdp->gpwrap when CPU is idle
rcu: Stop treating in-kernel CPU-bound workloads as errors
rcu: Update rcu_report_qs_rsp() comment
rcu: Assign false instead of 0 for ->core_needs_qs
rcutorture: Check for self-detected stalls
rcutorture: Don't keep empty console.log.diags files
rcutorture: Add checks for rcutorture writer starvation
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
"This is another big update. Main changes are:
- lots of x86 system call (and other traps/exceptions) entry code
enhancements. In particular the complex parts of the 64-bit entry
code have been migrated to C code as well, and a number of dusty
corners have been refreshed. (Andy Lutomirski)
- vDSO special mapping robustification and general cleanups (Andy
Lutomirski)
- cpufeature refactoring, cleanups and speedups (Borislav Petkov)
- lots of other changes ..."
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
x86/cpufeature: Enable new AVX-512 features
x86/entry/traps: Show unhandled signal for i386 in do_trap()
x86/entry: Call enter_from_user_mode() with IRQs off
x86/entry/32: Change INT80 to be an interrupt gate
x86/entry: Improve system call entry comments
x86/entry: Remove TIF_SINGLESTEP entry work
x86/entry/32: Add and check a stack canary for the SYSENTER stack
x86/entry/32: Simplify and fix up the SYSENTER stack #DB/NMI fixup
x86/entry: Only allocate space for tss_struct::SYSENTER_stack if needed
x86/entry: Vastly simplify SYSENTER TF (single-step) handling
x86/entry/traps: Clear DR6 early in do_debug() and improve the comment
x86/entry/traps: Clear TIF_BLOCKSTEP on all debug exceptions
x86/entry/32: Restore FLAGS on SYSEXIT
x86/entry/32: Filter NT and speed up AC filtering in SYSENTER
x86/entry/compat: In SYSENTER, sink AC clearing below the existing FLAGS test
selftests/x86: In syscall_nt, test NT|TF as well
x86/asm-offsets: Remove PARAVIRT_enabled
x86/entry/32: Introduce and use X86_BUG_ESPFIX instead of paravirt_enabled
uprobes: __create_xol_area() must nullify xol_mapping.fault
x86/cpufeature: Create a new synthetic cpu capability for machine check recovery
...
|
|
* pci/resource:
PCI: Simplify pci_create_attr() control flow
PCI: Don't leak memory if sysfs_create_bin_file() fails
PCI: Simplify sysfs ROM cleanup
PCI: Remove unused IORESOURCE_ROM_COPY and IORESOURCE_ROM_BIOS_COPY
MIPS: Loongson 3: Keep CPU physical (not virtual) addresses in shadow ROM resource
MIPS: Loongson 3: Use temporary struct resource * to avoid repetition
ia64/PCI: Keep CPU physical (not virtual) addresses in shadow ROM resource
ia64/PCI: Use ioremap() instead of open-coded equivalent
ia64/PCI: Use temporary struct resource * to avoid repetition
PCI: Clean up pci_map_rom() whitespace
PCI: Remove arch-specific IORESOURCE_ROM_SHADOW size from sysfs
PCI: Set ROM shadow location in arch code, not in PCI core
PCI: Don't enable/disable ROM BAR if we're using a RAM shadow copy
PCI: Don't assign or reassign immutable resources
PCI: Mark shadow copy of VGA ROM as IORESOURCE_PCI_FIXED
x86/PCI: Mark Broadwell-EP Home Agent & PCU as having non-compliant BARs
PCI: Disable IO/MEM decoding for devices with non-compliant BARs
|