Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"cgroup changes for v4.6-rc1. No userland visible behavior changes in
this pull request. I'll send out a separate pull request for the
addition of cgroup namespace support.
- The biggest change is the revamping of cgroup core task migration
and controller handling logic. There are quite a few places where
controllers and tasks are manipulated. Previously, many of those
places implemented custom operations for each specific use case
assuming specific starting conditions. While this worked, it makes
the code fragile and difficult to follow.
The bulk of this pull request restructures these operations so that
most related operations are performed through common helpers which
implement recursive (subtrees are always processed consistently)
and idempotent (they make cgroup hierarchy converge to the target
state rather than performing operations assuming specific starting
conditions). This makes the code a lot easier to understand,
verify and extend.
- Implicit controller support is added. This is primarily for using
perf_event on the v2 hierarchy so that perf can match cgroup v2
path without requiring the user to do anything special. The kernel
portion of perf_event changes is acked but userland changes are
still pending review.
- cgroup_no_v1= boot parameter added to ease testing cgroup v2 in
certain environments.
- There is a regression introduced during v4.4 devel cycle where
attempts to migrate zombie tasks can mess up internal object
management. This was fixed earlier this week and included in this
pull request w/ stable cc'd.
- Misc non-critical fixes and improvements"
* 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (44 commits)
cgroup: avoid false positive gcc-6 warning
cgroup: ignore css_sets associated with dead cgroups during migration
Documentation: cgroup v2: Trivial heading correction.
cgroup: implement cgroup_subsys->implicit_on_dfl
cgroup: use css_set->mg_dst_cgrp for the migration target cgroup
cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup
cgroup: move migration destination verification out of cgroup_migrate_prepare_dst()
cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses()
cgroup: Trivial correction to reflect controller.
cgroup: remove stale item in cgroup-v1 document INDEX file.
cgroup: update css iteration in cgroup_update_dfl_csses()
cgroup: allocate 2x cgrp_cset_links when setting up a new root
cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask
cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends
cgroup: use cgroup_apply_enable_control() in cgroup creation path
cgroup: combine cgroup_mutex locking and offline css draining
cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()
cgroup: introduce cgroup_{save|propagate|restore}_control()
cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive
cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata updates from Tejun Heo:
- ahci grew runtime power management support so that the controller can
be turned off if no devices are attached.
- sata_via isn't dead yet. It got hotplug support and more refined
workaround for certain WD drives.
- Misc cleanups. There's a merge from for-4.5-fixes to avoid confusing
conflicts in ahci PCI ID table.
* 'for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
ata: ahci_xgene: dereferencing uninitialized pointer in probe
AHCI: Remove obsolete Intel Lewisburg SATA RAID device IDs
ata: sata_rcar: Use ARCH_RENESAS
sata_via: Implement hotplug for VT6421
sata_via: Apply WD workaround only when needed on VT6421
ahci: Add runtime PM support for the host controller
ahci: Add functions to manage runtime PM of AHCI ports
ahci: Convert driver to use modern PM hooks
ahci: Cache host controller version
scsi: Drop runtime PM usage count after host is added
scsi: Set request queue runtime PM status back to active on resume
block: Add blk_set_runtime_active()
ata: ahci_mvebu: add support for Armada 3700 variant
libata: fix unbalanced spin_lock_irqsave/spin_unlock_irq() in ata_scsi_park_show()
libata: support AHCI on OCTEON platform
|
|
Merge second patch-bomb from Andrew Morton:
- a couple of hotfixes
- the rest of MM
- a new timer slack control in procfs
- a couple of procfs fixes
- a few misc things
- some printk tweaks
- lib/ updates, notably to radix-tree.
- add my and Nick Piggin's old userspace radix-tree test harness to
tools/testing/radix-tree/. Matthew said it was a godsend during the
radix-tree work he did.
- a few code-size improvements, switching to __always_inline where gcc
screwed up.
- partially implement character sets in sscanf
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
sscanf: implement basic character sets
lib/bug.c: use common WARN helper
param: convert some "on"/"off" users to strtobool
lib: add "on"/"off" support to kstrtobool
lib: update single-char callers of strtobool()
lib: move strtobool() to kstrtobool()
include/linux/unaligned: force inlining of byteswap operations
include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
usb: common: convert to use match_string() helper
ide: hpt366: convert to use match_string() helper
ata: hpt366: convert to use match_string() helper
power: ab8500: convert to use match_string() helper
power: charger_manager: convert to use match_string() helper
drm/edid: convert to use match_string() helper
pinctrl: convert to use match_string() helper
device property: convert to use match_string() helper
lib/string: introduce match_string() helper
radix-tree tests: add test for radix_tree_iter_next
radix-tree tests: add regression3 test
...
|
|
Pull block driver updates from Jens Axboe:
"This is the block driver pull request for this merge window. It sits
on top of for-4.6/core, that was just sent out.
This contains:
- A set of fixes for lightnvm. One from Alan, fixing an overflow,
and the rest from the usual suspects, Javier and Matias.
- A set of fixes for nbd from Markus and Dan, and a fixup from Arnd
for correct usage of the signed 64-bit divider.
- A set of bug fixes for the Micron mtip32xx, from Asai.
- A fix for the brd discard handling from Bart.
- Update the maintainers entry for cciss, since that hardware has
transferred ownership.
- Three bug fixes for bcache from Eric Wheeler.
- Set of fixes for xen-blk{back,front} from Jan and Konrad.
- Removal of the cpqarray driver. It has been disabled in Kconfig
since 2013, and we were initially scheduled to remove it in 3.15.
- Various updates and fixes for NVMe, with the most important being:
- Removal of the per-device NVMe thread, replacing that with a
watchdog timer instead. From Christoph.
- Exposing the namespace WWID through sysfs, from Keith.
- Set of cleanups from Ming Lin.
- Logging the controller device name instead of the underlying
PCI device name, from Sagi.
- And a bunch of fixes and optimizations from the usual suspects
in this area"
* 'for-4.6/drivers' of git://git.kernel.dk/linux-block: (49 commits)
NVMe: Expose ns wwid through single sysfs entry
drivers:block: cpqarray clean up
brd: Fix discard request processing
cpqarray: remove it from the kernel
cciss: update MAINTAINERS
NVMe: Remove unused sq_head read in completion path
bcache: fix cache_set_flush() NULL pointer dereference on OOM
bcache: cleaned up error handling around register_cache()
bcache: fix race of writeback thread starting before complete initialization
NVMe: Create discard zero quirk white list
nbd: use correct div_s64 helper
mtip32xx: remove unneeded variable in mtip_cmd_timeout()
lightnvm: generalize rrpc ppa calculations
lightnvm: remove struct nvm_dev->total_blocks
lightnvm: rename ->nr_pages to ->nr_sects
lightnvm: update closed list outside of intr context
xen/blback: Fit the important information of the thread in 17 characters
lightnvm: fold get bb tbl when using dual/quad plane mode
lightnvm: fix up nonsensical configure overrun checking
xen-blkback: advertise indirect segment support earlier
...
|
|
Pull core block updates from Jens Axboe:
"Here are the core block changes for this merge window. Not a lot of
exciting stuff going on in this round, most of the changes have been
on the driver side of things. That pull request is coming next. This
pull request contains:
- A set of fixes for chained bio handling from Christoph.
- A tag bounds check for blk-mq from Hannes, ensuring that we don't
do something stupid if a device reports an invalid tag value.
- A set of fixes/updates for the CFQ IO scheduler from Jan Kara.
- A set of blk-mq fixes from Keith, adding support for dynamic
hardware queues, and fixing init of max_dev_sectors for stacking
devices.
- A fix for the dynamic hw context from Ming.
- Enabling of cgroup writeback support on a block device, from
Shaohua"
* 'for-4.6/core' of git://git.kernel.dk/linux-block:
blk-mq: add bounds check on tag-to-rq conversion
block: bio_remaining_done() isn't unlikely
block: cleanup bio_endio
block: factor out chained bio completion
block: don't unecessarily clobber bi_error for chained bios
block-dev: enable writeback cgroup support
blk-mq: Fix NULL pointer updating nr_requests
blk-mq: mark request queue as mq asap
block: Initialize max_dev_sectors to 0
blk-mq: dynamic h/w context count
cfq-iosched: Allow parent cgroup to preempt its child
cfq-iosched: Allow sync noidle workloads to preempt each other
cfq-iosched: Reorder checks in cfq_should_preempt()
cfq-iosched: Don't group_idle if cfqq has big thinktime
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Pull MFD updates from Lee Jones:
"New Drivers:
- Freescale Touch Screen ADC
- X-Powers AXP PMIC with RSB
- TI TPS65086 Power Management IC (PMIC)
New Device Support:
- Supply device PCI IDs for Intel Broxton
Fix-ups:
- Move to clkdev_create() API; intel_quark_i2c_gpio
- Complete re-write of TI's TPS65912 Power Management IC (PMIC)
- Remove unnecessary function argument; axp20x
- Separate out bus related code; axp20x
- Coding Style changes; axp20x
- Allow more drivers to be compiled as modules
- Work around false positive 'used uninitialised' warning; db8500-prcmu
Bug Fixes:
- Remove do_div(); fsl-imx25-gcq
- Fix driver init when built-in; tps65010
- Fix clock-unregister leak; intel-lpss"
* tag 'mfd-for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (53 commits)
mfd: intel-lpss: Pass I2C configuration via properties on BXT
mfd: imx6sx: Add PCIe register definitions for iomuxc gpr
mfd: ipaq-micro: Use __maybe_unused to hide pm functions
mfd: max77686: Add max77802 to I2C device ID table
mfd: max77686: Export OF module alias information
mfd: max77686: Allow driver to be built as a module
mfd: stmpe: Add the proper PWM resources
mfd: tps65090: Set regmap config reg counts properly
mfd: syscon: Return ENOTSUPP instead of ENOSYS when disabled
mfd: as3711: Set regmap config reg counts properly
mfd: rc5t583: Set regmap config reg counts properly
gpio: tps65086: Add GPO driver for the TPS65086 PMIC
mfd: mt6397: Add platform device ID table
mfd: da9063: Fix missing volatile registers in the core regmap_range volatile lists
mfd: mt6397: Add MT6323 support to MT6397 driver
mfd: mt6397: Add support for different Slave types
mfd: mt6397: int_con and int_status may vary in location
dt-bindings: mfd: Add bindings for the MediaTek MT6323 PMIC
mfd: da9062: Fix missing volatile registers in the core regmap_range volatile lists
mfd: Add documentation for ACT8945A DT bindings
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
"After a heavy storm by syzkaller in 4.5 cycle, we have relatively few
changes in the core at this time while a lot of changes are found in
the driver side, unsurprisingly. Below are some highlights:
ALSA core:
- A few more hardening in ALSA timer codes
- An extension of sequencer API for advertising the card / pid
- Small fixes in compress-offload and jack layers
HD-audio:
- Dynamic PCM assignment in HDMI/DP codec; preparation for upcoming
DP-MST support
- Lots of code refactoring for sharing with ASoC SKL driver
- Regression fixes for Intel HDMI/DP
- Fixups for CX20724 codec, Lenovo AiO
USB-audio:
- Add quirk_alias option to make quirk debugging easier
- Fixes for possible Oops by malformed firmware
Firewire:
- Add support for FW-1804 in tascam driver
- Improvements / changes in card registration, multi stream handling,
etc for DICE
- Lots of code refactoring
ASoC:
- Enhancements of still ongoing topology API
- Lots of commits for Intel Skylake support including HDMI support
- A few Intel Atom driver updates for recent devices
- Lots of improvements to the Renesas drivers
- Capture support for Qualcomm drivers
- Support for TI DaVinci DRA7xxx devices
- New machine drivers for Freescale systems with Cirrus CODECs,
Mediatek systems with RT5650 CODECs
- New CPU drivers for Allwinner S/PDIF controllers
- New CODEC drivers for Maxim MAX9867 and MAX98926 and Realtek RT5514"
* tag 'sound-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (291 commits)
ALSA: hda - Fix mutex deadlock at HDMI/DP hotplug
ALSA: ctl: change return value in compatibility layer so that it's the same value in core implementation
ALSA: mixart: silence an uninitialized variable warning
ALSA: usb-audio: Add sanity checks for endpoint accesses
ALSA: usb-audio: Minor code cleanup in create_fixed_stream_quirk()
ALSA: usb-audio: Fix NULL dereference in create_fixed_stream_quirk()
ALSA: hda - Limit i915 HDMI binding only for HSW and later
ALSA: hda - Fix unconditional GPIO toggle via automute
ALSA: mixart: silence unitialized variable warnings
ALSA: hda - Fixes double fault in nvhdmi_chmap_cea_alloc_validate_get_type
ALSA: intel8x0: Add clock quirk entry for AD1981B on IBM ThinkPad X41.
ALSA: hda - Add new GPU codec ID 0x10de0082 to snd-hda
ASoC: rsnd: add simplified module explanation
ASoC: hdac_hdmi: Add broxton device ID
ASoC: Intel: Bxtn: Add Broxton PCI ID
ASoC: Intel: Skylake: Move Skylake dsp ops & loader ops
ASoC: Intel: add dmabuffer to common sst_dsp
ASoC: Intel: Skylake: Unstatify skl_dsp_enable_core
ASoC: Intel: Skylake: Fix whitepsace issues
ASoC: Intel: Skylake: Move module id defines
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma updates from Doug Ledford:
"Initial roundup of 4.6 merge window patches.
This is the first of two pull requests. It is the smaller request,
but touches for more different things (this is everything but what is
in or going into staging). The pull request for the code in
staging/rdma is on hold until after we decide what to do on the
write/writev API issue and may be partially deferred until 4.7 as a
result.
Summary:
- cxgb4 updates
- nes updates
- unification of iwarp portmapper code to core
- add drain_cq API
- various ib_core updates
- minor ipoib updates
- minor mlx4 updates
- more significant mlx5 updates (including a minor merge conflict
with net-next tree...merge is simple to resolve and Stephen's
resolution was confirmed by Mellanox)
- trivial net/9p rdma conversion
- ocrdma RoCEv2 update
- srpt updates"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (85 commits)
iwpm: crash fix for large connections test
iw_cxgb3: support for iWARP port mapping
iw_cxgb4: remove port mapper related code
iw_nes: remove port mapper related code
iwcm: common code for port mapper
net/9p: convert to new CQ API
IB/mlx5: Add support for don't trap rules
net/mlx5_core: Introduce forward to next priority action
net/mlx5_core: Create anchor of last flow table
iser: Accept arbitrary sg lists mapping if the device supports it
mlx5: Add arbitrary sg list support
IB/core: Add arbitrary sg_list support
IB/mlx5: Expose correct max_fast_reg_page_list_len
IB/mlx5: Make coding style more consistent
IB/mlx5: Convert UMR CQ to new CQ API
IB/ocrdma: Skip using unneeded intermediate variable
IB/ocrdma: Skip using unneeded intermediate variable
IB/ocrdma: Delete unnecessary variable initialisations in 11 functions
IB/core: Documentation fix in the MAD header file
IB/core: trivial prink cleanup.
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver updates from Greg KH:
"Here is the big staging driver pull request for 4.6-rc1.
Lots of little things here, over 1600 patches or so. Notable is all
of the good Lustre work happening, those developers have finally woken
up and are cleaning up their code greatly. The Outreachy intern
application process is also happening, which brought in another 400 or
so patches. Full details are in the very long shortlog.
All of these have been in linux-next with no reported issues"
* tag 'staging-4.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1673 commits)
staging: lustre: fix aligments in lnet selftest
staging: lustre: report minimum of two buffers for LNet selftest load test
staging: lustre: test for proper errno code in lstcon_rpc_trans_abort
staging: lustre: filter remaining extra spacing for lnet selftest
staging: lustre: remove extra spacing when setting variable for lnet selftest
staging: lustre: remove extra spacing of variable declartions for lnet selftest
staging: lustre: fix spacing issues checkpatch reported in lnet selftest
staging: lustre: remove returns in void function for lnet selftest
staging: lustre: fix bogus lst errors for lnet selftest
staging: netlogic: Replacing pr_err with dev_err after the call to devm_kzalloc
staging: mt29f_spinand: Replacing pr_info with dev_info after the call to devm_kzalloc
staging: android: ion: fix up file mode
staging: ion: debugfs invalid gfp mask
staging: rts5208: Replace pci_enable_device with pcim_enable_device
Staging: ieee80211: Place constant on right side of the test.
staging: speakup: Replace del_timer with del_timer_sync
staging: lowmemorykiller: fix 2 checks that checkpatch complained
staging: mt29f_spinand: Drop void pointer cast
staging: rdma: hfi1: file_ops: Replace ALIGN with PAGE_ALIGN
staging: rdma: hfi1: driver: Replace IS_ALIGNED with PAGE_ALIGNED
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
"The most notable item is addition of support for Synaptics RMI4
protocol which is native protocol for all current Synaptics devices
(touchscreens, touchpads). In later releases we'll switch devices
using HID and PS/2 protocol emulation to RMI4.
You will also get:
- BYD PS/2 touchpad protocol support for psmouse
- MELFAS MIP4 Touchscreen driver
- rotary encoder was moved away from legacy platform data and to
generic device properties API, devm_* API, and can now handle
encoders using more than 2 GPIOs
- Cypress touchpad driver was switched to devm_* API and device
properties
- other assorted driver fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (40 commits)
ARM: pxa/raumfeld: use PROPERTY_ENTRY_INTEGER to define props
Input: synaptics-rmi4 - using logical instead of bitwise AND
Input: powermate - fix oops with malicious USB descriptors
Input: snvs_pwrkey - fix returned value check of syscon_regmap_lookup_by_phandle()
MAINTAINERS: add devicetree bindings to Input Drivers section
Input: synaptics-rmi4 - add device tree support to the SPI transport driver
Input: synaptics-rmi4 - add SPI transport driver
Input: synaptics-rmi4 - add support for F30
Input: synaptics-rmi4 - add support for F12
Input: synaptics-rmi4 - add device tree support for 2d sensors and F11
Input: synaptics-rmi4 - add support for 2D sensors and F11
Input: synaptics-rmi4 - add device tree support for RMI4 I2C devices
Input: synaptics-rmi4 - add I2C transport driver
Input: synaptics-rmi4 - add support for Synaptics RMI4 devices
Input: ad7879 - add device tree support
Input: ad7879 - fix default x/y axis assignment
Input: ad7879 - move header to platform_data directory
Input: ts4800 - add hardware dependency
Input: cyapa - fix for losing events during device power transitions
Input: sh_keysc - remove dependency on SUPERH
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatching update from Jiri Kosina:
- cleanup of module notifiers; this depends on a module.c cleanup which
has been acked by Rusty; from Jessica Yu
- small assorted fixes and MAINTAINERS update
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch/module: remove livepatch module notifier
modules: split part of complete_formation() into prepare_coming_module()
livepatch: Update maintainers
livepatch: Fix the error message about unresolvable ambiguity
klp: remove CONFIG_LIVEPATCH dependency from klp headers
klp: remove superfluous errors in asm/livepatch.h
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO updates from Linus Walleij:
"This is the bulk of GPIO changes for kernel v4.6. There is quite a
lot of interesting stuff going on.
The patches to other subsystems and arch-wide are ACKed as far as
possible, though I consider things like per-arch <asm/gpio.h> as
essentially a part of the GPIO subsystem so it should not be needed.
Core changes:
- The gpio_chip is now a *real device*. Until now the gpio chips
were just piggybacking the parent device or (gasp) floating in
space outside of the device model.
We now finally make GPIO chips devices. The gpio_chip will create
a gpio_device which contains a struct device, and this gpio_device
struct is kept private. Anything that needs to be kept private
from the rest of the kernel will gradually be moved over to the
gpio_device.
- As a result of making the gpio_device a real device, we have added
resource management, so devm_gpiochip_add_data() will cut down on
overhead and reduce code lines. A huge slew of patches convert
almost all drivers in the subsystem to use this.
- Building on making the GPIO a real device, we add the first step of
a new userspace ABI: the GPIO character device. We take small
steps here, so we first add a pure *information* ABI and the tool
"lsgpio" that will list all GPIO devices on the system and all
lines on these devices.
We can now discover GPIOs properly from userspace. We still have
not come up with a way to actually *use* GPIOs from userspace.
- To encourage people to use the character device for the future, we
have it always-enabled when using GPIO. The old sysfs ABI is still
opt-in (and can be used in parallel), but is marked as deprecated.
We will keep it around for the foreseeable future, but it will not
be extended to cover ever more use cases.
Cleanup:
- Bjorn Helgaas removed a whole slew of per-architecture <asm/gpio.h>
includes.
This dates back to when GPIO was an opt-in feature and no shared
library even existed: just a header file with proper prototypes was
provided and all semantics were up to the arch to implement. These
patches make the GPIO chip even more a proper device and cleans out
leftovers of the old in-kernel API here and there.
Still some cruft is left but it's very little now.
- There is still some clamping of return values for .get() going on,
but we now return sane values in the vast majority of drivers and
the errorpath is sanitized. Some patches for powerpc, blackfin and
unicore still drop in.
- We continue to switch the ARM, MIPS, blackfin, m68k local GPIO
implementations to use gpiochip_add_data() and cut down on code
lines.
- MPC8xxx is converted to use the generic GPIO helpers.
- ATH79 is converted to use the generic GPIO helpers.
New drivers:
- WinSystems WS16C48
- Acces 104-DIO-48E
- F81866 (a F7188x variant)
- Qoric (a MPC8xxx variant)
- TS-4800
- SPI serializers (pisosr): simple 74xx shift registers connected to
SPI to obtain a dirt-cheap output-only GPIO expander.
- Texas Instruments TPIC2810
- Texas Instruments TPS65218
- Texas Instruments TPS65912
- X-Gene (ARM64) standby GPIO controller"
* tag 'gpio-v4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (194 commits)
Revert "Share upstreaming patches"
gpio: mcp23s08: Fix clearing of interrupt.
gpiolib: Fix comment referring to gpio_*() in gpiod_*()
gpio: pca953x: Fix pca953x_gpio_set_multiple() on 64-bit
gpio: xgene: Fix kconfig for standby GIPO contoller
gpio: Add generic serializer DT binding
gpio: uapi: use 0xB4 as ioctl() major
gpio: tps65912: fix bad merge
Revert "gpio: lp3943: Drop pin_used and lp3943_gpio_request/lp3943_gpio_free"
gpio: omap: drop dev field from gpio_bank structure
gpio: mpc8xxx: Slightly update the code for better readability
gpio: mpc8xxx: Remove *read_reg and *write_reg from struct mpc8xxx_gpio_chip
gpio: mpc8xxx: Fixup setting gpio direction output
gpio: mcp23s08: Add support for mcp23s18
dt-bindings: gpio: altera: Fix altr,interrupt-type property
gpio: add driver for MEN 16Z127 GPIO controller
gpio: lp3943: Drop pin_used and lp3943_gpio_request/lp3943_gpio_free
gpio: timberdale: Switch to devm_ioremap_resource()
gpio: ts4800: Add IMX51 dependency
gpiolib: rewrite gpiodev_add_to_list
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
"Here are the main arm64 updates for 4.6. There are some relatively
intrusive changes to support KASLR, the reworking of the kernel
virtual memory layout and initial page table creation.
Summary:
- Initial page table creation reworked to avoid breaking large block
mappings (huge pages) into smaller ones. The ARM architecture
requires break-before-make in such cases to avoid TLB conflicts but
that's not always possible on live page tables
- Kernel virtual memory layout: the kernel image is no longer linked
to the bottom of the linear mapping (PAGE_OFFSET) but at the bottom
of the vmalloc space, allowing the kernel to be loaded (nearly)
anywhere in physical RAM
- Kernel ASLR: position independent kernel Image and modules being
randomly mapped in the vmalloc space with the randomness is
provided by UEFI (efi_get_random_bytes() patches merged via the
arm64 tree, acked by Matt Fleming)
- Implement relative exception tables for arm64, required by KASLR
(initial code for ARCH_HAS_RELATIVE_EXTABLE added to lib/extable.c
but actual x86 conversion to deferred to 4.7 because of the merge
dependencies)
- Support for the User Access Override feature of ARMv8.2: this
allows uaccess functions (get_user etc.) to be implemented using
LDTR/STTR instructions. Such instructions, when run by the kernel,
perform unprivileged accesses adding an extra level of protection.
The set_fs() macro is used to "upgrade" such instruction to
privileged accesses via the UAO bit
- Half-precision floating point support (part of ARMv8.2)
- Optimisations for CPUs with or without a hardware prefetcher (using
run-time code patching)
- copy_page performance improvement to deal with 128 bytes at a time
- Sanity checks on the CPU capabilities (via CPUID) to prevent
incompatible secondary CPUs from being brought up (e.g. weird
big.LITTLE configurations)
- valid_user_regs() reworked for better sanity check of the
sigcontext information (restored pstate information)
- ACPI parking protocol implementation
- CONFIG_DEBUG_RODATA enabled by default
- VDSO code marked as read-only
- DEBUG_PAGEALLOC support
- ARCH_HAS_UBSAN_SANITIZE_ALL enabled
- Erratum workaround Cavium ThunderX SoC
- set_pte_at() fix for PROT_NONE mappings
- Code clean-ups"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (99 commits)
arm64: kasan: Fix zero shadow mapping overriding kernel image shadow
arm64: kasan: Use actual memory node when populating the kernel image shadow
arm64: Update PTE_RDONLY in set_pte_at() for PROT_NONE permission
arm64: Fix misspellings in comments.
arm64: efi: add missing frame pointer assignment
arm64: make mrs_s prefixing implicit in read_cpuid
arm64: enable CONFIG_DEBUG_RODATA by default
arm64: Rework valid_user_regs
arm64: mm: check at build time that PAGE_OFFSET divides the VA space evenly
arm64: KVM: Move kvm_call_hyp back to its original localtion
arm64: mm: treat memstart_addr as a signed quantity
arm64: mm: list kernel sections in order
arm64: lse: deal with clobbered IP registers after branch via PLT
arm64: mm: dump: Use VA_START directly instead of private LOWEST_ADDR
arm64: kconfig: add submenu for 8.2 architectural features
arm64: kernel: acpi: fix ioremap in ACPI parking protocol cpu_postboot
arm64: Add support for Half precision floating point
arm64: Remove fixmap include fragility
arm64: Add workaround for Cavium erratum 27456
arm64: mm: Mark .rodata as RO
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux
Pull pstore update from Tony Luck:
"Allow ram backend to be configured with addresses above 4GB"
* tag 'please-pull-pstore' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux:
pstore: Add support for 64 Bit address space
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Performance improvements in SEEK_DATA and xattr scalability
improvements, plus a lot of clean ups and bug fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (38 commits)
ext4: clean up error handling in the MMP support
jbd2: do not fail journal because of frozen_buffer allocation failure
ext4: use __GFP_NOFAIL in ext4_free_blocks()
ext4: fix compile error while opening the macro DOUBLE_CHECK
ext4: print ext4 mount option data_err=abort correctly
ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()
ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry()
ext4: fix misspellings in comments.
jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path
ext4: more efficient SEEK_DATA implementation
ext4: cleanup handling of bh->b_state in DAX mmap
ext4: return hole from ext4_map_blocks()
ext4: factor out determining of hole size
ext4: fix setting of referenced bit in ext4_es_lookup_extent()
ext4: remove i_ioend_count
ext4: simplify io_end handling for AIO DIO
ext4: move trans handling and completion deferal out of _ext4_get_block
ext4: rename and split get blocks functions
ext4: use i_mutex to serialize unaligned AIO DIO
ext4: pack ioend structure better
...
|
|
Pull configfs updates from Christoph Hellwig:
- A large patch from me to simplify setting up the list of default
groups by actually implementing it as a list instead of an array.
- a small Y2083 prep patch from Deepa Dinamani. Probably doesn't
matter on it's own, but it seems like he is trying to get rid of all
CURRENT_TIME uses in file systems, which is a worthwhile goal.
* tag 'configfs-for-linus' of git://git.infradead.org/users/hch/configfs:
configfs: switch ->default groups to a linked list
configfs: Replace CURRENT_TIME by current_fs_time()
|
|
This changes several users of manual "on"/"off" parsing to use
strtobool.
Some side-effects:
- these uses will now parse y/n/1/0 meaningfully too
- the early_param uses will now bubble up parse errors
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Amitkumar Karwar <akarwar@marvell.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Joe Perches <joe@perches.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Nishant Sarmukadam <nishants@marvell.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Steve French <sfrench@samba.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Create the kstrtobool_from_user() helper and move strtobool() logic into
the new kstrtobool() (matching all the other kstrto* functions).
Provides an inline wrapper for existing strtobool() callers.
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Joe Perches <joe@perches.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Amitkumar Karwar <akarwar@marvell.com>
Cc: Nishant Sarmukadam <nishants@marvell.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Steve French <sfrench@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
the following functions get deinlined many times.
Examples of disassembly:
<get_unaligned_be16> (24 copies, 108 calls):
66 8b 07 mov (%rdi),%ax
55 push %rbp
48 89 e5 mov %rsp,%rbp
86 e0 xchg %ah,%al
5d pop %rbp
c3 retq
<get_unaligned_be32> (25 copies, 181 calls):
8b 07 mov (%rdi),%eax
55 push %rbp
48 89 e5 mov %rsp,%rbp
0f c8 bswap %eax
5d pop %rbp
c3 retq
<get_unaligned_be64> (23 copies, 94 calls):
48 8b 07 mov (%rdi),%rax
55 push %rbp
48 89 e5 mov %rsp,%rbp
48 0f c8 bswap %rax
5d pop %rbp
c3 retq
<put_unaligned_be16> (2 copies, 11 calls):
89 f8 mov %edi,%eax
55 push %rbp
c1 ef 08 shr $0x8,%edi
c1 e0 08 shl $0x8,%eax
09 c7 or %eax,%edi
48 89 e5 mov %rsp,%rbp
66 89 3e mov %di,(%rsi)
<put_unaligned_be32> (8 copies, 43 calls):
55 push %rbp
0f cf bswap %edi
89 3e mov %edi,(%rsi)
48 89 e5 mov %rsp,%rbp
5d pop %rbp
c3 retq
<put_unaligned_be64> (26 copies, 157 calls):
55 push %rbp
48 0f cf bswap %rdi
48 89 3e mov %rdi,(%rsi)
48 89 e5 mov %rsp,%rbp
5d pop %rbp
c3 retq
This patch fixes this via s/inline/__always_inline/.
It only affects arches with efficient unaligned access insns, such as x86.
(arched which lack such ops do not include linux/unaligned/access_ok.h)
Code size decrease after the patch is ~8.5k:
text data bss dec hex filename
92197848 20826112 36417536 149441496 8e84bd8 vmlinux
92189231 20826144 36417536 149432911 8e82a4f vmlinux6_unaligned_be_after
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Occasionally we have to search for an occurrence of a string in an array
of strings. Make a simple helper for that purpose.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Heikki Krogerus <heikki.krogerus@linux.intel.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
shmem likes to occasionally drop the lock, schedule, then reacqire the
lock and continue with the iteration from the last place it left off.
This is currently done with a pretty ugly goto. Introduce
radix_tree_iter_next() and use it throughout shmem.c.
[koct9i@gmail.com: fix bug in radix_tree_iter_next() for tagged iteration]
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
With huge pages, it is convenient to have the radix tree be able to
return an entry that covers multiple indices. Previous attempts to deal
with the problem have involved inserting N duplicate entries, which is a
waste of memory and leads to problems trying to handle aliased tags, or
probing the tree multiple times to find alternative entries which might
cover the requested index.
This approach inserts one canonical entry into the tree for a given
range of indices, and may also insert other entries in order to ensure
that lookups find the canonical entry.
This solution only tolerates inserting powers of two that are greater
than the fanout of the tree. If we wish to expand the radix tree's
abilities to support large-ish pages that is less than the fanout at the
penultimate level of the tree, then we would need to add one more step
in lookup to ensure that any sibling nodes in the final level of the
tree are dereferenced and we return the canonical entry that they
reference.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The radix-tree header uses the __ffs() function, which is defined in
bitops.h. The current kernel headers implicitly include bitops.h, but
the userspace test harness does not.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
hlist_bl_unhashed() and hlist_bl_empty() are all boolean functions, so
return bool instead of int.
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There are various email addresses for me throughout the kernel. Use the
one that will always be valid.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patchset introduces a /proc/<pid>/timerslack_ns interface which
would allow controlling processes to be able to set the timerslack value
on other processes in order to save power by avoiding wakeups (Something
Android currently does via out-of-tree patches).
The first patch tries to fix the internal timer_slack_ns usage which was
defined as a long, which limits the slack range to ~4 seconds on 32bit
systems. It converts it to a u64, which provides the same basically
unlimited slack (500 years) on both 32bit and 64bit machines.
The second patch introduces the /proc/<pid>/timerslack_ns interface
which allows the full 64bit slack range for a task to be read or set on
both 32bit and 64bit machines.
With these two patches, on a 32bit machine, after setting the slack on
bash to 10 seconds:
$ time sleep 1
real 0m10.747s
user 0m0.001s
sys 0m0.005s
The first patch is a little ugly, since I had to chase the slack delta
arguments through a number of functions converting them to u64s. Let me
know if it makes sense to break that up more or not.
Other than that things are fairly straightforward.
This patch (of 2):
The timer_slack_ns value in the task struct is currently a unsigned
long. This means that on 32bit applications, the maximum slack is just
over 4 seconds. However, on 64bit machines, its much much larger (~500
years).
This disparity could make application development a little (as well as
the default_slack) to a u64. This means both 32bit and 64bit systems
have the same effective internal slack range.
Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
the interface as a unsigned long, so we preserve that limitation on
32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
actually larger then what can be stored by an unsigned long.
This patch also modifies hrtimer functions which specified the slack
delta as a unsigned long.
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oren Laadan <orenl@cellrox.com>
Cc: Ruchi Kandoi <kandoiruchi@google.com>
Cc: Rom Lemarchand <romlem@android.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Android Kernel Team <kernel-team@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
freeze_page() and unfreeze_page() helpers evolved in rather complex
beasts. It would be nice to cut complexity of this code.
This patch rewrites freeze_page() using standard try_to_unmap().
unfreeze_page() is rewritten with remove_migration_ptes().
The result is much simpler.
But the new variant is somewhat slower for PTE-mapped THPs. Current
helpers iterates over VMAs the compound page is mapped to, and then over
ptes within this VMA. New helpers iterates over small page, then over
VMA the small page mapped to, and only then find relevant pte.
We have short cut for PMD-mapped THP: we directly install migration
entries on PMD split.
I don't think the slowdown is critical, considering how much simpler
result is and that split_huge_page() is quite rare nowadays. It only
happens due memory pressure or migration.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Make remove_migration_ptes() available to be used in split_huge_page().
New parameter 'locked' added: as with try_to_umap() we need a way to
indicate that caller holds rmap lock.
We also shouldn't try to mlock() pte-mapped huge pages: pte-mapeed THP
pages are never mlocked.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add support for two ttu_flags:
- TTU_SPLIT_HUGE_PMD would split PMD if it's there, before trying to
unmap page;
- TTU_RMAP_LOCKED indicates that caller holds relevant rmap lock;
Also, change rwc->done to !page_mapcount() instead of !page_mapped().
try_to_unmap() works on pte level, so we are really interested in the
mappedness of this small page rather than of the compound page it's a
part of.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patchset rewrites freeze_page() and unfreeze_page() using
try_to_unmap() and remove_migration_ptes(). Result is much simpler, but
somewhat slower.
Migration 8GiB worth of PMD-mapped THP:
Baseline 20.21 +/- 0.393
Patched 20.73 +/- 0.082
Slowdown 1.03x
It's 3% slower, comparing to 14% in v1. I don't it should be a stopper.
Splitting of PTE-mapped pages slowed more. But this is not a common
case.
Migration 8GiB worth of PMD-mapped THP:
Baseline 20.39 +/- 0.225
Patched 22.43 +/- 0.496
Slowdown 1.10x
rmap_walk_locked() is the same as rmap_walk(), but the caller takes care
of the relevant rmap lock.
This is preparation for switching THP splitting from custom rmap walk in
freeze_page()/unfreeze_page() to the generic one.
There is no support for KSM pages for now: not clear which lock is
implied.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The define has a comment from Nick Piggin from 2007:
/* For backwards compat. Remove me quickly. */
I guess 9 years should not be too hurried sense of 'quickly' even for
kernel measures.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
mm zones that are bumping up against the current maximum limit of 4
zones, i.e. 2 bits in page->flags for the GFP_ZONE_TABLE.
The GFP_ZONE_TABLE poses an interesting constraint since
include/linux/gfp.h gets included by the 32-bit portion of a 64-bit
build. We need to be careful to only build the table for zones that
have a corresponding gfp_t flag. GFP_ZONES_SHIFT is introduced for this
purpose. This patch does not attempt to solve the problem of adding a
new zone that also has a corresponding GFP_ flag.
Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and
SPARSEMEM_VMEMMAP implies that SECTIONS_WIDTH is zero. In other words
even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE it is free to
consume another bit in page->flags (expand ZONES_WIDTH) with room to
spare.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Mark <markk@clara.co.uk>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
CMA allocation should be guaranteed to succeed by definition, but,
unfortunately, it would be failed sometimes. It is hard to track down
the problem, because it is related to page reference manipulation and we
don't have any facility to analyze it.
This patch adds tracepoints to track down page reference manipulation.
With it, we can find exact reason of failure and can fix the problem.
Following is an example of tracepoint output. (note: this example is
stale version that printing flags as the number. Recent version will
print it as human readable string.)
<...>-9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
<...>-9018 [004] 92.678378: kernel_stack:
=> get_page_from_freelist (ffffffff81176659)
=> __alloc_pages_nodemask (ffffffff81176d22)
=> alloc_pages_vma (ffffffff811bf675)
=> handle_mm_fault (ffffffff8119e693)
=> __do_page_fault (ffffffff810631ea)
=> trace_do_page_fault (ffffffff81063543)
=> do_async_page_fault (ffffffff8105c40a)
=> async_page_fault (ffffffff817581d8)
[snip]
<...>-9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
[snip]
...
...
<...>-9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
[snip]
<...>-9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
=> release_pages (ffffffff8117c9e4)
=> free_pages_and_swap_cache (ffffffff811b0697)
=> tlb_flush_mmu_free (ffffffff81199616)
=> tlb_finish_mmu (ffffffff8119a62c)
=> exit_mmap (ffffffff811a53f7)
=> mmput (ffffffff81073f47)
=> do_exit (ffffffff810794e9)
=> do_group_exit (ffffffff81079def)
=> SyS_exit_group (ffffffff81079e74)
=> entry_SYSCALL_64_fastpath (ffffffff817560b6)
This output shows that problem comes from exit path. In exit path, to
improve performance, pages are not freed immediately. They are gathered
and processed by batch. During this process, migration cannot be
possible and CMA allocation is failed. This problem is hard to find
without this page reference tracepoint facility.
Enabling this feature bloat kernel text 30 KB in my configuration.
text data bss dec hex filename
12127327 2243616 1507328 15878271 f2487f vmlinux_disabled
12157208 2258880 1507328 15923416 f2f8d8 vmlinux_enabled
Note that, due to header file dependency problem between mm.h and
tracepoint.h, this feature has to open code the static key functions for
tracepoints. Proposed by Steven Rostedt in following link.
https://lkml.org/lkml/2015/12/9/699
[arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()]
[iamjoonsoo.kim@lge.com: fix build failure for xtensa]
[akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The success of CMA allocation largely depends on the success of
migration and key factor of it is page reference count. Until now, page
reference is manipulated by direct calling atomic functions so we cannot
follow up who and where manipulate it. Then, it is hard to find actual
reason of CMA allocation failure. CMA allocation should be guaranteed
to succeed so finding offending place is really important.
In this patch, call sites where page reference is manipulated are
converted to introduced wrapper function. This is preparation step to
add tracepoint to each page reference manipulation function. With this
facility, we can easily find reason of CMA allocation failure. There is
no functional change in this patch.
In addition, this patch also converts reference read sites. It will
help a second step that renames page._count to something else and
prevents later attempt to direct access to it (Suggested by Andrew).
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
THP defrag is enabled by default to direct reclaim/compact but not wake
kswapd in the event of a THP allocation failure. The problem is that
THP allocation requests potentially enter reclaim/compaction. This
potentially incurs a severe stall that is not guaranteed to be offset by
reduced TLB misses. While there has been considerable effort to reduce
the impact of reclaim/compaction, it is still a high cost and workloads
that should fit in memory fail to do so. Specifically, a simple
anon/file streaming workload will enter direct reclaim on NUMA at least
even though the working set size is 80% of RAM. It's been years and
it's time to throw in the towel.
First, this patch defines THP defrag as follows;
madvise: A failed allocation will direct reclaim/compact if the application requests it
never: Neither reclaim/compact nor wake kswapd
defer: A failed allocation will wake kswapd/kcompactd
always: A failed allocation will direct reclaim/compact (historical behaviour)
khugepaged defrag will enter direct/reclaim but not wake kswapd.
Next it sets the default defrag option to be "madvise" to only enter
direct reclaim/compaction for applications that specifically requested
it.
Lastly, it removes a check from the page allocator slowpath that is
related to __GFP_THISNODE to allow "defer" to work. The callers that
really cares are slub/slab and they are updated accordingly. The slab
one may be surprising because it also corrects a comment as kswapd was
never woken up by that path.
This means that a THP fault will no longer stall for most applications
by default and the ideal for most users that get THP if they are
immediately available. There are still options for users that prefer a
stall at startup of a new application by either restoring historical
behaviour with "always" or pick a half-way point with "defer" where
kswapd does some of the work in the background and wakes kcompactd if
necessary. THP defrag for khugepaged remains enabled and will enter
direct/reclaim but no wakeup kswapd or kcompactd.
After this patch a THP allocation failure will quickly fallback and rely
on khugepaged to recover the situation at some time in the future. In
some cases, this will reduce THP usage but the benefit of THP is hard to
measure and not a universal win where as a stall to reclaim/compaction
is definitely measurable and can be painful.
The first test for this is using "usemem" to read a large file and write
a large anonymous mapping (to avoid the zero page) multiple times. The
total size of the mappings is 80% of RAM and the benchmark simply
measures how long it takes to complete. It uses multiple threads to see
if that is a factor. On UMA, the performance is almost identical so is
not reported but on NUMA, we see this
usemem
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)
For a single thread, the benchmark completes 43.23% faster with this
patch applied with smaller benefits as the thread increases. Similar,
notice the large reduction in most cases in system CPU usage. The
overall CPU time is
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
User 10357.65 10438.33
System 3988.88 3543.94
Elapsed 2203.01 1634.41
Which is substantial. Now, the reclaim figures
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 128458477 278352931
Major Faults 2174976 225
Swap Ins 16904701 0
Swap Outs 17359627 0
Allocation stalls 43611 0
DMA allocs 0 0
DMA32 allocs 19832646 19448017
Normal allocs 614488453 580941839
Movable allocs 0 0
Direct pages scanned 24163800 0
Kswapd pages scanned 0 0
Kswapd pages reclaimed 0 0
Direct pages reclaimed 20691346 0
Compaction stalls 42263 0
Compaction success 938 0
Compaction failures 41325 0
This patch eliminates almost all swapping and direct reclaim activity.
There is still overhead but it's from NUMA balancing which does not
identify that it's pointless trying to do anything with this workload.
I also tried the thpscale benchmark which forces a corner case where
compaction can be used heavily and measures the latency of whether base
or huge pages were used
thpscale Fault Latencies
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)
The average time to fault pages is substantially reduced in the majority
of caseds but with the obvious caveat that fewer THPs are actually used
in this adverse workload
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 37429143 47564000
Major Faults 1916 1558
Swap Ins 1466 1079
Swap Outs 2936863 149626
Allocation stalls 62510 3
DMA allocs 0 0
DMA32 allocs 6566458 6401314
Normal allocs 216361697 216538171
Movable allocs 0 0
Direct pages scanned 25977580 17998
Kswapd pages scanned 0 3638931
Kswapd pages reclaimed 0 207236
Direct pages reclaimed 8833714 88
Compaction stalls 103349 5
Compaction success 270 4
Compaction failures 103079 1
Note again that while this does swap as it's an aggressive workload, the
direct relcim activity and allocation stalls is substantially reduced.
There is some kswapd activity but ftrace showed that the kswapd activity
was due to normal wakeups from 4K pages being allocated.
Compaction-related stalls and activity are almost eliminated.
I also tried the stutter benchmark. For this, I do not have figures for
NUMA but it's something that does impact UMA so I'll report what is
available
stutter
4.4.0 4.4.0
kcompactd-v1r1 nodefrag-v1r3
Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)
This benchmark is trying to fault an anonymous mapping while there is a
heavy IO load -- a scenario that desktop users used to complain about
frequently. This shows a mix because the ideal case of mapping with THP
is not hit as often. However, note that 99% of the mappings complete
13.79% faster. The CPU usage here is particularly interesting
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
User 67.50 0.99
System 1327.88 91.30
Elapsed 2079.00 2128.98
And once again we look at the reclaim figures
4.4.0 4.4.0
kcompactd-v1r1nodefrag-v1r3
Minor Faults 335241922 1314582827
Major Faults 715 819
Swap Ins 0 0
Swap Outs 0 0
Allocation stalls 532723 0
DMA allocs 0 0
DMA32 allocs 1822364341 1177950222
Normal allocs 1815640808 1517844854
Movable allocs 0 0
Direct pages scanned 21892772 0
Kswapd pages scanned 20015890 41879484
Kswapd pages reclaimed 19961986 41822072
Direct pages reclaimed 21892741 0
Compaction stalls 1065755 0
Compaction success 514 0
Compaction failures 1065241 0
Allocation stalls and all direct reclaim activity is eliminated as well
as compaction-related stalls.
THP gives impressive gains in some cases but only if they are quickly
available. We're not going to reach the point where they are completely
free so lets take the costs out of the fast paths finally and defer the
cost to kswapd, kcompactd and khugepaged where it belongs.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since __GFP_NOACCOUNT was removed by commit 20b5c3039863 ("Revert 'gfp:
add __GFP_NOACCOUNT'"), its description is not necessary.
Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In machines with 140G of memory and enterprise flash storage, we have
seen read and write bursts routinely exceed the kswapd watermarks and
cause thundering herds in direct reclaim. Unfortunately, the only way
to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
system's emergency reserves - which is entirely unrelated to the
system's latency requirements. In order to get kswapd to maintain a
250M buffer of free memory, the emergency reserves need to be set to 1G.
That is a lot of memory wasted for no good reason.
On the other hand, it's reasonable to assume that allocation bursts and
overall allocation concurrency scale with memory capacity, so it makes
sense to make kswapd aggressiveness a function of that as well.
Change the kswapd watermark scale factor from the currently fixed 25% of
the tunable emergency reserve to a tunable 0.1% of memory.
Beyond 1G of memory, this will produce bigger watermark steps than the
current formula in default settings. Ensure that the new formula never
chooses steps smaller than that, i.e. 25% of the emergency reserve.
On a 140G machine, this raises the default watermark steps - the
distance between min and low, and low and high - from 16M to 143M.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There are few things about *pte_alloc*() helpers worth cleaning up:
- 'vma' argument is unused, let's drop it;
- most __pte_alloc() callers do speculative check for pmd_none(),
before taking ptl: let's introduce pte_alloc() macro which does
the check.
The only direct user of __pte_alloc left is userfaultfd, which has
different expectation about atomicity wrt pmd.
- pte_alloc_map() and pte_alloc_map_lock() are redefined using
pte_alloc().
[sudeep.holla@arm.com: fix build for arm64 hugetlbpage]
[sfr@canb.auug.org.au: fix arch/arm/mm/mmu.c some more]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
statistics protocol, corresponding to 'Available' in /proc/meminfo.
It indicates to the hypervisor how big the balloon can be inflated
without pushing the guest system to swap. This metric would be very
useful in VM orchestration software to improve memory management of
different VMs under overcommit.
This patch (of 2):
Factor out calculation of the available memory counter into a separate
exportable function, in order to be able to use it in other parts of the
kernel.
In particular, it appears a relevant metric to report to the hypervisor
via virtio-balloon statistics interface (in a followup patch).
Signed-off-by: Igor Redko <redkoi@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Get list of VMA flags up-to-date and sort it to match VM_* definition
order.
[vbabka@suse.cz: add a note above vmaflag definitions to update the names when changing]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
max_map_count sysctl unrelated to scheduler. Move its bits from
include/linux/sched/sysctl.h to include/linux/mm.h.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Count how many times we put a THP in split queue. Currently, it happens
on partial unmap of a THP.
Rapidly growing value can indicate that an application behaves
unfriendly wrt THP: often fault in huge page and then unmap part of it.
This leads to unnecessary memory fragmentation and the application may
require tuning.
The event also can help with debugging kernel [mis-]behaviour.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Workingset code was recently made memcg aware, but shadow node shrinker
is still global. As a result, one small cgroup can consume all memory
available for shadow nodes, possibly hurting other cgroups by reclaiming
their shadow nodes, even though reclaim distances stored in its shadow
nodes have no effect. To avoid this, we need to make shadow node
shrinker memcg aware.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
As kmem accounting is now either enabled for all cgroups or disabled
system-wide, there's no point in having memcg_kmem_online() helper -
instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
shrink_slab() now does.
There are only two places left where this helper is used -
__memcg_kmem_charge() and memcg_create_kmem_cache(). The former can
only be called if memcg_kmem_enabled() returned true. Since the cgroup
it operates on is online, mem_cgroup_is_root() check will be enough.
memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
of memcg_kmem_online(), because it relies on the fact that in
memcg_offline_kmem() memcg->kmem_state is changed before
memcg_deactivate_kmem_caches() is called, but there we can just
open-code the check.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
the following functions get deinlined many times.
Examples of disassembly:
<SetPageUptodate> (43 copies, 141 calls):
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 80 0f 08 lock orb $0x8,(%rdi)
5d pop %rbp
c3 retq
<PagePrivate> (10 copies, 134 calls):
48 8b 07 mov (%rdi),%rax
55 push %rbp
48 89 e5 mov %rsp,%rbp
48 c1 e8 0b shr $0xb,%rax
83 e0 01 and $0x1,%eax
5d pop %rbp
c3 retq
This patch fixes this via s/inline/__always_inline/.
Code size decrease after the patch is ~7k:
text data bss dec hex filename
92125002 20826048 36417536 149368586 8e72f0a vmlinux
92118087 20826112 36417536 149361735 8e71447 vmlinux7_pageops_after
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
With both gcc 4.7.2 and 4.9.2, sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
set_buffer_foo(), clear_buffer_foo() and similar functions get deinlined
about 60 times. Examples of disassembly:
<set_buffer_mapped> (14 copies, 43 calls):
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 80 0f 20 lock orb $0x20,(%rdi)
5d pop %rbp
c3 retq
<buffer_mapped> (3 copies, 34 calls):
48 8b 07 mov (%rdi),%rax
55 push %rbp
48 89 e5 mov %rsp,%rbp
48 c1 e8 05 shr $0x5,%rax
83 e0 01 and $0x1,%eax
5d pop %rbp
c3 retq
<set_buffer_new> (5 copies, 13 calls):
55 push %rbp
48 89 e5 mov %rsp,%rbp
f0 80 0f 40 lock orb $0x40,(%rdi)
5d pop %rbp
c3 retq
This patch fixes this via s/inline/__always_inline/.
This decreases vmlinux by about 3 kbytes.
text data bss dec hex filename
88200439 19905208 36421632 144527279 89d4faf vmlinux2
88197239 19905240 36421632 144524111 89d434f vmlinux
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
is inconvenient when grasping how free pages are distributed. This
patch sets KPF_BUDDY for such pages.
With this patch:
$ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
MemFree: 3134992 kB
flags page-count MB symbolic-flags long-symbolic-flags
0x0000000000000400 779272 3044 __________B_______________________________ buddy
0x0000000000000c00 4385 17 __________BM______________________________ buddy,mmap
total 783657 3061
783657 pages is 3134628 kB (roughly consistent with the global counter,)
so it's OK.
[akpm@linux-foundation.org: update comment, per Naoya]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Show how much memory is allocated to kernel stacks.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Show how much memory is used for storing reclaimable and unreclaimable
in-kernel data structures allocated from slab caches.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|