summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2016-12-13Merge tag 'pm-4.10-rc1' of ↵Linus Torvalds1-1/+8
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "Again, cpufreq gets more changes than the other parts this time (one new driver, one old driver less, a bunch of enhancements of the existing code, new CPU IDs, fixes, cleanups) There also are some changes in cpuidle (idle injection rework, a couple of new CPU IDs, online/offline rework in intel_idle, fixes and cleanups), in the generic power domains framework (mostly related to supporting power domains containing CPUs), and in the Operating Performance Points (OPP) library (mostly related to supporting devices with multiple voltage regulators) In addition to that, the system sleep state selection interface is modified to make it easier for distributions with unchanged user space to support suspend-to-idle as the default system suspend method, some issues are fixed in the PM core, the latency tolerance PM QoS framework is improved a bit, the Intel RAPL power capping driver is cleaned up and there are some fixes and cleanups in the devfreq subsystem Specifics: - New cpufreq driver for Broadcom STB SoCs and a Device Tree binding for it (Markus Mayer) - Support for ARM Integrator/AP and Integrator/CP in the generic DT cpufreq driver and elimination of the old Integrator cpufreq driver (Linus Walleij) - Support for the zx296718, r8a7743 and r8a7745, Socionext UniPhier, and PXA SoCs in the the generic DT cpufreq driver (Baoyou Xie, Geert Uytterhoeven, Masahiro Yamada, Robert Jarzmik) - cpufreq core fix to eliminate races that may lead to using inactive policy objects and related cleanups (Rafael Wysocki) - cpufreq schedutil governor update to make it use SCHED_FIFO kernel threads (instead of regular workqueues) for doing delayed work (to reduce the response latency in some cases) and related cleanups (Viresh Kumar) - New cpufreq sysfs attribute for resetting statistics (Markus Mayer) - cpufreq governors fixes and cleanups (Chen Yu, Stratos Karafotis, Viresh Kumar) - Support for using generic cpufreq governors in the intel_pstate driver (Rafael Wysocki) - Support for per-logical-CPU P-state limits and the EPP/EPB (Energy Performance Preference/Energy Performance Bias) knobs in the intel_pstate driver (Srinivas Pandruvada) - New CPU ID for Knights Mill in intel_pstate (Piotr Luc) - intel_pstate driver modification to use the P-state selection algorithm based on CPU load on platforms with the system profile in the ACPI tables set to "mobile" (Srinivas Pandruvada) - intel_pstate driver cleanups (Arnd Bergmann, Rafael Wysocki, Srinivas Pandruvada) - cpufreq powernv driver updates including fast switching support (for the schedutil governor), fixes and cleanus (Akshay Adiga, Andrew Donnellan, Denis Kirjanov) - acpi-cpufreq driver rework to switch it over to the new CPU offline/online state machine (Sebastian Andrzej Siewior) - Assorted cleanups in cpufreq drivers (Wei Yongjun, Prashanth Prakash) - Idle injection rework (to make it use the regular idle path instead of a home-grown custom one) and related powerclamp thermal driver updates (Peter Zijlstra, Jacob Pan, Petr Mladek, Sebastian Andrzej Siewior) - New CPU IDs for Atom Z34xx and Knights Mill in intel_idle (Andy Shevchenko, Piotr Luc) - intel_idle driver cleanups and switch over to using the new CPU offline/online state machine (Anna-Maria Gleixner, Sebastian Andrzej Siewior) - cpuidle DT driver update to support suspend-to-idle properly (Sudeep Holla) - cpuidle core cleanups and misc updates (Daniel Lezcano, Pan Bian, Rafael Wysocki) - Preliminary support for power domains including CPUs in the generic power domains (genpd) framework and related DT bindings (Lina Iyer) - Assorted fixes and cleanups in the generic power domains (genpd) framework (Colin Ian King, Dan Carpenter, Geert Uytterhoeven) - Preliminary support for devices with multiple voltage regulators and related fixes and cleanups in the Operating Performance Points (OPP) library (Viresh Kumar, Masahiro Yamada, Stephen Boyd) - System sleep state selection interface rework to make it easier to support suspend-to-idle as the default system suspend method (Rafael Wysocki) - PM core fixes and cleanups, mostly related to the interactions between the system suspend and runtime PM frameworks (Ulf Hansson, Sahitya Tummala, Tony Lindgren) - Latency tolerance PM QoS framework imorovements (Andrew Lutomirski) - New Knights Mill CPU ID for the Intel RAPL power capping driver (Piotr Luc) - Intel RAPL power capping driver fixes, cleanups and switch over to using the new CPU offline/online state machine (Jacob Pan, Thomas Gleixner, Sebastian Andrzej Siewior) - Fixes and cleanups in the exynos-ppmu, exynos-nocp, rk3399_dmc, rockchip-dfi devfreq drivers and the devfreq core (Axel Lin, Chanwoo Choi, Javier Martinez Canillas, MyungJoo Ham, Viresh Kumar) - Fix for false-positive KASAN warnings during resume from ACPI S3 (suspend-to-RAM) on x86 (Josh Poimboeuf) - Memory map verification during resume from hibernation on x86 to ensure a consistent address space layout (Chen Yu) - Wakeup sources debugging enhancement (Xing Wei) - rockchip-io AVS driver cleanup (Shawn Lin)" * tag 'pm-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (127 commits) devfreq: rk3399_dmc: Don't use OPP structures outside of RCU locks devfreq: rk3399_dmc: Remove dangling rcu_read_unlock() devfreq: exynos: Don't use OPP structures outside of RCU locks Documentation: intel_pstate: Document HWP energy/performance hints cpufreq: intel_pstate: Support for energy performance hints with HWP cpufreq: intel_pstate: Add locking around HWP requests PM / sleep: Print active wakeup sources when blocking on wakeup_count reads PM / core: Fix bug in the error handling of async suspend PM / wakeirq: Fix dedicated wakeirq for drivers not using autosuspend PM / Domains: Fix compatible for domain idle state PM / OPP: Don't WARN on multiple calls to dev_pm_opp_set_regulators() PM / OPP: Allow platform specific custom set_opp() callbacks PM / OPP: Separate out _generic_set_opp() PM / OPP: Add infrastructure to manage multiple regulators PM / OPP: Pass struct dev_pm_opp_supply to _set_opp_voltage() PM / OPP: Manage supply's voltage/current in a separate structure PM / OPP: Don't use OPP structure outside of rcu protected section PM / OPP: Reword binding supporting multiple regulators per device PM / OPP: Fix incorrect cpu-supply property in binding cpuidle: Add a kerneldoc comment to cpuidle_use_deepest_state() ..
2016-12-13Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-blockLinus Torvalds3-4/+3
Pull block layer updates from Jens Axboe: "This is the main block pull request this series. Contrary to previous release, I've kept the core and driver changes in the same branch. We always ended up having dependencies between the two for obvious reasons, so makes more sense to keep them together. That said, I'll probably try and keep more topical branches going forward, especially for cycles that end up being as busy as this one. The major parts of this pull request is: - Improved support for O_DIRECT on block devices, with a small private implementation instead of using the pig that is fs/direct-io.c. From Christoph. - Request completion tracking in a scalable fashion. This is utilized by two components in this pull, the new hybrid polling and the writeback queue throttling code. - Improved support for polling with O_DIRECT, adding a hybrid mode that combines pure polling with an initial sleep. From me. - Support for automatic throttling of writeback queues on the block side. This uses feedback from the device completion latencies to scale the queue on the block side up or down. From me. - Support from SMR drives in the block layer and for SD. From Hannes and Shaun. - Multi-connection support for nbd. From Josef. - Cleanup of request and bio flags, so we have a clear split between which are bio (or rq) private, and which ones are shared. From Christoph. - A set of patches from Bart, that improve how we handle queue stopping and starting in blk-mq. - Support for WRITE_ZEROES from Chaitanya. - Lightnvm updates from Javier/Matias. - Supoort for FC for the nvme-over-fabrics code. From James Smart. - A bunch of fixes from a whole slew of people, too many to name here" * 'for-4.10/block' of git://git.kernel.dk/linux-block: (182 commits) blk-stat: fix a few cases of missing batch flushing blk-flush: run the queue when inserting blk-mq flush elevator: make the rqhash helpers exported blk-mq: abstract out blk_mq_dispatch_rq_list() helper blk-mq: add blk_mq_start_stopped_hw_queue() block: improve handling of the magic discard payload blk-wbt: don't throttle discard or write zeroes nbd: use dev_err_ratelimited in io path nbd: reset the setup task for NBD_CLEAR_SOCK nvme-fabrics: Add FC LLDD loopback driver to test FC-NVME nvme-fabrics: Add target support for FC transport nvme-fabrics: Add host support for FC transport nvme-fabrics: Add FC transport LLDD api definitions nvme-fabrics: Add FC transport FC-NVME definitions nvme-fabrics: Add FC transport error codes to nvme.h Add type 0x28 NVME type code to scsi fc headers nvme-fabrics: patch target code in prep for FC transport support nvme-fabrics: set sqe.command_id in core not transports parser: add u64 number parser nvme-rdma: align to generic ib_event logging helper ...
2016-12-13Merge branch 'akpm' (patches from Andrew)Linus Torvalds33-583/+727
Merge updates from Andrew Morton: - various misc bits - most of MM (quite a lot of MM material is awaiting the merge of linux-next dependencies) - kasan - printk updates - procfs updates - MAINTAINERS - /lib updates - checkpatch updates * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits) init: reduce rootwait polling interval time to 5ms binfmt_elf: use vmalloc() for allocation of vma_filesz checkpatch: don't emit unified-diff error for rename-only patches checkpatch: don't check c99 types like uint8_t under tools checkpatch: avoid multiple line dereferences checkpatch: don't check .pl files, improve absolute path commit log test scripts/checkpatch.pl: fix spelling checkpatch: don't try to get maintained status when --no-tree is given lib/ida: document locking requirements a bit better lib/rbtree.c: fix typo in comment of ____rb_erase_color lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM MAINTAINERS: add drm and drm/i915 irc channels MAINTAINERS: add "C:" for URI for chat where developers hang out MAINTAINERS: add drm and drm/i915 bug filing info MAINTAINERS: add "B:" for URI where to file bugs get_maintainer: look for arbitrary letter prefixes in sections printk: add Kconfig option to set default console loglevel printk/sound: handle more message headers printk/btrfs: handle more message headers printk/kdb: handle more message headers ...
2016-12-13Merge branch 'smp-hotplug-for-linus' of ↵Linus Torvalds7-294/+172
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull smp hotplug updates from Thomas Gleixner: "This is the final round of converting the notifier mess to the state machine. The removal of the notifiers and the related infrastructure will happen around rc1, as there are conversions outstanding in other trees. The whole exercise removed about 2000 lines of code in total and in course of the conversion several dozen bugs got fixed. The new mechanism allows to test almost every hotplug step standalone, so usage sites can exercise all transitions extensively. There is more room for improvement, like integrating all the pointlessly different architecture mechanisms of synchronizing, setting cpus online etc into the core code" * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits) tracing/rb: Init the CPU mask on allocation soc/fsl/qbman: Convert to hotplug state machine soc/fsl/qbman: Convert to hotplug state machine zram: Convert to hotplug state machine KVM/PPC/Book3S HV: Convert to hotplug state machine arm64/cpuinfo: Convert to hotplug state machine arm64/cpuinfo: Make hotplug notifier symmetric mm/compaction: Convert to hotplug state machine iommu/vt-d: Convert to hotplug state machine mm/zswap: Convert pool to hotplug state machine mm/zswap: Convert dst-mem to hotplug state machine mm/zsmalloc: Convert to hotplug state machine mm/vmstat: Convert to hotplug state machine mm/vmstat: Avoid on each online CPU loops mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead() tracing/rb: Convert to hotplug state machine oprofile/nmi timer: Convert to hotplug state machine net/iucv: Use explicit clean up labels in iucv_init() x86/pci/amd-bus: Convert to hotplug state machine x86/oprofile/nmi: Convert to hotplug state machine ...
2016-12-13mm/percpu.c: fix panic triggered by BUG_ON() falselyzijun_hu1-4/+12
As shown by pcpu_build_alloc_info(), the number of units within a percpu group is deduced by rounding up the number of CPUs within the group to @upa boundary/ Therefore, the number of CPUs isn't equal to the units's if it isn't aligned to @upa normally. However, pcpu_page_first_chunk() uses BUG_ON() to assert that one number is equal to the other roughly, so a panic is maybe triggered by the BUG_ON() incorrectly. In order to fix this issue, the number of CPUs is rounded up then compared with units's and the BUG_ON() is replaced with a warning and return of an error code as well, to keep system alive as much as possible. Link: http://lkml.kernel.org/r/57FCF07C.2020103@zoho.com Signed-off-by: zijun_hu <zijun_hu@htc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13kasan: eliminate long stalls during quarantine reductionDmitry Vyukov1-46/+48
Currently we dedicate 1/32 of RAM for quarantine and then reduce it by 1/4 of total quarantine size. This can be a significant amount of memory. For example, with 4GB of RAM total quarantine size is 128MB and it is reduced by 32MB at a time. With 128GB of RAM total quarantine size is 4GB and it is reduced by 1GB. This leads to several problems: - freeing 1GB can take tens of seconds, causes rcu stall warnings and just introduces unexpected long delays at random places - if kmalloc() is called under a mutex, other threads stall on that mutex while a thread reduces quarantine - threads wait on quarantine_lock while one thread grabs a large batch of objects to evict - we walk the uncached list of object to free twice which makes all of the above worse - when a thread frees objects, they are already not accounted against global_quarantine.bytes; as the result we can have quarantine_size bytes in quarantine + unbounded amount of memory in large batches in threads that are in process of freeing Reduce size of quarantine in smaller batches to reduce the delays. The only reason to reduce it in batches is amortization of overheads, the new batch size of 1MB should be well enough to amortize spinlock lock/unlock and few function calls. Plus organize quarantine as a FIFO array of batches. This allows to not walk the list in quarantine_reduce() under quarantine_lock, which in turn reduces contention and is just faster. This improves performance of heavy load (syzkaller fuzzing) by ~20% with 4 CPUs and 32GB of RAM. Also this eliminates frequent (every 5 sec) drops of CPU consumption from ~400% to ~100% (one thread reduces quarantine while others are waiting on a mutex). Some reference numbers: 1. Machine with 4 CPUs and 4GB of memory. Quarantine size 128MB. Currently we free 32MB at at time. With new code we free 1MB at a time (1024 batches, ~128 are used). 2. Machine with 32 CPUs and 128GB of memory. Quarantine size 4GB. Currently we free 1GB at at time. With new code we free 8MB at a time (1024 batches, ~512 are used). 3. Machine with 4096 CPUs and 1TB of memory. Quarantine size 32GB. Currently we free 8GB at at time. With new code we free 4MB at a time (16K batches, ~8K are used). Link: http://lkml.kernel.org/r/1478756952-18695-1-git-send-email-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13kasan: support panic_on_warnDmitry Vyukov1-0/+2
If user sets panic_on_warn, he wants kernel to panic if there is anything barely wrong with the kernel. KASAN-detected errors are definitely not less benign than an arbitrary kernel WARNING. Panic after KASAN errors if panic_on_warn is set. We use this for continuous fuzzing where we want kernel to stop and reboot on any error. Link: http://lkml.kernel.org/r/1476694764-31986-1-git-send-email-dvyukov@google.com Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: make transparent hugepage size publicHugh Dickins1-0/+10
Test programs want to know the size of a transparent hugepage. While it is commonly the same as the size of a hugetlbfs page (shown as Hugepagesize in /proc/meminfo), that is not always so: powerpc implements transparent hugepages in a different way from hugetlbfs pages, so it's coincidence when their sizes are the same; and x86 and others can support more than one hugetlbfs page size. Add /sys/kernel/mm/transparent_hugepage/hpage_pmd_size to show the THP size in bytes - it's the same for Anonymous and Shmem hugepages. Call it hpage_pmd_size (after HPAGE_PMD_SIZE) rather than hpage_size, in case some transparent support for pud and pgd pages is added later. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052200290.13021@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: add three more cond_resched() in swapoffHugh Dickins1-7/+6
Add a cond_resched() in the unuse_pmd_range() loop (so as to call it even when pmd none or trans_huge, like zap_pmd_range() does); and in the unuse_mm() loop (since that might skip over many vmas). shmem_unuse() and radix_tree_locate_item() look good enough already. Those were the obvious places, but in fact the stalls came from find_next_to_unuse(), which sometimes scans through many unused entries. Apply scan_swap_map()'s LATENCY_LIMIT of 256 there too; and only go off to test frontswap_map when a used entry is found. Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052155140.13021@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Eric Dumazet <edumazet@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm, page_alloc: keep pcp count and list contents in sync if struct page is ↵Mel Gorman1-2/+10
corrupted Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") will allow the per-cpu list counter to be out of sync with the per-cpu list contents if a struct page is corrupted. The consequence is an infinite loop if the per-cpu lists get fully drained by free_pcppages_bulk because all the lists are empty but the count is positive. The infinite loop occurs here do { batch_free++; if (++migratetype == MIGRATE_PCPTYPES) migratetype = 0; list = &pcp->lists[migratetype]; } while (list_empty(list)); What the user sees is a bad page warning followed by a soft lockup with interrupts disabled in free_pcppages_bulk(). This patch keeps the accounting in sync. Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") Link: http://lkml.kernel.org/r/20161202112951.23346-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm, rmap: handle anon_vma_prepare() common case inlineVlastimil Babka1-35/+34
anon_vma_prepare() is mostly a large "if (unlikely(...))" block, as the expected common case is that an anon_vma already exists. We could turn the condition around and return 0, but it also makes sense to do it inline and avoid a call for the common case. Bloat-o-meter naturally shows that inlining the check has some code size costs: add/remove: 1/1 grow/shrink: 4/0 up/down: 475/-373 (102) function old new delta __anon_vma_prepare - 359 +359 handle_mm_fault 2744 2796 +52 hugetlb_cow 1146 1170 +24 hugetlb_fault 2123 2145 +22 wp_page_copy 1469 1487 +18 anon_vma_prepare 373 - -373 Checking the asm however confirms that the hot paths now avoid a call, which is moved away. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20161116074005.22768-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm, debug: print raw struct page data in __dump_page()Vlastimil Babka1-0/+4
__dump_page() is used when a page metadata inconsistency is detected, either by standard runtime checks, or extra checks in CONFIG_DEBUG_VM builds. It prints some of the relevant metadata, but not the whole struct page, which is based on unions and interpretation is dependent on the context. This means that sometimes e.g. a VM_BUG_ON_PAGE() checks certain field, which is however not printed by __dump_page() and the resulting bug report may then lack clues that could help in determining the root cause. This patch solves the problem by simply printing the whole struct page word by word, so no part is missing, but the interpretation of the data is left to developers. This is similar to e.g. x86_64 raw stack dumps. Example output: page:ffffea00000475c0 count:1 mapcount:0 mapping: (null) index:0x0 flags: 0x100000000000400(reserved) raw: 0100000000000400 0000000000000000 0000000000000000 00000001ffffffff raw: ffffea00000475e0 ffffea00000475e0 0000000000000000 0000000000000000 page dumped because: VM_BUG_ON_PAGE(1) [aryabinin@virtuozzo.com: suggested print_hex_dump()] Link: http://lkml.kernel.org/r/2ff83214-70fe-741e-bf05-fe4a4073ec3e@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: THP page cache support for ppc64Aneesh Kumar K.V4-17/+87
Add arch specific callback in the generic THP page cache code that will deposit and withdarw preallocated page table. Archs like ppc64 use this preallocated table to store the hash pte slot information. Testing: kernel build of the patch series on tmpfs mounted with option huge=always The related thp stat: thp_fault_alloc 72939 thp_fault_fallback 60547 thp_collapse_alloc 603 thp_collapse_alloc_failed 0 thp_file_alloc 253763 thp_file_mapped 4251 thp_split_page 51518 thp_split_page_failed 1 thp_deferred_split_page 73566 thp_split_pmd 665 thp_zero_page_alloc 3 thp_zero_page_alloc_failed 0 [akpm@linux-foundation.org: remove unneeded parentheses, per Kirill] Link: http://lkml.kernel.org/r/20161113150025.17942-2-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Neuling <mikey@neuling.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: move vma_is_anonymous check within pmd_move_must_withdrawAneesh Kumar K.V1-2/+16
Independent of whether the vma is for anonymous memory, some arches like ppc64 would like to override pmd_move_must_withdraw(). One option is to encapsulate the vma_is_anonymous() check for general architectures inside pmd_move_must_withdraw() so that is always called and architectures that need unconditional overriding can override this function. ppc64 needs to override the function when the MMU is configured to use hash PTE's. [bsingharora@gmail.com: reworked changelog] Link: http://lkml.kernel.org/r/20161113150025.17942-1-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Neuling <mikey@neuling.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: add preempt points into __purge_vmap_area_lazy()Joel Fernandes1-5/+9
Use cond_resched_lock to avoid holding the vmap_area_lock for a potentially long time and thus creating bad latencies for various workloads. [hch: split from a larger patch by Joel, wrote the crappy changelog] Link: http://lkml.kernel.org/r/1479474236-4139-11-git-send-email-hch@lst.de Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Jisheng Zhang <jszhang@marvell.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: John Dias <joaodias@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: turn vmap_purge_lock into a mutexChristoph Hellwig1-7/+7
The purge_lock spinlock causes high latencies with non RT kernel. This has been reported multiple times on lkml [1] [2] and affects applications like audio. This patch replaces it with a mutex to allow preemption while holding the lock. Thanks to Joel Fernandes for the detailed report and analysis as well as an earlier attempt at fixing this issue. [1] http://lists.openwall.net/linux-kernel/2016/03/23/29 [2] https://lkml.org/lkml/2016/10/9/59 Link: http://lkml.kernel.org/r/1479474236-4139-10-git-send-email-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Jisheng Zhang <jszhang@marvell.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: John Dias <joaodias@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: mark all calls into the vmalloc subsystem as potentially sleepingChristoph Hellwig1-1/+6
We will take a sleeping lock in later in this series, so this adds the proper safeguards. Link: http://lkml.kernel.org/r/1479474236-4139-9-git-send-email-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Jisheng Zhang <jszhang@marvell.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: John Dias <joaodias@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: add vfree_atomic()Andrey Ryabinin1-6/+36
We are going to use sleeping lock for freeing vmap. However some vfree() users want to free memory from atomic (but not from interrupt) context. For this we add vfree_atomic() - deferred variation of vfree() which can be used in any atomic context (except NMIs). [akpm@linux-foundation.org: tweak comment grammar] [aryabinin@virtuozzo.com: use raw_cpu_ptr() instead of this_cpu_ptr()] Link: http://lkml.kernel.org/r/1481553981-3856-1-git-send-email-aryabinin@virtuozzo.com Link: http://lkml.kernel.org/r/1479474236-4139-5-git-send-email-hch@lst.de Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Joel Fernandes <joelaf@google.com> Cc: Jisheng Zhang <jszhang@marvell.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: John Dias <joaodias@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: refactor __purge_vmap_area_lazy()Christoph Hellwig1-45/+35
Move the purge_lock synchronization to the callers, move the call to purge_fragmented_blocks_allcpus at the beginning of the function to the callers that need it, move the force_flush behavior to the caller that needs it, and pass start and end by value instead of by reference. No change in behavior. Link: http://lkml.kernel.org/r/1479474236-4139-4-git-send-email-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Jisheng Zhang <jszhang@marvell.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: John Dias <joaodias@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: remove free_unmap_vmap_area_addr()Christoph Hellwig1-13/+8
Just inline it into the only caller. Link: http://lkml.kernel.org/r/1479474236-4139-3-git-send-email-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Jisheng Zhang <jszhang@marvell.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: John Dias <joaodias@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: remove free_unmap_vmap_area_noflush()Christoph Hellwig1-11/+2
Patch series "reduce latency in __purge_vmap_area_lazy", v2. This patch (of 10): Sort out the long lock hold times in __purge_vmap_area_lazy. It is based on a patch from Joel. Inline free_unmap_vmap_area_noflush() it into the only caller. Link: http://lkml.kernel.org/r/1479474236-4139-2-git-send-email-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Jisheng Zhang <jszhang@marvell.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: John Dias <joaodias@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: workingset: update shadow limit to reflect bigger active listJohannes Weiner1-19/+25
Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") the size of the active file list is no longer limited to half of memory. Increase the shadow node limit accordingly to avoid throwing out shadow entries that might still result in eligible refaults. The exact size of the active list now depends on the overall size of the page cache, but converges toward taking up most of the space: In mm/vmscan.c::inactive_list_is_low(), * total target max * memory ratio inactive * ------------------------------------- * 10MB 1 5MB * 100MB 1 50MB * 1GB 3 250MB * 10GB 10 0.9GB * 100GB 31 3GB * 1TB 101 10GB * 10TB 320 32GB It would be possible to apply the same precise ratios when determining the limit for radix tree nodes containing shadow entries, but since it is merely an approximation of the oldest refault distances in the wild and the code also makes assumptions about the node population density, keep it simple and always target the full cache size. While at it, clarify the comment and the formula for memory footprint. Link: http://lkml.kernel.org/r/20161117214701.29000-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: workingset: restore refault tracking for single-page filesJohannes Weiner1-8/+1
Shadow entries in the page cache used to be accounted behind the radix tree implementation's back in the upper bits of node->count, and the radix tree code extending a single-entry tree with a shadow entry in root->rnode would corrupt that counter. As a result, we could not put shadow entries at index 0 if the tree didn't have any other entries, and that means no refault detection for any single-page file. Now that the shadow entries are tracked natively in the radix tree's exceptional counter, this is no longer necessary. Extending and shrinking the tree from and to single entries in root->rnode now does the right thing when the entry is exceptional, remove that limitation. Link: http://lkml.kernel.org/r/20161117193244.GF23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: workingset: move shadow entry tracking to radix tree exceptional trackingJohannes Weiner3-80/+51
Currently, we track the shadow entries in the page cache in the upper bits of the radix_tree_node->count, behind the back of the radix tree implementation. Because the radix tree code has no awareness of them, we rely on random subtleties throughout the implementation (such as the node->count != 1 check in the shrinking code, which is meant to exclude multi-entry nodes but also happens to skip nodes with only one shadow entry, as that's accounted in the upper bits). This is error prone and has, in fact, caused the bug fixed in d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). To remove these subtleties, this patch moves shadow entry tracking from the upper bits of node->count to the existing counter for exceptional entries. node->count goes back to being a simple counter of valid entries in the tree node and can be shrunk to a single byte. This vastly simplifies the page cache code. All accounting happens natively inside the radix tree implementation, and maintaining the LRU linkage of shadow nodes is consolidated into a single function in the workingset code that is called for leaf nodes affected by a change in the page cache tree. This also removes the last user of the __radix_delete_node() return value. Eliminate it. Link: http://lkml.kernel.org/r/20161117193211.GE23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13lib: radix-tree: update callback for changing leaf nodesJohannes Weiner1-1/+2
Support handing __radix_tree_replace() a callback that gets invoked for all leaf nodes that change or get freed as a result of the slot replacement, to assist users tracking nodes with node->private_list. This prepares for putting page cache shadow entries into the radix tree root again and drastically simplifying the shadow tracking. Link: http://lkml.kernel.org/r/20161117193134.GD23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13lib: radix-tree: check accounting of existing slot replacement usersJohannes Weiner4-7/+8
The bug in khugepaged fixed earlier in this series shows that radix tree slot replacement is fragile; and it will become more so when not only NULL<->!NULL transitions need to be caught but transitions from and to exceptional entries as well. We need checks. Re-implement radix_tree_replace_slot() on top of the sanity-checked __radix_tree_replace(). This requires existing callers to also pass the radix tree root, but it'll warn us when somebody replaces slots with contents that need proper accounting (transitions between NULL entries, real entries, exceptional entries) and where a replacement through the slot pointer would corrupt the radix tree node counts. Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13lib: radix-tree: native accounting of exceptional entriesJohannes Weiner1-4/+4
The way the page cache is sneaking shadow entries of evicted pages into the radix tree past the node entry accounting and tracking them manually in the upper bits of node->count is fraught with problems. These shadow entries are marked in the tree as exceptional entries, which are a native concept to the radix tree. Maintain an explicit counter of exceptional entries in the radix tree node. Subsequent patches will switch shadow entry tracking over to that counter. DAX and shmem are the other users of exceptional entries. Since slot replacements that change the entry type from regular to exceptional must now be accounted, introduce a __radix_tree_replace() function that does replacement and accounting, and switch DAX and shmem over. The increase in radix tree node size is temporary. A followup patch switches the shadow tracking to this new scheme and we'll no longer need the upper bits in node->count and shrink that back to one byte. Link: http://lkml.kernel.org/r/20161117192945.GA23430@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: workingset: turn shadow node shrinker bugs into warningsJohannes Weiner1-8/+12
When the shadow page shrinker tries to reclaim a radix tree node but finds it in an unexpected state - it should contain no pages, and non-zero shadow entries - there is no need to kill the executing task or even the entire system. Warn about the invalid state, then leave that tree node be. Simply don't put it back on the shadow LRU for future reclaim and move on. Link: http://lkml.kernel.org/r/20161117191138.22769-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: khugepaged: fix radix tree node leak in shmem collapse error pathJohannes Weiner1-2/+4
The radix tree counts valid entries in each tree node. Entries stored in the tree cannot be removed by simpling storing NULL in the slot or the internal counters will be off and the node never gets freed again. When collapsing a shmem page fails, restore the holes that were filled with radix_tree_insert() with a proper radix tree deletion. Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Link: http://lkml.kernel.org/r/20161117191138.22769-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Jan Kara <jack@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: khugepaged: close use-after-free race during shmem collapsingJohannes Weiner1-0/+5
Patch series "mm: workingset: radix tree subtleties & single-page file refaults", v3. This is another revision of the radix tree / workingset patches based on feedback from Jan and Kirill. This is a follow-up to d3798ae8c6f3 ("mm: filemap: don't plant shadow entries without radix tree node"). That patch fixed an issue that was caused mainly by the page cache sneaking special shadow page entries into the radix tree and relying on subtleties in the radix tree code to make that work. The fix also had to stop tracking refaults for single-page files because shadow pages stored as direct pointers in radix_tree_root->rnode weren't properly handled during tree extension. These patches make the radix tree code explicitely support and track such special entries, to eliminate the subtleties and to restore the thrash detection for single-page files. This patch (of 9): When a radix tree iteration drops the tree lock, another thread might swoop in and free the node holding the current slot. The iteration needs to do another tree lookup from the current index to continue. [kirill.shutemov@linux.intel.com: re-lookup for replacement] Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Link: http://lkml.kernel.org/r/20161117191138.22769-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: don't cap request size based on read-ahead settingJens Axboe1-11/+28
We ran into a funky issue, where someone doing 256K buffered reads saw 128K requests at the device level. Turns out it is read-ahead capping the request size, since we use 128K as the default setting. This doesn't make a lot of sense - if someone is issuing 256K reads, they should see 256K reads, regardless of the read-ahead setting, if the underlying device can support a 256K read in a single command. This patch introduces a bdi hint, io_pages. This is the soft max IO size for the lower level, I've hooked it up to the bdev settings here. Read-ahead is modified to issue the maximum of the user request size, and the read-ahead max size, but capped to the max request size on the device side. The latter is done to avoid reading ahead too much, if the application asks for a huge read. With this patch, the kernel behaves like the application expects. Link: http://lkml.kernel.org/r/1479498073-8657-1-git-send-email-axboe@fb.com Signed-off-by: Jens Axboe <axboe@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13shmem: fix compilation warnings on unused functionsJérémy Lefaure1-0/+2
Compiling shmem.c with SHMEM and TRANSAPRENT_HUGE_PAGECACHE enabled raises warnings on two unused functions when CONFIG_TMPFS and CONFIG_SYSFS are both disabled: mm/shmem.c:390:20: warning: `shmem_format_huge' defined but not used [-Wunused-function] static const char *shmem_format_huge(int huge) ^~~~~~~~~~~~~~~~~ mm/shmem.c:373:12: warning: `shmem_parse_huge' defined but not used [-Wunused-function] static int shmem_parse_huge(const char *str) ^~~~~~~~~~~~~~~~ A conditional compilation on tmpfs or sysfs removes the warnings. Link: http://lkml.kernel.org/r/20161118055749.11313-1-jeremy.lefaure@lse.epita.fr Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm/filemap.c: add comment for confusing logic in page_cache_tree_insert()Kirill A. Shutemov1-1/+4
Unlike THP, hugetlb pages are represented by one entry in the radix-tree. [akpm@linux-foundation.org: tweak comment] Link: http://lkml.kernel.org/r/20161110163640.126124-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm/pkeys: generate pkey system call code only if ARCH_HAS_PKEYS is selectedHeiko Carstens1-0/+4
Having code for the pkey_mprotect, pkey_alloc and pkey_free system calls makes only sense if ARCH_HAS_PKEYS is selected. If not selected these system calls will always return -ENOSPC or -EINVAL. To simplify things and have less code generate the pkey system call code only if ARCH_HAS_PKEYS is selected. For architectures which have already wired up the system calls, but do not select ARCH_HAS_PKEYS this will result in less generated code and a different return code: the three system calls will now always return -ENOSYS, using the cond_syscall mechanism. For architectures which have not wired up the system calls less unreachable code will be generated. Link: http://lkml.kernel.org/r/20161114111251.70084-1-heiko.carstens@de.ibm.com Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13of/fdt: mark hotpluggable memoryReza Arbab1-1/+1
When movable nodes are enabled, any node containing only hotpluggable memory is made movable at boot time. On x86, hotpluggable memory is discovered by parsing the ACPI SRAT, making corresponding calls to memblock_mark_hotplug(). If we introduce a dt property to describe memory as hotpluggable, configs supporting early fdt may then also do this marking and use movable nodes. Link: http://lkml.kernel.org/r/1479160961-25840-5-git-send-email-arbab@linux.vnet.ibm.com Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Tested-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alistair Popple <apopple@au1.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Stewart Smith <stewart@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: enable CONFIG_MOVABLE_NODE on non-x86 archesReza Arbab1-1/+1
To support movable memory nodes (CONFIG_MOVABLE_NODE), at least one of the following must be true: 1. This config has the capability to identify movable nodes at boot. Right now, only x86 can do this. 2. Our config supports memory hotplug, which means that a movable node can be created by hotplugging all of its memory into ZONE_MOVABLE. Fix the Kconfig definition of CONFIG_MOVABLE_NODE, which currently recognizes (1), but not (2). Link: http://lkml.kernel.org/r/1479160961-25840-4-git-send-email-arbab@linux.vnet.ibm.com Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alistair Popple <apopple@au1.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Stewart Smith <stewart@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: remove x86-only restriction of movable_nodeReza Arbab1-20/+0
In commit c5320926e370 ("mem-hotplug: introduce movable_node boot option"), the memblock allocation direction is changed to bottom-up and then back to top-down like this: 1. memblock_set_bottom_up(true), called by cmdline_parse_movable_node(). 2. memblock_set_bottom_up(false), called by x86's numa_init(). Even though (1) occurs in generic mm code, it is wrapped by #ifdef CONFIG_MOVABLE_NODE, which depends on X86_64. This means that when we extend CONFIG_MOVABLE_NODE to non-x86 arches, things will be unbalanced. (1) will happen for them, but (2) will not. This toggle was added in the first place because x86 has a delay between adding memblocks and marking them as hotpluggable. Since other arches do this marking either immediately or not at all, they do not require the bottom-up toggle. So, resolve things by moving (1) from cmdline_parse_movable_node() to x86's setup_arch(), immediately after the movable_node parameter has been parsed. Link: http://lkml.kernel.org/r/1479160961-25840-3-git-send-email-arbab@linux.vnet.ibm.com Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alistair Popple <apopple@au1.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Stewart Smith <stewart@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm/mempolicy.c: forbid static or relative flags for local NUMA modePiotr Kwapulinski1-1/+3
The MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES flags are irrelevant when setting them for MPOL_LOCAL NUMA memory policy via set_mempolicy or mbind. Return the "invalid argument" from set_mempolicy and mbind whenever any of these flags is passed along with MPOL_LOCAL. It is consistent with MPOL_PREFERRED passed with empty nodemask. It slightly shortens the execution time in paths where these flags are used e.g. when trying to rebind the NUMA nodes for changes in cgroups cpuset mems (mpol_rebind_preferred()) or when just printing the mempolicy structure (/proc/PID/numa_maps). Isolated tests done. Link: http://lkml.kernel.org/r/20161027163037.4089-1-kwapulinski.piotr@gmail.com Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Liang Chen <liangchen.linux@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Nathan Zimmer <nzimmer@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: fix up get_user_pages* commentsLorenzo Stoakes1-10/+6
In the previous round of get_user_pages* changes comments attached to __get_user_pages_unlocked() and get_user_pages_unlocked() were rendered incorrect, this patch corrects them. In addition the get_user_pages_unlocked() comment seems to have already been outdated as it referred to tsk, mm parameters which were removed in c12d2da5 ("mm/gup: Remove the macro overload API migration helpers from the get_user*() APIs"), this patch fixes this also. Link: http://lkml.kernel.org/r/20161025233435.5338-1-lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: remove the page size change check in tlb_remove_pageAneesh Kumar K.V1-15/+6
Now that we check for page size change early in the loop, we can partially revert e9d55e157034a ("mm: change the interface for __tlb_remove_page"). This simplies the code much, by removing the need to track the last address with which we adjusted the range. We also go back to the older way of filling the mmu_gather array, ie, we add an entry and then check whether the gather batch is full. Link: http://lkml.kernel.org/r/20161026084839.27299-6-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: add tlb_remove_check_page_size_change to track page size changeAneesh Kumar K.V4-1/+16
With commit e77b0852b551 ("mm/mmu_gather: track page size with mmu gather and force flush if page size change") we added the ability to force a tlb flush when the page size change in a mmu_gather loop. We did that by checking for a page size change every time we added a page to mmu_gather for lazy flush/remove. We can improve that by moving the page size change check early and not doing it every time we add a page. This also helps us to do tlb flush when invalidating a range covering dax mapping. Wrt dax mapping we don't have a backing struct page and hence we don't call tlb_remove_page, which earlier forced the tlb flush on page size change. Moving the page size change check earlier means we will do the same even for dax mapping. We also avoid doing this check on architecture other than powerpc. In a later patch we will remove page size check from tlb_remove_page(). Link: http://lkml.kernel.org/r/20161026084839.27299-5-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm/hugetlb: add tlb_remove_hugetlb_entry for handling hugetlb pagesAneesh Kumar K.V1-1/+1
This add tlb_remove_hugetlb_entry similar to tlb_remove_pmd_tlb_entry. Link: http://lkml.kernel.org/r/20161026084839.27299-4-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: use the correct page size when removing the pageAneesh Kumar K.V1-2/+2
We are removing a pmd hugepage here. Use the correct page size. Link: http://lkml.kernel.org/r/20161026084839.27299-2-aneesh.kumar@linux.vnet.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13shmem: avoid maybe-uninitialized warningArnd Bergmann1-3/+1
After enabling -Wmaybe-uninitialized warnings, we get a false-postive warning for shmem: mm/shmem.c: In function `shmem_getpage_gfp': include/linux/spinlock.h:332:21: error: `info' may be used uninitialized in this function [-Werror=maybe-uninitialized] This can be easily avoided, since the correct 'info' pointer is known at the time we first enter the function, so we can simply move the initialization up. Moving it before the first label avoids the warning and lets us remove two later initializations. Note that the function is so hard to read that it not only confuses the compiler, but also most readers and without this patch it could\ easily break if one of the 'goto's changed. Link: https://www.spinics.net/lists/kernel/msg2368133.html Link: http://lkml.kernel.org/r/20161024205725.786455-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm, compaction: fix NR_ISOLATED_* stats for pfn based migrationMing Ling2-26/+14
Since commit bda807d44454 ("mm: migrate: support non-lru movable page migration") isolate_migratepages_block) can isolate !PageLRU pages which would acct_isolated account as NR_ISOLATED_*. Accounting these non-lru pages NR_ISOLATED_{ANON,FILE} doesn't make any sense and it can misguide heuristics based on those counters such as pgdat_reclaimable_pages resp. too_many_isolated which would lead to unexpected stalls during the direct reclaim without any good reason. Note that __alloc_contig_migrate_range can isolate a lot of pages at once. On mobile devices such as 512M ram android Phone, it may use a big zram swap. In some cases zram(zsmalloc) uses too many non-lru but migratedable pages, such as: MemTotal: 468148 kB Normal free:5620kB Free swap:4736kB Total swap:409596kB ZRAM: 164616kB(zsmalloc non-lru pages) active_anon:60700kB inactive_anon:60744kB active_file:34420kB inactive_file:37532kB Fix this by only accounting lru pages to NR_ISOLATED_* in isolate_migratepages_block right after they were isolated and we still know they were on LRU. Drop acct_isolated because it is called after the fact and we've lost that information. Batching per-cpu counter doesn't make much improvement anyway. Also make sure that we uncharge only LRU pages when putting them back on the LRU in putback_movable_pages resp. when unmap_and_move migrates the page. [mhocko@suse.com: replace acct_isolated() with direct counting] Fixes: bda807d44454 ("mm: migrate: support non-lru movable page migration") Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org Signed-off-by: Ming Ling <ming.ling@spreadtrum.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Minchan Kim <minchan@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelistMichal Hocko1-16/+8
__GFP_THISNODE is documented to enforce the allocation to be satisified from the requested node with no fallbacks or placement policy enforcements. policy_zonelist seemingly breaks this semantic if the current policy is MPOL_MBIND and instead of taking the node it will fallback to the first node in the mask if the requested one is not in the mask. This is confusing to say the least because it fact we shouldn't ever go that path. First tasks shouldn't be scheduled on CPUs with nodes outside of their mempolicy binding. And secondly policy_zonelist is called only from 3 places: - huge_zonelist - never should do __GFP_THISNODE when going this path - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either - alloc_pages_current - which uses default_policy id __GFP_THISNODE is used So we shouldn't even need to care about this possibility and can drop the confusing code. Let's keep a WARN_ON_ONCE in place to catch potential users and fix them up properly (aka use a different allocation function which ignores mempolicy). [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20161013125958.32155-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm, thp: avoid unlikely branches for split_huge_pmdDavid Rientjes3-4/+4
While doing MADV_DONTNEED on a large area of thp memory, I noticed we encountered many unlikely() branches in profiles for each backing hugepage. This is because zap_pmd_range() would call split_huge_pmd(), which rechecked the conditions that were already validated, but as part of an unlikely() branch. Avoid the unlikely() branch when in a context where pmd is known to be good for __split_huge_pmd() directly. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1610181600300.84525@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm/vmalloc.c: simplify /proc/vmallocinfo implementationzijun_hu1-22/+5
Many seq_file helpers exist for simplifying implementation of virtual files especially, for /proc nodes. however, the helpers for iteration over list_head are available but aren't adopted to implement /proc/vmallocinfo currently. Simplify /proc/vmallocinfo implementation by using existing seq_file helpers. Link: http://lkml.kernel.org/r/57FDF2E5.1000201@zoho.com Signed-off-by: zijun_hu <zijun_hu@htc.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: make unreserve highatomic functions reliableMinchan Kim1-7/+17
Currently, unreserve_highatomic_pageblock bails out if it found highatomic pageblock regardless of really moving free pages from the one so that it could mitigate unreserve logic's goal which saves OOM of a process. This patch makes unreserve functions bail out only if it moves some pages out of !highatomic free list to avoid such false positive. Another potential problem is that by race between page freeing and reserve highatomic function, pages could be in highatomic free list even though the pageblock is !high atomic migratetype. In that case, unreserve_highatomic_pageblock can be void if count of highatomic reserve is less than pageblock_nr_pages. We could solve it simply via draining all of reserved pages before the OOM. It would have a safeguard role to exhuast reserved pages before converging to OOM. Link: http://lkml.kernel.org/r/1476259429-18279-5-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-13mm: try to exhaust highatomic reserve before the OOMMinchan Kim1-5/+10
I got OOM report from production team with v4.4 kernel. It had enough free memory but failed to allocate GFP_KERNEL order-0 page and finally encountered OOM kill. It occured during QA process which launches several apps, switching and so on. It happned rarely. IOW, In normal situation, it was not a problem but if we are unluck so that several apps uses peak memory at the same time, it can happen. If we manage to pass the phase, the system can go working well. I could reproduce it with my test(memory spike easily. Look at below. The reason is free pages(19M) of DMA32 zone are reserved for HIGHORDERATOMIC and doesn't unreserved before the OOM. balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0 balloon cpuset=/ mems_allowed=0 CPU: 1 PID: 8473 Comm: balloon Tainted: G W OE 4.8.0-rc7-00219-g3f74c9559583-dirty #3161 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x63/0x90 dump_header+0x5c/0x1ce oom_kill_process+0x22e/0x400 out_of_memory+0x1ac/0x210 __alloc_pages_nodemask+0x101e/0x1040 handle_mm_fault+0xa0a/0xbf0 __do_page_fault+0x1dd/0x4d0 trace_do_page_fault+0x43/0x130 do_async_page_fault+0x1a/0xa0 async_page_fault+0x28/0x30 Mem-Info: active_anon:383949 inactive_anon:106724 isolated_anon:0 active_file:15 inactive_file:44 isolated_file:0 unevictable:0 dirty:0 writeback:24 unstable:0 slab_reclaimable:2483 slab_unreclaimable:3326 mapped:0 shmem:0 pagetables:1906 bounce:0 free:6898 free_pcp:291 free_cma:0 Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 1952 1952 1952 DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB 51131 total pagecache pages 50795 pages in swap cache Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228 Free swap = 8kB Total swap = 255996kB 524158 pages RAM 0 pages HighMem/MovableOnly 12658 pages reserved 0 pages cma reserved 0 pages hwpoisoned Another example exceeded the limit by the race is in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK) CPU: 0 PID: 476 Comm: in:imklog Tainted: G E 4.8.0-rc7-00217-g266ef83c51e5-dirty #3135 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: dump_stack+0x63/0x90 warn_alloc_failed+0xdb/0x130 __alloc_pages_nodemask+0x4d6/0xdb0 new_slab+0x339/0x490 ___slab_alloc.constprop.74+0x367/0x480 __slab_alloc.constprop.73+0x20/0x40 __kmalloc+0x1a4/0x1e0 alloc_indirect.isra.14+0x1d/0x50 virtqueue_add_sgs+0x1c4/0x470 __virtblk_add_req+0xae/0x1f0 virtio_queue_rq+0x12d/0x290 __blk_mq_run_hw_queue+0x239/0x370 blk_mq_run_hw_queue+0x8f/0xb0 blk_mq_insert_requests+0x18c/0x1a0 blk_mq_flush_plug_list+0x125/0x140 blk_flush_plug_list+0xc7/0x220 blk_finish_plug+0x2c/0x40 __do_page_cache_readahead+0x196/0x230 filemap_fault+0x448/0x4f0 ext4_filemap_fault+0x36/0x50 __do_fault+0x75/0x140 handle_mm_fault+0x84d/0xbe0 __do_page_fault+0x1dd/0x4d0 trace_do_page_fault+0x43/0x130 do_async_page_fault+0x1a/0xa0 async_page_fault+0x28/0x30 Mem-Info: active_anon:363826 inactive_anon:121283 isolated_anon:32 active_file:65 inactive_file:152 isolated_file:0 unevictable:0 dirty:0 writeback:46 unstable:0 slab_reclaimable:2778 slab_unreclaimable:3070 mapped:112 shmem:0 pagetables:1822 bounce:0 free:9469 free_pcp:231 free_cma:0 Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB lowmem_reserve[]: 0 1952 1952 1952 DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB lowmem_reserve[]: 0 0 0 0 DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB 2775 total pagecache pages 2536 pages in swap cache Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077 Free swap = 108744kB Total swap = 255996kB 524158 pages RAM 0 pages HighMem/MovableOnly 12648 pages reserved 0 pages cma reserved 0 pages hwpoisoned It's weird to show that zone has enough free memory above min watermark but OOMed with 4K GFP_KERNEL allocation due to reserved highatomic pages. As last resort, try to unreserve highatomic pages again and if it has moved pages to non-highatmoc free list, retry reclaim once more. Link: http://lkml.kernel.org/r/1476259429-18279-4-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>