summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2015-09-09zsmalloc/zram: introduce zs_pool_stats apiSergey Senozhatsky1-14/+15
`zs_compact_control' accounts the number of migrated objects but it has a limited lifespan -- we lose it as soon as zs_compaction() returns back to zram. It worked fine, because (a) zram had it's own counter of migrated objects and (b) only zram could trigger compaction. However, this does not work for automatic pool compaction (not issued by zram). To account objects migrated during auto-compaction (issued by the shrinker) we need to store this number in zs_pool. Define a new `struct zs_pool_stats' structure to keep zs_pool's stats there. It provides only `num_migrated', as of this writing, but it surely can be extended. A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to caller. Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09zsmalloc: cosmetic compaction code adjustmentsSergey Senozhatsky1-6/+6
Change zs_object_copy() argument order to be (DST, SRC) rather than (SRC, DST). copy/move functions usually have (to, from) arguments order. Rename alloc_target_page() to isolate_target_page(). This function doesn't allocate anything, it isolates target page, pretty much like isolate_source_page(). Tweak __zs_compact() comment. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09zsmalloc: introduce zs_can_compact() functionSergey Senozhatsky1-0/+26
This function checks if class compaction will free any pages. Rephrasing -- do we have enough unused objects to form at least one ZS_EMPTY page and free it. It aborts compaction if class compaction will not result in any (further) savings. EXAMPLE (this debug output is not part of this patch set): - class size - number of allocated objects - number of used objects - max objects per zspage - pages per zspage - estimated number of pages that will be freed [..] class-512 objs:544 inuse:540 maxobj-per-zspage:8 pages-per-zspage:1 zspages-to-free:0 ... class-512 compaction is useless. break class-496 objs:660 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:2 class-496 objs:627 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:1 class-496 objs:594 inuse:570 maxobj-per-zspage:33 pages-per-zspage:4 zspages-to-free:0 ... class-496 compaction is useless. break class-448 objs:657 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:4 class-448 objs:648 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:3 class-448 objs:639 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:2 class-448 objs:630 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:1 class-448 objs:621 inuse:617 maxobj-per-zspage:9 pages-per-zspage:1 zspages-to-free:0 ... class-448 compaction is useless. break class-432 objs:728 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:1 class-432 objs:700 inuse:685 maxobj-per-zspage:28 pages-per-zspage:3 zspages-to-free:0 ... class-432 compaction is useless. break class-416 objs:819 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:2 class-416 objs:780 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:1 class-416 objs:741 inuse:705 maxobj-per-zspage:39 pages-per-zspage:4 zspages-to-free:0 ... class-416 compaction is useless. break class-400 objs:690 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:1 class-400 objs:680 inuse:674 maxobj-per-zspage:10 pages-per-zspage:1 zspages-to-free:0 ... class-400 compaction is useless. break class-384 objs:736 inuse:709 maxobj-per-zspage:32 pages-per-zspage:3 zspages-to-free:0 ... class-384 compaction is useless. break [..] Every "compaction is useless" indicates that we saved CPU cycles. class-512 has 544 object allocated 540 objects used 8 objects per-page Even if we have a ALMOST_EMPTY zspage, we still don't have enough room to migrate all of its objects and free this zspage; so compaction will not make a lot of sense, it's better to just leave it as is. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09zsmalloc: always keep per-class statsSergey Senozhatsky1-32/+8
Always account per-class `zs_size_stat' stats. This data will help us make better decisions during compaction. We are especially interested in OBJ_ALLOCATED and OBJ_USED, which can tell us if class compaction will result in any memory gain. For instance, we know the number of allocated objects in the class, the number of objects being used (so we also know how many objects are not used) and the number of objects per-page. So we can ensure if we have enough unused objects to form at least one ZS_EMPTY zspage during compaction. We calculate this value on per-class basis so we can calculate a total number of zspages that can be released. Which is exactly what a shrinker wants to know. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09zsmalloc: drop unused variable `nr_to_migrate'Sergey Senozhatsky1-4/+0
This patchset tweaks compaction and makes it possible to trigger pool compaction automatically when system is getting low on memory. zsmalloc in some cases can suffer from a notable fragmentation and compaction can release some considerable amount of memory. The problem here is that currently we fully rely on user space to perform compaction when needed. However, performing zsmalloc compaction is not always an obvious thing to do. For example, suppose we have a `idle' fragmented (compaction was never performed) zram device and system is getting low on memory due to some 3rd party user processes (gcc LTO, or firefox, etc.). It's quite unlikely that user space will issue zpool compaction in this case. Besides, user space cannot tell for sure how badly pool is fragmented; however, this info is known to zsmalloc and, hence, to a shrinker. This patch (of 7): __zs_compact() does not use `nr_to_migrate', drop it. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/memblock.c: fix comment in __next_mem_range()Alexander Kuleshov1-1/+1
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/page_alloc.c: fix type information of memoryless nodeZhen Lei1-1/+2
For a memoryless node, the output of get_pfn_range_for_nid are all zero. It will display mem from 0 to -1. Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09memory-hotplug: fix comments in zone_spanned_pages_in_node() and ↵Xishi Qiu1-2/+2
zone_spanned_pages_in_node() When hot adding a node from add_memory(), we will add memblock first, so the node is not empty. But when called from cpu_up(), the node should be empty. Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>\ Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/page_alloc.c: change sysctl_lower_zone_reserve_ratio to ↵Yaowei Bai1-2/+2
sysctl_lowmem_reserve_ratio in comments We use sysctl_lowmem_reserve_ratio rather than sysctl_lower_zone_reserve_ratio to determine how aggressive the kernel is in defending lowmem from the possibility of being captured into pinned user memory. To avoid misleading, correct it in some comments. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/page_alloc.c: fix a misleading commentYaowei Bai1-1/+1
The comment says that the per-cpu batchsize and zone watermarks are determined by present_pages which is definitely wrong, they are both calculated from managed_pages. Fix it. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/mmap.c:insert_vm_struct(): check for failure before setting valuesChen Gang1-6/+7
There's no point in initializing vma->vm_pgoff if the insertion attempt will be failing anyway. Run the checks before performing the initialization. Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/khugepaged: allow interruption of allocation sleep againPetr Mladek1-2/+6
Commit 1dfb059b9438 ("thp: reduce khugepaged freezing latency") fixed khugepaged to do not block a system suspend. But the result is that it could not get interrupted before the given timeout because the condition for the wait event is "false". This patch puts back the original approach but it uses freezable_schedule_timeout_interruptible() instead of schedule_timeout_interruptible(). It does the right thing. I am pretty sure that the freezable variant was not used in the original fix only because it was not available at that time. The regression has been there for ages. It was not critical. It just did the allocation throttling a little bit more aggressively. I found this problem when converting the kthread to kthread worker API and trying to understand the code. This bug is thought to have minimal userspace-visible impact. Somebody could set a high alloc_sleep value by mistake, and then try to fix it back, but khugepaged would keep sleeping until the high value expires. Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/memblock.c: fiy typos in commentsAlexander Kuleshov1-4/+4
s/succees/success/ Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/compaction: correct to flush migrated pages if pageblock skip happensJoonsoo Kim2-15/+16
We cache isolate_start_pfn before entering isolate_migratepages(). If pageblock is skipped in isolate_migratepages() due to whatever reason, cc->migrate_pfn can be far from isolate_start_pfn hence we flush pages that were freed. For example, the following scenario can be possible: - assume order-9 compaction, pageblock order is 9 - start_isolate_pfn is 0x200 - isolate_migratepages() - skip a number of pageblocks - start to isolate from pfn 0x600 - cc->migrate_pfn = 0x620 - return - last_migrated_pfn is set to 0x200 - check flushing condition - current_block_start is set to 0x600 - last_migrated_pfn < current_block_start then do useless flush This wrong flush would not help the performance and success rate so this patch tries to fix it. One simple way to know the exact position where we start to isolate migratable pages is that we cache it in isolate_migratepages() before entering actual isolation. This patch implements that and fixes the problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm: rename alloc_pages_exact_node() to __alloc_pages_node()Vlastimil Babka10-15/+13
alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't fallback to current node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node and fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is passed among the gfp flags. The misleading name has lead to mistakes in the past, see for example commits 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). Another issue with the name is that there's a family of alloc_pages_exact*() functions where 'exact' means exact size (instead of page order), which leads to more confusion. To prevent further mistakes, this patch effectively renames alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's an optimized variant of alloc_pages_node() not intended for general usage. Both functions get described in comments. It has been also considered to really provide a convenience function for allocations restricted to a node, but the major opinion seems to be that __GFP_THISNODE already provides that functionality and we shouldn't duplicate the API needlessly. The number of users would be small anyway. Existing callers of alloc_pages_exact_node() are simply converted to call __alloc_pages_node(), with the exception of sba_alloc_coherent() which open-codes the check for NUMA_NO_NODE, so it is converted to use alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON checks, and since the current check for nid in alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which would be previously exposed. Both differences will be rectified by the next patch. To sum up, this patch makes no functional changes, except temporarily hiding potentially buggy callers. Restricting the checks in alloc_pages_node() is left for the next patch which can in turn expose more existing buggy callers. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Robin Holt <robinmholt@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cliff Whickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm, vmscan: unlock page while waiting on writebackHugh Dickins1-2/+5
This is merely a politeness: I've not found that shrink_page_list() leads to deadlock with the page it holds locked across wait_on_page_writeback(); but nevertheless, why hold others off by keeping the page locked there? And while we're at it: remove the mistaken "not " from the commentary on this Case 3 (and a distracting blank line from Case 2, if I may). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09list_lru: don't call list_lru_from_kmem if the list_head is emptyJeff Layton1-2/+2
If the list_head is empty then we'll have called list_lru_from_kmem for nothing. Move that call inside of the list_empty if block. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09kmemleak: record accurate early log buffer count and report when exceededWang Kai1-1/+2
In log_early function, crt_early_log should also count once when 'crt_early_log >= ARRAY_SIZE(early_log)'. Otherwise the reported count from kmemleak_init is one less than 'actual number'. Then, in kmemleak_init, if early_log buffer size equal actual number, kmemleak will init sucessful, so change warning condition to 'crt_early_log > ARRAY_SIZE(early_log)'. Signed-off-by: Wang Kai <morgan.wang@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/mmap.c: simplify the failure return working flowChen Gang1-22/+22
__split_vma() doesn't need out_err label, neither need initializing err. copy_vma() can return NULL directly when kmem_cache_alloc() fails. Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09shmem: recalculate file inode when fstatYu Zhao1-0/+16
Shmem uses shmem_recalc_inode to update i_blocks when it allocates page, undoes range or swaps. But mm can drop clean page without notifying shmem. This makes fstat sometimes return out-of-date block size. The problem can be partially solved when we add inode_operations->getattr which calls shmem_recalc_inode to update i_blocks for fstat. shmem_recalc_inode also updates counter used by statfs and vm_committed_as. For them the situation is not changed. They still suffer from the discrepancy after dropping clean page and before the function is called by aforementioned triggers. Signed-off-by: Yu Zhao <yuzhao@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/memblock.c: rename local variable of memblock_type to 'type'Alexander Kuleshov1-5/+5
Since commit e3239ff92a17 ("memblock: Rename memblock_region to memblock_type and memblock_property to memblock_region"), all local variables of the membock_type type were renamed to 'type'. This commit renames all remaining local variables with the memblock_type type to the same view. Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/hwpoison: don't try to unpoison containment-failed pagesNaoya Horiguchi1-0/+16
memory_failure() can be called at any page at any time, which means that we can't eliminate the possibility of containment failure. In such case the best option is to leak the page intentionally (and never touch it later.) We have an unpoison function for testing, and it cannot handle such containment-failed pages, which results in kernel panic (visible with various calltraces.) So this patch suggests that we limit the unpoisonable pages to properly contained pages and ignore any other ones. Testers are recommended to keep in mind that there're un-unpoisonable pages when writing test programs. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/hwpoison: fix race between soft_offline_page and unpoison_memoryWanpeng Li2-8/+5
Wanpeng Li reported a race between soft_offline_page() and unpoison_memory(), which causes the following kernel panic: BUG: Bad page state in process bash pfn:97000 page:ffffea00025c0000 count:0 mapcount:1 mapping: (null) index:0x7f4fdbe00 flags: 0x1fffff80080048(uptodate|active|swapbacked) page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set bad because of flags: flags: 0x40(active) Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45 Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015 Call Trace: dump_stack+0x48/0x5c bad_page+0xe6/0x140 free_pages_prepare+0x2f9/0x320 ? uncharge_list+0xdd/0x100 free_hot_cold_page+0x40/0x170 __put_single_page+0x20/0x30 put_page+0x25/0x40 unmap_and_move+0x1a6/0x1f0 migrate_pages+0x100/0x1d0 ? kill_procs+0x100/0x100 ? unlock_page+0x6f/0x90 __soft_offline_page+0x127/0x2a0 soft_offline_page+0xa6/0x200 This race is explained like below: CPU0 CPU1 soft_offline_page __soft_offline_page TestSetPageHWPoison unpoison_memory PageHWPoison check (true) TestClearPageHWPoison put_page -> release refcount held by get_hwpoison_page in unpoison_memory put_page -> release refcount held by isolate_lru_page in __soft_offline_page migrate_pages The second put_page() releases refcount held by isolate_lru_page() which will lead to unmap_and_move() releases the last refcount of page and w/ mapcount still 1 since try_to_unmap() is not called if there is only one user map the page. Anyway, the page refcount and mapcount will still mess if the page is mapped by multiple users. This race was introduced by commit 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()"), which focuses on preventing the reuse of successfully migrated page. Before this commit we prevent the reuse by changing the migratetype to MIGRATE_ISOLATE during soft offlining, which has the following problems, so simply reverting the commit is not a best option: 1) it doesn't eliminate the reuse completely, because set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the target page if the pageblock of the page contains one or more unmovable pages (i.e. has_unmovable_pages() returns true). 2) the original code changes migratetype to MIGRATE_ISOLATE forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline, regardless of the original migratetype state, which could impact other subsystems like memory hotplug or compaction. This patch moves PageSetHWPoison just after put_page() in unmap_and_move(), which closes up the reported race window and minimizes another race window b/w SetPageHWPoison and reallocation (which causes the reuse of soft-offlined page.) The latter race window still exists but it's acceptable, because it's rare and effectively the same as ordinary "containment failure" case even if it happens, so keep the window open is acceptable. Fixes: 4491f71260 ("mm/memory-failure: set PageHWPoison before migrate_pages()") Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Wanpeng Li <wanpeng.li@hotmail.com> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/hwpoison: introduce num_poisoned_pages wrappersNaoya Horiguchi1-16/+14
num_poisoned_pages counter will be changed outside mm/memory-failure.c by a subsequent patch, so this patch prepares wrappers to manipulate it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/hwpoison: replace most of put_page in memory error handling by ↵Wanpeng Li1-17/+15
put_hwpoison_page Replace most instances of put_page() in memory error handling with put_hwpoison_page(). Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/hwpoison: fix refcount of THP head page in no-injection caseWanpeng Li1-1/+1
Hwpoison injection takes a refcount of target page and another refcount of head page of THP if the target page is the tail page of a THP. However, current code doesn't release the refcount of head page if the THP is not supported to be injected wrt hwpoison filter. Fix it by reducing the refcount of head page if the target page is the tail page of a THP and it is not supported to be injected. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/hwpoison: introduce put_hwpoison_page to put refcount for memory error ↵Wanpeng Li1-0/+21
handling Introduce put_hwpoison_page to put refcount for memory error handling. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/hwpoison: fix PageHWPoison test/set raceWanpeng Li1-0/+2
There is a race between madvise_hwpoison path and memory_failure: CPU0 CPU1 madvise_hwpoison get_user_pages_fast PageHWPoison check (false) memory_failure TestSetPageHWPoison soft_offline_page PageHWPoison check (true) return -EBUSY (without put_page) Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/hwpoison: fix failure to split thp w/ refcount heldWanpeng Li1-0/+2
THP pages will get a refcount in madvise_hwpoison() w/ MF_COUNT_INCREASED flag, however, the refcount is still held when fail to split THP pages. Fix it by reducing the refcount of THP pages when fail to split THP. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm: add utility for early copy from unmapped ramMark Salter1-0/+22
When booting an arm64 kernel w/initrd using UEFI/grub, use of mem= will likely cut off part or all of the initrd. This leaves it outside the kernel linear map which leads to failure when unpacking. The x86 code has a similar need to relocate an initrd outside of mapped memory in some cases. The current x86 code uses early_memremap() to copy the original initrd from unmapped to mapped RAM. This patchset creates a generic copy_from_early_mem() utility based on that x86 code and has arm64 and x86 share it in their respective initrd relocation code. This patch (of 3): In some early boot circumstances, it may be necessary to copy from RAM outside the kernel linear mapping to mapped RAM. The need to relocate an initrd is one example in the x86 code. This patch creates a helper function based on current x86 code. Signed-off-by: Mark Salter <msalter@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm, compaction: skip compound pages by order in free scannerVlastimil Babka1-0/+25
The compaction free scanner is looking for PageBuddy() pages and skipping all others. For large compound pages such as THP or hugetlbfs, we can save a lot of iterations if we skip them at once using their compound_order(). This is generally unsafe and we can read a bogus value of order due to a race, but if we are careful, the only danger is skipping too much. When tested with stress-highalloc from mmtests on 4GB system with 1GB hugetlbfs pages, the vmstat compact_free_scanned count decreased by at least 15%. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm, compaction: always skip all compound pages by order in migrate scannerVlastimil Babka1-20/+27
The compaction migrate scanner tries to skip THP pages by their order, to reduce number of iterations for pages it cannot isolate. The check is only done if PageLRU() is true, which means it applies to THP pages, but not e.g. hugetlbfs pages or any other non-LRU compound pages, which we have to iterate by base pages. This limitation comes from the assumption that it's only safe to read compound_order() when we have the zone's lru_lock and THP cannot be split under us. But the only danger (after filtering out order values that are not below MAX_ORDER, to prevent overflows) is that we skip too much or too little after reading a bogus compound_order() due to a rare race. This is the same reasoning as patch 99c0fd5e51c4 ("mm, compaction: skip buddy pages by their order in the migrate scanner") introduced for unsafely reading PageBuddy() order. After this patch, all pages are tested for PageCompound() and we skip them by compound_order(). The test is done after the test for balloon_page_movable() as we don't want to assume if balloon pages (or other pages with own isolation and migration implementation if a generic API gets implemented) are compound or not. When tested with stress-highalloc from mmtests on 4GB system with 1GB hugetlbfs pages, the vmstat compact_migrate_scanned count decreased by 15%. [kirill.shutemov@linux.intel.com: change PageTransHuge checks to PageCompound for different series was squashed here] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm, compaction: encapsulate resetting cached scanner positionsVlastimil Babka1-6/+10
Reseting the cached compaction scanner positions is now open-coded in __reset_isolation_suitable() and compact_finished(). Encapsulate the functionality in a new function reset_cached_positions(). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm, compaction: simplify handling restart position in free pages scannerVlastimil Babka1-15/+20
Handling the position where compaction free scanner should restart (stored in cc->free_pfn) got more complex with commit e14c720efdd7 ("mm, compaction: remember position within pageblock in free pages scanner"). Currently the position is updated in each loop iteration of isolate_freepages(), although it should be enough to update it only when breaking from the loop. There's also an extra check outside the loop updates the position in case we have met the migration scanner. This can be simplified if we move the test for having isolated enough from the for-loop header next to the test for contention, and determining the restart position only in these cases. We can reuse the isolate_start_pfn variable for this instead of setting cc->free_pfn directly. Outside the loop, we can simply set cc->free_pfn to current value of isolate_start_pfn without any extra check. Also add a VM_BUG_ON to catch possible mistake in the future, in case we later add a new condition that terminates isolate_freepages_block() prematurely without also considering the condition in isolate_freepages(). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm, compaction: more robust check for scanners meetingVlastimil Babka1-8/+14
Assorted compaction cleanups and optimizations. The interesting patches are 4 and 5. In 4, skipping of compound pages in single iteration is improved for migration scanner, so it works also for !PageLRU compound pages such as hugetlbfs, slab etc. Patch 5 introduces this kind of skipping in the free scanner. The trick is that we can read compound_order() without any protection, if we are careful to filter out values larger than MAX_ORDER. The only danger is that we skip too much. The same trick was already used for reading the freepage order in the migrate scanner. To demonstrate improvements of Patches 4 and 5 I've run stress-highalloc from mmtests, set to simulate THP allocations (including __GFP_COMP) on a 4GB system where 1GB was occupied by hugetlbfs pages. I'll include just the relevant stats: Patch 3 Patch 4 Patch 5 Compaction stalls 7523 7529 7515 Compaction success 323 304 322 Compaction failures 7200 7224 7192 Page migrate success 247778 264395 240737 Page migrate failure 15358 33184 21621 Compaction pages isolated 906928 980192 909983 Compaction migrate scanned 2005277 1692805 1498800 Compaction free scanned 13255284 11539986 9011276 Compaction cost 288 305 277 With 5 iterations per patch, the results are still noisy, but we can see that Patch 4 does reduce migrate_scanned by 15% thanks to skipping the hugetlbfs pages at once. Interestingly, free_scanned is also reduced and I have no idea why. Patch 5 further reduces free_scanned as expected, by 15%. Other stats are unaffected modulo noise. [1] https://lkml.org/lkml/2015/1/19/158 This patch (of 5): Compaction should finish when the migration and free scanner meet, i.e. they reach the same pageblock. Currently however, the test in compact_finished() simply just compares the exact pfns, which may yield a false negative when the free scanner position is in the middle of a pageblock and the migration scanner reaches the begining of the same pageblock. This hasn't been a problem until commit e14c720efdd7 ("mm, compaction: remember position within pageblock in free pages scanner") allowed the free scanner position to be in the middle of a pageblock between invocations. The hot-fix 1d5bfe1ffb5b ("mm, compaction: prevent infinite loop in compact_zone") prevented the issue by adding a special check in the migration scanner to satisfy the current detection of scanners meeting. However, the proper fix is to make the detection more robust. This patch introduces the compact_scanners_met() function that returns true when the free scanner position is in the same or lower pageblock than the migration scanner. The special case in isolate_migratepages() introduced by 1d5bfe1ffb5b is removed. Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm: add support for __GFP_ZERO flag to dma_pool_alloc()Sean O. Stalley1-2/+7
Currently a call to dma_pool_alloc() with a ___GFP_ZERO flag returns a non-zeroed memory region. This patchset adds support for the __GFP_ZERO flag to dma_pool_alloc(), adds 2 wrapper functions for allocing zeroed memory from a pool, and provides a coccinelle script for finding & replacing instances of dma_pool_alloc() followed by memset(0) with a single dma_pool_zalloc() call. There was some concern that this always calls memset() to zero, instead of passing __GFP_ZERO into the page allocator. [https://lkml.org/lkml/2015/7/15/881] I ran a test on my system to get an idea of how often dma_pool_alloc() calls into pool_alloc_page(). After Boot: [ 30.119863] alloc_calls:541, page_allocs:7 After an hour: [ 3600.951031] alloc_calls:9566, page_allocs:12 After copying 1GB file onto a USB drive: [ 4260.657148] alloc_calls:17225, page_allocs:12 It doesn't look like dma_pool_alloc() calls down to the page allocator very often (at least on my system). This patch (of 4): Currently the __GFP_ZERO flag is ignored by dma_pool_alloc(). Make dma_pool_alloc() zero the memory if this flag is set. Signed-off-by: Sean O. Stalley <sean.stalley@intel.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Vinod Koul <vinod.koul@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Gilles Muller <Gilles.Muller@lip6.fr> Cc: Nicolas Palix <nicolas.palix@imag.fr> Cc: Michal Marek <mmarek@suse.cz> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09vmscan: fix increasing nr_isolated incurred by putback unevictable pagesJaewon Kim1-1/+1
reclaim_clean_pages_from_list() assumes that shrink_page_list() returns number of pages removed from the candidate list. But shrink_page_list() puts back mlocked pages without passing it to caller and without counting as nr_reclaimed. This increases nr_isolated. To fix this, this patch changes shrink_page_list() to pass unevictable pages back to caller. Caller will take care those pages. Minchan said: It fixes two issues. 1. With unevictable page, cma_alloc will be successful. Exactly speaking, cma_alloc of current kernel will fail due to unevictable pages. 2. fix leaking of NR_ISOLATED counter of vmstat With it, too_many_isolated works. Otherwise, it could make hang until the process get SIGKILL. Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm: vmscan: never isolate more pages than necessaryVladimir Davydov1-1/+2
If transparent huge pages are enabled, we can isolate many more pages than we actually need to scan, because we count both single and huge pages equally in isolate_lru_pages(). Since commit 5bc7b8aca942d ("mm: thp: add split tail pages to shrink page list in page reclaim"), we scan all the tail pages immediately after a huge page split (see shrink_page_list()). As a result, we can reclaim up to SWAP_CLUSTER_MAX * HPAGE_PMD_NR (512 MB) in one run! This is easy to catch on memcg reclaim with zswap enabled. The latter makes swapout instant so that if we happen to scan an unreferenced huge page we will evict both its head and tail pages immediately, which is likely to result in excessive reclaim. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09bootmem: avoid freeing to bootmem after bootmem is doneChris Metcalf1-0/+7
Bootmem isn't popular any more, but some architectures still use it, and freeing to bootmem after calling free_all_bootmem_core() can end up scribbling over random memory. Instead, make sure the kernel generates a warning in this case by ensuring the node_bootmem_map field is non-NULL when are freeing or marking bootmem. An instance of this bug was just fixed in the tile architecture ("tile: use free_bootmem_late() for initrd") and catching this case more widely seems like a good thing. Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Paul McQuade <paulmcquad@gmail.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm, page_isolation: make set/unset_migratetype_isolate() file-localNaoya Horiguchi1-2/+3
Nowaday, set/unset_migratetype_isolate() is defined and used only in mm/page_isolation, so let's limit the scope within the file. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/mempolicy.c: get rid of duplicated check for vma(VM_PFNMAP) in ↵Aristeu Rozanski1-3/+0
queue_pages_range() This check was introduced as part of 6f4576e3687 ("mempolicy: apply page table walker on queue_pages_range()") which got duplicated by 48684a65b4e ("mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)") by reintroducing it earlier on queue_page_test_walk() Signed-off-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mem-hotplug: handle node hole when initializing numa_meminfo.Tang Chen1-1/+1
When parsing SRAT, all memory ranges are added into numa_meminfo. In numa_init(), before entering numa_cleanup_meminfo(), all possible memory ranges are in numa_meminfo. And numa_cleanup_meminfo() removes all ranges over max_pfn or empty. But, this only works if the nodes are continuous. Let's have a look at the following example: We have an SRAT like this: SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff] SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff] SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff] SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug On boot, only node 0,1,2,3 exist. And the numa_meminfo will look like this: numa_meminfo.nr_blks = 9 1. on node 0: [0, 60000000] 2. on node 0: [100000000, 20000000000] 3. on node 1: [20000000000, 40000000000] 4. on node 4: [40000000000, 60000000000] 5. on node 5: [60000000000, 80000000000] 6. on node 2: [80000000000, a0000000000] 7. on node 3: [a0000000000, a0800000000] 8. on node 6: [c0000000000, a0800000000] 9. on node 7: [e0000000000, a0800000000] And numa_cleanup_meminfo() will merge 1 and 2, and remove 8,9 because the end address is over max_pfn, which is a0800000000. But 4 and 5 are not removed because their end addresses are less then max_pfn. But in fact, node 4 and 5 don't exist. In a word, numa_cleanup_meminfo() is not able to handle holes between nodes. Since memory ranges in node 4 and 5 are in numa_meminfo, in numa_register_memblks(), node 4 and 5 will be mistakenly set to online. If you run lscpu, it will show: NUMA node0 CPU(s): 0-14,128-142 NUMA node1 CPU(s): 15-29,143-157 NUMA node2 CPU(s): NUMA node3 CPU(s): NUMA node4 CPU(s): 62-76,190-204 NUMA node5 CPU(s): 78-92,206-220 In this patch, we use memblock_overlaps_region() to check if ranges in numa_meminfo overlap with ranges in memory_block. Since memory_block contains all available memory at boot time, if they overlap, it means the ranges exist. If not, then remove them from numa_meminfo. After this patch, lscpu will show: NUMA node0 CPU(s): 0-14,128-142 NUMA node1 CPU(s): 15-29,143-157 NUMA node4 CPU(s): 62-76,190-204 NUMA node5 CPU(s): 78-92,206-220 Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vladimir Murzin <vladimir.murzin@arm.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/memblock.c: make memblock_overlaps_region() return bool.Tang Chen1-5/+5
memblock_overlaps_region() checks if the given memblock region intersects a region in memblock. If so, it returns the index of the intersected region. But its only caller is memblock_is_region_reserved(), and it returns 0 if false, non-zero if true. Both of these should return bool. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Vladimir Murzin <vladimir.murzin@arm.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Alexander Kuleshov <kuleshovmail@gmail.com> Cc: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm: madvise allow remove operation for hugetlbfsMike Kravetz1-1/+1
Now that we have hole punching support for hugetlbfs, we can also support the MADV_REMOVE interface to it. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09hugetlbfs: add hugetlbfs_fallocate()Mike Kravetz1-1/+1
This is based on the shmem version, but it has diverged quite a bit. We have no swap to worry about, nor the new file sealing. Add synchronication via the fault mutex table to coordinate page faults, fallocate allocation and fallocate hole punch. What this allows us to do is move physical memory in and out of a hugetlbfs file without having it mapped. This also gives us the ability to support MADV_REMOVE since it is currently implemented using fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of a hugetlbfs file, which wasn't possible before. hugetlbfs fallocate only operates on whole huge pages. Based on code by Dave Hansen. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09hugetlbfs: New huge_add_to_page_cache helper routineMike Kravetz1-9/+18
Currently, there is only a single place where hugetlbfs pages are added to the page cache. The new fallocate code be adding a second one, so break the functionality out into its own helper. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/hugetlb: alloc_huge_page handle areas hole punched by fallocateMike Kravetz1-15/+39
Areas hole punched by fallocate will not have entries in the region/reserve map. However, shared mappings with min_size subpool reservations may still have reserved pages. alloc_huge_page needs to handle this special case and do the proper accounting. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punchMike Kravetz1-2/+13
In vma_has_reserves(), the current assumption is that reserves are always present for shared mappings. However, this will not be the case with fallocate hole punch. When punching a hole, the present page will be deleted as well as the region/reserve map entry (and hence any reservation). vma_has_reserves is passed "chg" which indicates whether or not a region/reserve map is present. Use this to determine if reserves are actually present or were removed via hole punch. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09hugetlbfs: truncate_hugepages() takes a range of pagesMike Kravetz1-3/+37
Modify truncate_hugepages() to take a range of pages (start, end) instead of simply start. If an end value of LLONG_MAX is passed, the current "truncate" functionality is maintained. Existing callers are modified to pass LLONG_MAX as end of range. By keying off end == LLONG_MAX, the routine behaves differently for truncate and hole punch. Page removal is now synchronized with page allocation via faults by using the fault mutex table. The hole punch case can experience the rare region_del error and must handle accordingly. Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in the case where region_del returns an error. Since the routine handles more than just the truncate case, it is renamed to remove_inode_hugepages(). To be consistent, the routine truncate_huge_page() is renamed remove_huge_page(). Downstream of remove_inode_hugepages(), the routine hugetlb_unreserve_pages() is also modified to take a range of pages. hugetlb_unreserve_pages is modified to detect an error from region_del and pass it back to the caller. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-09mm/hugetlb: expose hugetlb fault mutex for use by fallocateMike Kravetz1-10/+10
hugetlb page faults are currently synchronized by the table of mutexes (htlb_fault_mutex_table). fallocate code will need to synchronize with the page fault code when it allocates or deletes pages. Expose interfaces so that fallocate operations can be synchronized with page faults. Minor name changes to be more consistent with other global hugetlb symbols. Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>