summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2015-11-06mm: fix overflow in find_zone_movable_pfns_for_nodes()Xishi Qiu1-0/+1
If the user set "movablecore=xx" to a large number, corepages will overflow. Fix the problem. Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Tang Chen <tangchen@cn.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/vmscan.c: fix types of some localsAlexandru Moise1-4/+4
In zone_reclaimable_pages(), `nr' is returned by a function which is declared as returning "unsigned long", so declare it such. Negative values are meaningless here. In zone_pagecache_reclaimable() we should also declare `delta' and `nr_pagecache_reclaimable' as being unsigned longs because they're used to store the values returned by zone_page_state() and zone_unmapped_file_pages() which also happen to return unsigned integers. [akpm@linux-foundation.org: make zone_pagecache_reclaimable() return ulong rather than long] Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, oom: remove task_lock protecting comm printingDavid Rientjes1-7/+1
The oom killer takes task_lock() in a couple of places solely to protect printing the task's comm. A process's comm, including current's comm, may change due to /proc/pid/comm or PR_SET_NAME. The comm will always be NULL-terminated, so the worst race scenario would only be during update. We can tolerate a comm being printed that is in the middle of an update to avoid taking the lock. Other locations in the kernel have already dropped task_lock() when printing comm, so this is consistent. Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, compaction: distinguish contended status in tracepointsVlastimil Babka1-3/+6
Compaction returns prematurely with COMPACT_PARTIAL when contended or has fatal signal pending. This is ok for the callers, but might be misleading in the traces, as the usual reason to return COMPACT_PARTIAL is that we think the allocation should succeed. After this patch we distinguish the premature ending condition in the mm_compaction_finished and mm_compaction_end tracepoints. The contended status covers the following reasons: - lock contention or need_resched() detected in async compaction - fatal signal pending - too many pages isolated in the zone (only for async compaction) Further distinguishing the exact reason seems unnecessary for now. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, compaction: export tracepoints status strings to userspaceVlastimil Babka1-11/+0
Some compaction tracepoints convert the integer return values to strings using the compaction_status_string array. This works for in-kernel printing, but not userspace trace printing of raw captured trace such as via trace-cmd report. This patch converts the private array to appropriate tracepoint macros that result in proper userspace support. trace-cmd output before: transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0 zone=ffffffff81815d7a order=9 ret= after: transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0 zone=ffffffff81815d7a order=9 ret=partial Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/oom_kill.c: suppress unnecessary "sharing same memory" messageTetsuo Handa1-1/+3
oom_kill_process() sends SIGKILL to other thread groups sharing victim's mm. But printing "Kill process %d (%s) sharing same memory\n" lines makes no sense if they already have pending SIGKILL. This patch reduces the "Kill process" lines by printing that line with info level only if SIGKILL is not pending. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/oom_kill.c: fix potentially killing unrelated processTetsuo Handa1-1/+3
At the for_each_process() loop in oom_kill_process(), we are comparing address of OOM victim's mm without holding a reference to that mm. If there are a lot of processes to compare or a lot of "Kill process %d (%s) sharing same memory" messages to print, for_each_process() loop could take very long time. It is possible that meanwhile the OOM victim exits and releases its mm, and then mm is allocated with the same address and assigned to some unrelated process. When we hit such race, the unrelated process will be killed by error. To make sure that the OOM victim's mm does not go away until for_each_process() loop finishes, get a reference on the OOM victim's mm before calling task_unlock(victim). [oleg@redhat.com: several fixes] Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILLTetsuo Handa1-1/+6
It was confirmed that a local unprivileged user can consume all memory reserves and hang up that system using time lag between the OOM killer sets TIF_MEMDIE on an OOM victim and sends SIGKILL to that victim, for printk() inside for_each_process() loop at oom_kill_process() can consume many seconds when there are many thread groups sharing the same memory. Before starting oom-depleter process: Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB As of invoking the OOM killer: Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB Between the thread group leader got TIF_MEMDIE and receives SIGKILL: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB The oom-depleter's thread group leader which got TIF_MEMDIE started memset() in user space after the OOM killer set TIF_MEMDIE, and it was free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is set, the oom-depleter can terminate without touching memory reserves. Although the possibility of hitting this time lag is very small for 3.19 and earlier kernels because TIF_MEMDIE is set immediately before sending SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can step between and allow memory allocations which are not needed for terminating the OOM victim. Fixes: 83363b917a29 ("oom: make sure that TIF_MEMDIE is set under task_lock") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/vmscan: make inactive_anon/file_is_low return boolYaowei Bai1-7/+7
Make inactive_anon/file_is_low return bool due to these particular functions only using either one or zero as their return value. No functional change. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/memcontrol.c: fix order calculation in try_charge()Jerome Marchand1-1/+2
Since commit 6539cc053869 ("mm: memcontrol: fold mem_cgroup_do_charge()"), the order to pass to mem_cgroup_oom() is calculated by passing the number of pages to get_order() instead of the expected size in bytes. AFAICT, it only affects the value displayed in the oom warning message. This patch fix this. Michal said: : We haven't noticed that just because the OOM is enabled only for page : faults of order-0 (single page) and get_order work just fine. Thanks for : noticing this. If we ever start triggering OOM on different orders this : would be broken. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: hwpoison: ratelimit messages from unpoison_memory()Naoya Horiguchi1-9/+25
Currently kernel prints out results of every single unpoison event, which i= s not necessary because unpoison is purely a testing feature and testers can = get little or no information from lots of lines of unpoison log storm. So this patch ratelimits printk in unpoison_memory(). This patch introduces a file local ratelimit_state, which adds 64 bytes to memory-failure.o. If we apply pr_info_ratelimited() for 8 callsite below, 2= 56 bytes is added, so it's a win. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/filemap.c: make global sync not clear error status of individual inodesJunichi Nomura1-13/+54
filemap_fdatawait() is a function to wait for on-going writeback to complete but also consume and clear error status of the mapping set during writeback. The latter functionality is critical for applications to detect writeback error with system calls like fsync(2)/fdatasync(2). However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl, which don't check error status of individual mappings. As a result, fsync() may not be able to detect writeback error if events happen in the following order: Application System admin ---------------------------------------------------------- write data on page cache Run sync command writeback completes with error filemap_fdatawait() clears error fsync returns success (but the data is not on disk) This patch adds filemap_fdatawait_keep_errors() for call sites where writeback error is not handled so that they don't clear error status. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Fengguang Wu <fengguang.wu@gmail.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/compaction.c: add an is_via_compact_memory() helperYaowei Bai1-12/+14
Introduce is_via_compact_memory() helper indicating compacting via /proc/sys/vm/compact_memory to improve readability. To catch this situation in __compaction_suitable, use order as parameter directly instead of using struct compact_control. This patch has no functional changes. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Cc: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/vmscan: make inactive_anon_is_low_global return directlyYaowei Bai1-4/+1
Delete unnecessary if to let inactive_anon_is_low_global return directly. No functional changes. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: hugetlb: proc: add HugetlbPages field to /proc/PID/statusNaoya Horiguchi2-1/+12
Currently there's no easy way to get per-process usage of hugetlb pages, which is inconvenient because userspace applications which use hugetlb typically want to control their processes on the basis of how much memory (including hugetlb) they use. So this patch simply provides easy access to the info via /proc/PID/status. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Joern Engel <joern@logfs.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: use only per-device readahead limitRoman Gushchin2-17/+5
Maximal readahead size is limited now by two values: 1) by global 2Mb constant (MAX_READAHEAD in max_sane_readahead()) 2) by configurable per-device value* (bdi->ra_pages) There are devices, which require custom readahead limit. For instance, for RAIDs it's calculated as number of devices multiplied by chunk size times 2. Readahead size can never be larger than bdi->ra_pages * 2 value (POSIX_FADV_SEQUNTIAL doubles readahead size). If so, why do we need two limits? I suggest to completely remove this max_sane_readahead() stuff and use per-device readahead limit everywhere. Also, using right readahead size for RAID disks can significantly increase i/o performance: before: dd if=/dev/md2 of=/dev/null bs=100M count=100 100+0 records in 100+0 records out 10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s after: $ dd if=/dev/md2 of=/dev/null bs=100M count=100 100+0 records in 100+0 records out 10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s (It's an 8-disks RAID5 storage). This patch doesn't change sys_readahead and madvise(MADV_WILLNEED) behavior introduced by 6d2be915e589b58 ("mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages"). Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: onstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/page_alloc: remove unused parameter in init_currently_empty_zone()Yaowei Bai2-6/+4
Commit a2f3aa025766 ("[PATCH] Fix sparsemem on Cell") fixed an oops experienced on the Cell architecture when init-time functions, early_*(), are called at runtime by introducing an 'enum memmap_context' parameter to memmap_init_zone() and init_currently_empty_zone(). This parameter is intended to be used to tell whether the call of these two functions is being made on behalf of a hotplug event, or happening at boot-time. However, init_currently_empty_zone() does not use this parameter at all, so remove it. Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm, migrate: count pages failing all retries in vmstat and tracepointVlastimil Babka1-1/+2
Migration tries up to 10 times to migrate pages that return -EAGAIN until it gives up. If some pages fail all retries, they are counted towards the number of failed pages that migrate_pages() returns. They should also be counted in the /proc/vmstat pgmigrate_fail and in the mm_migrate_pages tracepoint. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/memblock: make memblock_remove_range() staticAlexander Kuleshov1-1/+1
memblock_remove_range() is only used in the mm/memblock.c, so we can make it static. Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/mremap: use offset_in_page macroAlexander Kuleshov1-6/+6
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/mmap: use offset_in_page macroAlexander Kuleshov1-6/+6
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/vmalloc: use offset_in_page macroAlexander Kuleshov1-6/+6
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/mlock: use offset_in_page macroAlexander Kuleshov1-3/+3
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/util: use offset_in_page macroAlexander Kuleshov1-1/+1
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/percpu: use offset_in_page macroAlexander Kuleshov1-5/+5
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/early_ioremap: use offset_in_page macroAlexander Kuleshov1-3/+3
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/mincore: use offset_in_page macroAlexander Kuleshov1-1/+1
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/nommu: use offset_in_page macroAlexander Kuleshov1-4/+4
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/msync: use offset_in_page macroAlexander Kuleshov1-1/+1
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/list_lru.c: replace nr_node_ids for loop with for_each_node()Raghavendra K T1-11/+23
The functions used in the patch are in slowpath, which gets called whenever alloc_super is called during mounts. Though this should not make difference for the architectures with sequential numa node ids, for the powerpc which can potentially have sparse node ids (for e.g., 4 node system having numa ids, 0,1,16,17 is common), this patch saves some unnecessary allocations for non existing numa nodes. Even without that saving, perhaps patch makes code more readable. [vdavydov@parallels.com: take memcg_aware check outside for_each loop] Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Anton Blanchard <anton@samba.org> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Greg Kurz <gkurz@linux.vnet.ibm.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: fix docbook comment for get_vaddr_frames()Jonathan Corbet1-1/+1
get_vaddr_frames() has a comment that's *almost* a docbook comment; add the missing star so that the tools will find it properly. Signed-off-by: Jonathan Corbet <corbet@lwn.net> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06memcg: ratify and consolidate over-charge handlingTejun Heo1-49/+20
try_charge() is the main charging logic of memcg. When it hits the limit but either can't fail the allocation due to __GFP_NOFAIL or the task is likely to free memory very soon, being OOM killed, has SIGKILL pending or exiting, it "bypasses" the charge to the root memcg and returns -EINTR. While this is one approach which can be taken for these situations, it has several issues. * It unnecessarily lies about the reality. The number itself doesn't go over the limit but the actual usage does. memcg is either forced to or actively chooses to go over the limit because that is the right behavior under the circumstances, which is completely fine, but, if at all avoidable, it shouldn't be misrepresenting what's happening by sneaking the charges into the root memcg. * Despite trying, we already do over-charge. kmemcg can't deal with switching over to the root memcg by the point try_charge() returns -EINTR, so it open-codes over-charing. * It complicates the callers. Each try_charge() user has to handle the weird -EINTR exception. memcg_charge_kmem() does the manual over-charging. mem_cgroup_do_precharge() performs unnecessary uncharging of root memcg, which BTW is inconsistent with what memcg_charge_kmem() does but not broken as [un]charging are noops on root memcg. mem_cgroup_try_charge() needs to switch the returned cgroup to the root one. The reality is that in memcg there are cases where we are forced and/or willing to go over the limit. Each such case needs to be scrutinized and justified but there definitely are situations where that is the right thing to do. We alredy do this but with a superficial and inconsistent disguise which leads to unnecessary complications. This patch updates try_charge() so that it over-charges and returns 0 when deemed necessary. -EINTR return is removed along with all special case handling in the callers. While at it, remove the local variable @ret, which was initialized to zero and never changed, along with done: label which just returned the always zero @ret. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06memcg: punt high overage reclaim to return-to-userland pathTejun Heo1-8/+39
Currently, try_charge() tries to reclaim memory synchronously when the high limit is breached; however, if the allocation doesn't have __GFP_WAIT, synchronous reclaim is skipped. If a process performs only speculative allocations, it can blow way past the high limit. This is actually easily reproducible by simply doing "find /". slab/slub allocator tries speculative allocations first, so as long as there's memory which can be consumed without blocking, it can keep allocating memory regardless of the high limit. This patch makes try_charge() always punt the over-high reclaim to the return-to-userland path. If try_charge() detects that high limit is breached, it adds the overage to current->memcg_nr_pages_over_high and schedules execution of mem_cgroup_handle_over_high() which performs synchronous reclaim from the return-to-userland path. As long as kernel doesn't have a run-away allocation spree, this should provide enough protection while making kmemcg behave more consistently. It also has the following benefits. - All over-high reclaims can use GFP_KERNEL regardless of the specific gfp mask in use, e.g. GFP_NOFS, when the limit was breached. - It copes with prio inversion. Previously, a low-prio task with small memory.high might perform over-high reclaim with a bunch of locks held. If a higher prio task needed any of these locks, it would have to wait until the low prio task finished reclaim and released the locks. By handing over-high reclaim to the task exit path this issue can be avoided. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@kernel.org> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06memcg: flatten task_struct->memcg_oomTejun Heo1-8/+8
task_struct->memcg_oom is a sub-struct containing fields which are used for async memcg oom handling. Most task_struct fields aren't packaged this way and it can lead to unnecessary alignment paddings. This patch flattens it. * task.memcg_oom.memcg -> task.memcg_in_oom * task.memcg_oom.gfp_mask -> task.memcg_oom_gfp_mask * task.memcg_oom.order -> task.memcg_oom_order * task.memcg_oom.may_oom -> task.memcg_may_oom In addition, task.memcg_may_oom is relocated to where other bitfields are which reduces the size of task_struct. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/mmap.c: remove useless statement "vma = NULL" in find_vma()Chen Gang1-1/+0
Before the main loop, vma is already is NULL. There is no need to set it to NULL again. Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06uaccess: reimplement probe_kernel_address() using probe_kernel_read()Andrew Morton1-0/+5
probe_kernel_address() is basically the same as the (later added) probe_kernel_read(). The return value on EFAULT is a bit different: probe_kernel_address() returns number-of-bytes-not-copied whereas probe_kernel_read() returns -EFAULT. All callers have been checked, none cared. probe_kernel_read() can be overridden by the architecture whereas probe_kernel_address() cannot. parisc, blackfin and um do this, to insert additional checking. Hence this patch possibly fixes obscure bugs, although there are only two probe_kernel_address() callsites outside arch/. My first attempt involved removing probe_kernel_address() entirely and converting all callsites to use probe_kernel_read() directly, but that got tiresome. This patch shrinks mm/slab_common.o by 218 bytes. For a single probe_kernel_address() callsite. Cc: Steven Miao <realmz6@gmail.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/mlock.c: reorganize mlockall() return values and remove goto-out labelAlexey Klimov1-5/+4
In mlockall syscall wrapper after out-label for goto code just doing return. Remove goto out statements and return error values directly. Also instead of rewriting ret variable before every if-check move returns to 'error'-like path under if-check. Objdump asm listing showed me reducing by few asm lines. Object file size descreased from 220592 bytes to 220528 bytes for me (for aarch64). Signed-off-by: Alexey Klimov <klimov.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/kmemleak.c: remove unneeded initialization of object to NULLAlexey Klimov1-1/+1
Few lines below object is reinitialized by lookup_object() so we don't need to init it by NULL in the beginning of find_and_get_object(). Signed-off-by: Alexey Klimov <alexey.klimov@linaro.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm: slab: only move management objects off-slab for sizes larger than ↵Catalin Marinas1-2/+3
KMALLOC_MIN_SIZE On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc configurations defining ARCH_DMA_MINALIGN to 128), the first kmalloc_caches[] entry to be initialised after slab_early_init = 0 is "kmalloc-128" with index 7. Depending on the debug kernel configuration, sizeof(struct kmem_cache) can be larger than 128 resulting in an INDEX_NODE of 8. Commit 8fc9cf420b36 ("slab: make more slab management structure off the slab") enables off-slab management objects for sizes starting with PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration) and the creation of the "kmalloc-128" cache would try to place the management objects off-slab. However, since KMALLOC_MIN_SIZE is already 128 and freelist_size == 32 in __kmem_cache_create(), kmalloc_slab(freelist_size) returns NULL (kmalloc_caches[7] not populated yet). This triggers the following bug on arm64: kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540 Hardware name: Juno (DT) PC is at __kmem_cache_create+0x21c/0x280 LR is at __kmem_cache_create+0x210/0x280 [...] Call trace: __kmem_cache_create+0x21c/0x280 create_boot_cache+0x48/0x80 create_kmalloc_cache+0x50/0x88 create_kmalloc_caches+0x4c/0xf4 kmem_cache_init+0x100/0x118 start_kernel+0x214/0x33c This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE. Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab") Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> [3.15+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/slub: calculate start order with reserved in considerationWei Yang1-5/+1
In slub_order(), the order starts from max(min_order, get_order(min_objects * size)). When (min_objects * size) has different order from (min_objects * size + reserved), it will skip this order via a check in the loop. This patch optimizes this a little by calculating the start order with `reserved' in consideration and removing the check in loop. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/slub: use get_order() instead of fls()Wei Yang1-2/+1
get_order() is more easy to understand. This patch just replaces it. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/slub: correct the comment in calculate_order()Wei Yang1-1/+1
In calculate_order(), it tries to calculate the best order by adjusting the fraction and min_objects. On each iteration on min_objects, fraction iterates on 16, 8, 4. Which means the acceptable waste increases with 1/16, 1/8, 1/4. This patch corrects the comment according to the code. Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/slab_common.c: initialize kmem_cache pointer to NULLAlexandru Moise1-2/+1
The assignment to NULL within the error condition was written in a 2014 patch to suppress a compiler warning. However it would be cleaner to just initialize the kmem_cache to NULL and just return it in case of an error condition. Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/slab_common.c: do not warn that cache is busy on destroy more than onceVladimir Davydov1-6/+7
Currently, when kmem_cache_destroy() is called for a global cache, we print a warning for each per memcg cache attached to it that has active objects (see shutdown_cache). This is redundant, because it gives no new information and only clutters the log. If a cache being destroyed has active objects, there must be a memory leak in the module that created the cache, and it does not matter if the cache was used by users in memory cgroups or not. This patch moves the warning from shutdown_cache(), which is called for shutting down both global and per memcg caches, to kmem_cache_destroy(), so that the warning is only printed once if there are objects left in the cache being destroyed. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/slab_common.c: clear pointers to per memcg caches on destroyVladimir Davydov2-21/+78
Currently, we do not clear pointers to per memcg caches in the memcg_params.memcg_caches array when a global cache is destroyed with kmem_cache_destroy. This is fine if the global cache does get destroyed. However, a cache can be left on the list if it still has active objects when kmem_cache_destroy is called (due to a memory leak). If this happens, the entries in the array will point to already freed areas, which is likely to result in data corruption when the cache is reused (via slab merging). Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06mm/slab_common.c: rename cache create/destroy helpersVladimir Davydov1-19/+18
do_kmem_cache_create(), do_kmem_cache_shutdown(), and do_kmem_cache_release() sound awkward for static helper functions that are not supposed to be used outside slab_common.c. Rename them to create_cache(), shutdown_cache(), and release_caches(), respectively. This patch is a pure cleanup and does not introduce any functional changes. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06slab: convert slab_is_available() to booleanDenis Kirjanov1-1/+1
A good candidate to return a boolean result. Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org> Cc: Christoph Lameter <cl@linux.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05Merge tag 'driver-core-4.4-rc1' of ↵Linus Torvalds2-8/+8
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here's the "big" driver core updates for 4.4-rc1. Primarily a bunch of debugfs updates, with a smattering of minor driver core fixes and updates as well. All have been in linux-next for a long time" * tag 'driver-core-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: debugfs: Add debugfs_create_ulong() of: to support binding numa node to specified device in devicetree debugfs: Add read-only/write-only bool file ops debugfs: Add read-only/write-only size_t file ops debugfs: Add read-only/write-only x64 file ops debugfs: Consolidate file mode checks in debugfs_create_*() Revert "mm: Check if section present during memory block (un)registering" driver-core: platform: Provide helpers for multi-driver modules mm: Check if section present during memory block (un)registering devres: fix a for loop bounds check CMA: fix CONFIG_CMA_SIZE_MBYTES overflow in 64bit base/platform: assert that dev_pm_domain callbacks are called unconditionally sysfs: correctly handle short reads on PREALLOC attrs. base: soc: siplify ida usage kobject: move EXPORT_SYMBOL() macros next to corresponding definitions kobject: explain what kobject's sd field is debugfs: document that debugfs_remove*() accepts NULL and error values debugfs: Pass bool pointer to debugfs_create_bool() ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'
2015-11-05Merge tag 'for-linus-4.4-rc0-tag' of ↵Linus Torvalds1-8/+21
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from David Vrabel: - Improve balloon driver memory hotplug placement. - Use unpopulated hotplugged memory for foreign pages (if supported/enabled). - Support 64 KiB guest pages on arm64. - CPU hotplug support on arm/arm64. * tag 'for-linus-4.4-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (44 commits) xen: fix the check of e_pfn in xen_find_pfn_range x86/xen: add reschedule point when mapping foreign GFNs xen/arm: don't try to re-register vcpu_info on cpu_hotplug. xen, cpu_hotplug: call device_offline instead of cpu_down xen/arm: Enable cpu_hotplug.c xenbus: Support multiple grants ring with 64KB xen/grant-table: Add an helper to iterate over a specific number of grants xen/xenbus: Rename *RING_PAGE* to *RING_GRANT* xen/arm: correct comment in enlighten.c xen/gntdev: use types from linux/types.h in userspace headers xen/gntalloc: use types from linux/types.h in userspace headers xen/balloon: Use the correct sizeof when declaring frame_list xen/swiotlb: Add support for 64KB page granularity xen/swiotlb: Pass addresses rather than frame numbers to xen_arch_need_swiotlb arm/xen: Add support for 64KB page granularity xen/privcmd: Add support for Linux 64KB page granularity net/xen-netback: Make it running on 64KB page granularity net/xen-netfront: Make it running on 64KB page granularity block/xen-blkback: Make it running on 64KB page granularity block/xen-blkfront: Make it running on 64KB page granularity ...
2015-11-04Merge tag 'arc-4.4-rc1' of ↵Linus Torvalds2-53/+49
git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC updates from Vineet Gupta: - Support for new MM features in ARCv2 cores (THP, PAE40) Some generic THP bits are touched - all ACKed by Kirill - Platform framework updates to prepare for EZChip arrival (still in works) - ARC Public Mailing list setup finally (linux-snps-arc@lists.infraded.org) * tag 'arc-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: (42 commits) ARC: mm: PAE40 support ARC: mm: PAE40: tlbex.S: Explicitify the size of pte_t ARC: mm: PAE40: switch to using phys_addr_t for physical addresses ARC: mm: HIGHMEM: populate high memory from DT ARC: mm: HIGHMEM: kmap API implementation ARC: mm: preps ahead of HIGHMEM support #2 ARC: mm: preps ahead of HIGHMEM support ARC: mm: use generic macros _BITUL()/_AC() ARC: mm: Improve Duplicate PD Fault handler MAINTAINERS: Add public mailing list for ARC ARC: Ensure DT mem base is same as what kernel is built with ARC: boot: Non Master cpus only need to call EARLY_CPU_SETUP once ARCv2: smp: [plat-*]: No need to explicitly call mcip_init_smp() ARC: smp: Introduce smp hook @init_irq_cpu called for all cores ARC: smp: Rename platform hook @init_smp -> @init_cpu_smp ARCv2: smp: [plat-*]: No need to explicitly call mcip_init_early_smp() ARC: smp: Introduce smp hook @init_early_smp for Master core ARC: remove @init_time, @init_irq platform callbacks ARC: smp: irqchip: handle IPI as percpu irq like timer ARC: boot: Support Halt-on-reset and Run-on-reset SMP booting modes ...