summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2014-09-26mm, x86: Account for TLB flushes only when debuggingMel Gorman1-1/+3
commit ec65993443736a5091b68e80ff1734548944a4b8 upstream. Bisection between 3.11 and 3.12 fingered commit 9824cf97 ("mm: vmstats: tlb flush counters") to cause overhead problems. The counters are undeniably useful but how often do we really need to debug TLB flush related issues? It does not justify taking the penalty everywhere so make it a debugging option. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Alex Shi <alex.shi@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-XzxjntugxuwpxXhcrxqqh53b@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26mm: __rmqueue_fallback() should respect pageblock typeKOSAKI Motohiro1-10/+5
commit 0cbef29a782162a3896487901eca4550bfa397ef upstream. When __rmqueue_fallback() doesn't find a free block with the required size it splits a larger page and puts the rest of the page onto the free list. But it has one serious mistake. When putting back, __rmqueue_fallback() always use start_migratetype if type is not CMA. However, __rmqueue_fallback() is only called when all of the start_migratetype queue is empty. That said, __rmqueue_fallback always puts back memory to the wrong queue except try_to_steal_freepages() changed pageblock type (i.e. requested size is smaller than half of page block). The end result is that the antifragmentation framework increases fragmenation instead of decreasing it. Mel's original anti fragmentation does the right thing. But commit 47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it. This patch restores sane and old behavior. It also removes an incorrect comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c: restructure free-page stealing code and fix a bug"). Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26mm: get rid of unnecessary overhead of trace_mm_page_alloc_extfrag()KOSAKI Motohiro1-3/+2
commit 52c8f6a5aeb0bdd396849ecaa72d96f8175528f5 upstream. In general, every tracepoint should be zero overhead if it is disabled. However, trace_mm_page_alloc_extfrag() is one of exception. It evaluate "new_type == start_migratetype" even if tracepoint is disabled. However, the code can be moved into tracepoint's TP_fast_assign() and TP_fast_assign exist exactly such purpose. This patch does it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26readahead: fix sequential read cache miss detectionDamien Ramonda1-1/+5
commit af248a0c67457e5c6d2bcf288f07b4b2ed064f1f upstream. The kernel's readahead algorithm sometimes interprets random read accesses as sequential and triggers unnecessary data prefecthing from storage device (impacting random read average latency). In order to identify sequential cache read misses, the readahead algorithm intends to check whether offset - previous offset == 1 (trivial sequential reads) or offset - previous offset == 0 (sequential reads not aligned on page boundary): if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) The current offset is stored in the "offset" variable of type "pgoff_t" (unsigned long), while previous offset is stored in "ra->prev_pos" of type "loff_t" (long long). Therefore, operands of the if statement are implicitly converted to type long long. Consequently, when previous offset > current offset (which happens on random pattern), the if condition is true and access is wrongly interpeted as sequential. An unnecessary data prefetching is triggered, impacting the average random read latency. Storing the previous offset value in a "pgoff_t" variable (unsigned long) fixes the sequential read detection logic. Signed-off-by: Damien Ramonda <damien.ramonda@intel.com> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Pierre Tardy <pierre.tardy@intel.com> Acked-by: David Cohen <david.a.cohen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26swap: change swap_list_head to plist, add swap_avail_headDan Streetman2-57/+94
commit 18ab4d4ced0817421e6db6940374cc39d28d65da upstream. Originally get_swap_page() started iterating through the singly-linked list of swap_info_structs using swap_list.next or highest_priority_index, which both were intended to point to the highest priority active swap target that was not full. The first patch in this series changed the singly-linked list to a doubly-linked list, and removed the logic to start at the highest priority non-full entry; it starts scanning at the highest priority entry each time, even if the entry is full. Replace the manually ordered swap_list_head with a plist, swap_active_head. Add a new plist, swap_avail_head. The original swap_active_head plist contains all active swap_info_structs, as before, while the new swap_avail_head plist contains only swap_info_structs that are active and available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect the swap_avail_head list. Mel Gorman suggested using plists since they internally handle ordering the list entries based on priority, which is exactly what swap was doing manually. All the ordering code is now removed, and swap_info_struct entries and simply added to their corresponding plist and automatically ordered correctly. Using a new plist for available swap_info_structs simplifies and optimizes get_swap_page(), which no longer has to iterate over full swap_info_structs. Using a new spinlock for swap_avail_head plist allows each swap_info_struct to add or remove themselves from the plist when they become full or not-full; previously they could not do so because the swap_info_struct->lock is held when they change from full<->not-full, and the swap_lock protecting the main swap_active_head must be ordered before any swap_info_struct->lock. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Shaohua Li <shli@fusionio.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Cc: Weijie Yang <weijieut@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Bob Liu <bob.liu@oracle.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26swap: change swap_info singly-linked list to list_headDan Streetman2-108/+76
commit adfab836f4908deb049a5128082719e689eed964 upstream. The logic controlling the singly-linked list of swap_info_struct entries for all active, i.e. swapon'ed, swap targets is rather complex, because: - it stores the entries in priority order - there is a pointer to the highest priority entry - there is a pointer to the highest priority not-full entry - there is a highest_priority_index variable set outside the swap_lock - swap entries of equal priority should be used equally this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181 where different priority swap targets are incorrectly used equally. That bug probably could be solved with the existing singly-linked lists, but I think it would only add more complexity to the already difficult to understand get_swap_page() swap_list iteration logic. The first patch changes from a singly-linked list to a doubly-linked list using list_heads; the highest_priority_index and related code are removed and get_swap_page() starts each iteration at the highest priority swap_info entry, even if it's full. While this does introduce unnecessary list iteration (i.e. Schlemiel the painter's algorithm) in the case where one or more of the highest priority entries are full, the iteration and manipulation code is much simpler and behaves correctly re: the above bug; and the fourth patch removes the unnecessary iteration. The second patch adds some minor plist helper functions; nothing new really, just functions to match existing regular list functions. These are used by the next two patches. The third patch adds plist_requeue(), which is used by get_swap_page() in the next patch - it performs the requeueing of same-priority entries (which moves the entry to the end of its priority in the plist), so that all equal-priority swap_info_structs get used equally. The fourth patch converts the main list into a plist, and adds a new plist that contains only swap_info entries that are both active and not full. As Mel suggested using plists allows removing all the ordering code from swap - plists handle ordering automatically. The list naming is also clarified now that there are two lists, with the original list changed from swap_list_head to swap_active_head and the new list named swap_avail_head. A new spinlock is also added for the new list, so swap_info entries can be added or removed from the new list immediately as they become full or not full. This patch (of 4): Replace the singly-linked list tracking active, i.e. swapon'ed, swap_info_struct entries with a doubly-linked list using struct list_heads. Simplify the logic iterating and manipulating the list of entries, especially get_swap_page(), by using standard list_head functions, and removing the highest priority iteration logic. The change fixes the bug: https://lkml.org/lkml/2014/2/13/181 in which different priority swap entries after the highest priority entry are incorrectly used equally in pairs. The swap behavior is now as advertised, i.e. different priority swap entries are used in order, and equal priority swap targets are used concurrently. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Shaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Cc: Weijie Yang <weijieut@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Bob Liu <bob.liu@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26mm: exclude memoryless nodes from zone_reclaimMichal Hocko1-2/+3
commit 70ef57e6c22c3323dce179b7d0d433c479266612 upstream. We had a report about strange OOM killer strikes on a PPC machine although there was a lot of swap free and a tons of anonymous memory which could be swapped out. In the end it turned out that the OOM was a side effect of zone reclaim which wasn't unmapping and swapping out and so the system was pushed to the OOM. Although this sounds like a bug somewhere in the kswapd vs. zone reclaim vs. direct reclaim interaction numactl on the said hardware suggests that the zone reclaim should not have been set in the first place: node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 node 0 size: 0 MB node 0 free: 0 MB node 2 cpus: node 2 size: 7168 MB node 2 free: 6019 MB node distances: node 0 2 0: 10 40 2: 40 10 So all the CPUs are associated with Node0 which doesn't have any memory while Node2 contains all the available memory. Node distances cause an automatic zone_reclaim_mode enabling. Zone reclaim is intended to keep the allocations local but this doesn't make any sense on the memoryless nodes. So let's exclude such nodes for init_zone_allows_reclaim which evaluates zone reclaim behavior and suitable reclaim_nodes. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26hugetlb: ensure hugepage access is denied if hugepages are not supportedNishanth Aravamudan1-0/+13
commit 457c1b27ed56ec472d202731b12417bff023594a upstream. Currently, I am seeing the following when I `mount -t hugetlbfs /none /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's related to the fact that hugetlbfs is properly not correctly setting itself up in this state?: Unable to handle kernel paging request for data at address 0x00000031 Faulting instruction address: 0xc000000000245710 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=2048 NUMA pSeries .... In KVM guests on Power, in a guest not backed by hugepages, we see the following: AnonHugePages: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 64 kB HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages are not supported at boot-time, but this is only checked in hugetlb_init(). Extract the check to a helper function, and use it in a few relevant places. This does make hugetlbfs not supported (not registered at all) in this environment. I believe this is fine, as there are no valid hugepages and that won't change at runtime. [akpm@linux-foundation.org: use pr_info(), per Mel] [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined] Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26mm: fix bad rss-counter if remap_file_pages raced migrationHugh Dickins1-6/+22
commit 887843961c4b4681ee993c36d4997bf4b4aa8253 upstream. Fix some "Bad rss-counter state" reports on exit, arising from the interaction between page migration and remap_file_pages(): zap_pte() must count a migration entry when zapping it. And yes, it is possible (though very unusual) to find an anon page or swap entry in a VM_SHARED nonlinear mapping: coming from that horrid get_user_pages(write, force) case which COWs even in a shared mapping. Signed-off-by: Hugh Dickins <hughd@google.com> Tested-by: Sasha Levin sasha.levin@oracle.com> Tested-by: Dave Jones davej@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26mm: prevent setting of a value less than 0 to min_free_kbytesHan Pingtian1-1/+6
commit da8c757b080ee84f219fa2368cb5dd23ac304fc0 upstream. If echo -1 > /proc/vm/sys/min_free_kbytes, the system will hang. Changing proc_dointvec() to proc_dointvec_minmax() in the min_free_kbytes_sysctl_handler() can prevent this to happen. mhocko said: : You can still do echo $BIG_VALUE > /proc/vm/sys/min_free_kbytes and make : your machine unusable but I agree that proc_dointvec_minmax is more : suitable here as we already have: : : .proc_handler = min_free_kbytes_sysctl_handler, : .extra1 = &zero, : : It used to work properly but then 6fce56ec91b5 ("sysctl: Remove references : to ctl_name and strategy from the generic sysctl table") has removed : sysctl_intvec strategy and so extra1 is ignored. Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26slab: correct pfmemalloc checkJoonsoo Kim1-3/+5
commit 73293c2f900d0adbb6a415b312cd57976d5ae242 upstream. We checked pfmemalloc by slab unit, not page unit. You can see this in is_slab_pfmemalloc(). So other pages don't need to be set/cleared pfmemalloc. And, therefore we should check pfmemalloc in page flag of first page, but current implementation don't do that. virt_to_head_page(obj) just return 'struct page' of that object, not one of first page, since the SLAB don't use __GFP_COMP when CONFIG_MMU. To get 'struct page' of first page, we first get a slab and try to get it via virt_to_head_page(slab->s_mem). Acked-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Pekka Enberg <penberg@iki.fi> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26mm: thp: khugepaged: add policy for finding target nodeBob Liu1-9/+44
commit 9f1b868a13ac36bd207a571f5ea1193d823ab18d upstream. Khugepaged will scan/free HPAGE_PMD_NR normal pages and replace with a hugepage which is allocated from the node of the first scanned normal page, but this policy is too rough and may end with unexpected result to upper users. The problem is the original page-balancing among all nodes will be broken after hugepaged started. Thinking about the case if the first scanned normal page is allocated from node A, most of other scanned normal pages are allocated from node B or C.. But hugepaged will always allocate hugepage from node A which will cause extra memory pressure on node A which is not the situation before khugepaged started. This patch try to fix this problem by making khugepaged allocate hugepage from the node which have max record of scaned normal pages hit, so that the effect to original page-balancing can be minimized. The other problem is if normal scanned pages are equally allocated from Node A,B and C, after khugepaged started Node A will still suffer extra memory pressure. Andrew Davidoff reported a related issue several days ago. He wanted his application interleaving among all nodes and "numactl --interleave=all ./test" was used to run the testcase, but the result wasn't not as expected. cat /proc/2814/numa_maps: 7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098 The end result showed that most pages are from Node3 instead of interleave among node0-3 which was unreasonable. This patch also fix this issue by allocating hugepage round robin from all nodes have the same record, after this patch the result was as expected: 7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722 The simple testcase is like this: int main() { char *p; int i; int j; for (i=0; i < 200; i++) { p = (char *)malloc(1048576); printf("malloc done\n"); if (p == 0) { printf("Out of memory\n"); return 1; } for (j=0; j < 1048576; j++) { p[j] = 'A'; } printf("touched memory\n"); sleep(1); } printf("enter sleep\n"); while(1) { sleep(100); } } [akpm@linux-foundation.org: make last_khugepaged_target_node local to khugepaged_find_target_node()] Reported-by: Andrew Davidoff <davidoff@qedmf.net> Tested-by: Andrew Davidoff <davidoff@qedmf.net> Signed-off-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26mm: thp: cleanup: mv alloc_hugepage to better placeBob Liu1-8/+6
commit 10dc4155c7714f508fe2e4667164925ea971fb25 upstream. Move alloc_hugepage() to a better place, no need for a seperate #ifndef CONFIG_NUMA Signed-off-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrew Davidoff <davidoff@qedmf.net> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-17vm_is_stack: use for_each_thread() rather then buggy while_each_thread()Oleg Nesterov1-6/+3
commit 4449a51a7c281602d3a385044ab928322a122a02 upstream. Aleksei hit the soft lockup during reading /proc/PID/smaps. David investigated the problem and suggested the right fix. while_each_thread() is racy and should die, this patch updates vm_is_stack(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Aleksei Besogonov <alex.besogonov@gmail.com> Tested-by: Aleksei Besogonov <alex.besogonov@gmail.com> Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-08-26mm: hugetlb: fix copy_hugetlb_page_range()Naoya Horiguchi1-0/+1
commit 0253d634e0803a8376a0d88efee0bf523d8673f9 upstream. Commit 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle migration/hwpoisoned entry") changed the order of huge_ptep_set_wrprotect() and huge_ptep_get(), which leads to breakage in some workloads like hugepage-backed heap allocation via libhugetlbfs. This patch fixes it. The test program for the problem is shown below: $ cat heap.c #include <unistd.h> #include <stdlib.h> #include <string.h> #define HPS 0x200000 int main() { int i; char *p = malloc(HPS); memset(p, '1', HPS); for (i = 0; i < 5; i++) { if (!fork()) { memset(p, '2', HPS); p = malloc(HPS); memset(p, '3', HPS); free(p); return 0; } } sleep(1); free(p); return 0; } $ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap Fixes 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle migration/hwpoisoned entry"), so is applicable to -stable kernels which include it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Guillaume Morin <guillaume@morinfr.org> Suggested-by: Guillaume Morin <guillaume@morinfr.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> [2.6.37+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-08-26hugetlb: fix copy_hugetlb_page_range() to handle migration/hwpoisoned entryNaoya Horiguchi1-28/+43
commit 4a705fef986231a3e7a6b1a6d3c37025f021f49f upstream. There's a race between fork() and hugepage migration, as a result we try to "dereference" a swap entry as a normal pte, causing kernel panic. The cause of the problem is that copy_hugetlb_page_range() can't handle "swap entry" family (migration entry and hwpoisoned entry) so let's fix it. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-08-19memcg: oom_notify use-after-free fixMichal Hocko1-0/+4
commit 2bcf2e92c3918ce62ab4e934256e47e9a16d19c3 upstream. Paul Furtado has reported the following GPF: general protection fault: 0000 [#1] SMP Modules linked in: ipv6 dm_mod xen_netfront coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache raid0 xen_blkfront CPU: 3 PID: 3062 Comm: java Not tainted 3.16.0-rc5 #1 task: ffff8801cfe8f170 ti: ffff8801d2ec4000 task.ti: ffff8801d2ec4000 RIP: e030:mem_cgroup_oom_synchronize+0x140/0x240 RSP: e02b:ffff8801d2ec7d48 EFLAGS: 00010283 RAX: 0000000000000001 RBX: ffff88009d633800 RCX: 000000000000000e RDX: fffffffffffffffe RSI: ffff88009d630200 RDI: ffff88009d630200 RBP: ffff8801d2ec7da8 R08: 0000000000000012 R09: 00000000fffffffe R10: 0000000000000000 R11: 0000000000000000 R12: ffff88009d633800 R13: ffff8801d2ec7d48 R14: dead000000100100 R15: ffff88009d633a30 FS: 00007f1748bb4700(0000) GS:ffff8801def80000(0000) knlGS:0000000000000000 CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f4110300308 CR3: 00000000c05f7000 CR4: 0000000000002660 Call Trace: pagefault_out_of_memory+0x18/0x90 mm_fault_error+0xa9/0x1a0 __do_page_fault+0x478/0x4c0 do_page_fault+0x2c/0x40 page_fault+0x28/0x30 Code: 44 00 00 48 89 df e8 40 ca ff ff 48 85 c0 49 89 c4 74 35 4c 8b b0 30 02 00 00 4c 8d b8 30 02 00 00 4d 39 fe 74 1b 0f 1f 44 00 00 <49> 8b 7e 10 be 01 00 00 00 e8 42 d2 04 00 4d 8b 36 4d 39 fe 75 RIP mem_cgroup_oom_synchronize+0x140/0x240 Commit fb2a6fc56be6 ("mm: memcg: rework and document OOM waiting and wakeup") has moved mem_cgroup_oom_notify outside of memcg_oom_lock assuming it is protected by the hierarchical OOM-lock. Although this is true for the notification part the protection doesn't cover unregistration of event which can happen in parallel now so mem_cgroup_oom_notify can see already unlinked and/or freed mem_cgroup_eventfd_list. Fix this by using memcg_oom_lock also in mem_cgroup_oom_notify. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=80881 Fixes: fb2a6fc56be6 (mm: memcg: rework and document OOM waiting and wakeup) Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Paul Furtado <paulfurtado91@gmail.com> Tested-by: Paul Furtado <paulfurtado91@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-08-19mm, thp: do not allow thp faults to avoid cpuset restrictionsDavid Rientjes1-8/+8
commit b104a35d32025ca740539db2808aa3385d0f30eb upstream. The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET should be set in allocflags. ALLOC_CPUSET controls if a page allocation should be restricted only to the set of allowed cpuset mems. Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent the fault path from using memory compaction or direct reclaim. Thus, it is unfairly able to allocate outside of its cpuset mems restriction as a side-effect. This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is truly GFP_ATOMIC by verifying it is also not a thp allocation. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Alex Thorlton <athorlton@sgi.com> Tested-by: Alex Thorlton <athorlton@sgi.com> Cc: Bob Liu <lliubbo@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Hedi Berriche <hedi@sgi.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-08-19mm/page-writeback.c: fix divide by zero in bdi_dirty_limits()Maxim Patlasov1-3/+3
commit f6789593d5cea42a4ecb1cbeab6a23ade5ebbba7 upstream. Under memory pressure, it is possible for dirty_thresh, calculated by global_dirty_limits() in balance_dirty_pages(), to equal zero. Then, if strictlimit is true, bdi_dirty_limits() tries to resolve the proportion: bdi_bg_thresh : bdi_thresh = background_thresh : dirty_thresh by dividing by zero. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-30slab_common: fix the check for duplicate slab namesMikulas Patocka1-1/+1
commit 694617474e33b8603fc76e090ed7d09376514b1a upstream. The patch 3e374919b314f20e2a04f641ebc1093d758f66a4 is supposed to fix the problem where kmem_cache_create incorrectly reports duplicate cache name and fails. The problem is described in the header of that patch. However, the patch doesn't really fix the problem because of these reasons: * the logic to test for debugging is reversed. It was intended to perform the check only if slub debugging is enabled (which implies that caches with the same parameters are not merged). Therefore, there should be #if !defined(CONFIG_SLUB) || defined(CONFIG_SLUB_DEBUG_ON) The current code has the condition reversed and performs the test if debugging is disabled. * slub debugging may be enabled or disabled based on kernel command line, CONFIG_SLUB_DEBUG_ON is just the default settings. Therefore the test based on definition of CONFIG_SLUB_DEBUG_ON is unreliable. This patch fixes the problem by removing the test "!defined(CONFIG_SLUB_DEBUG_ON)". Therefore, duplicate names are never checked if the SLUB allocator is used. Note to stable kernel maintainers: when backporint this patch, please backport also the patch 3e374919b314f20e2a04f641ebc1093d758f66a4. Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-28mm: kmemleak: avoid false negatives on vmalloc'ed objectsCatalin Marinas2-5/+13
commit 7f88f88f83ed609650a01b18572e605ea50cd163 upstream. Commit 248ac0e1943a ("mm/vmalloc: remove guard page from between vmap blocks") had the side effect of making vmap_area.va_end member point to the next vmap_area.va_start. This was creating an artificial reference to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks. This patch marks the vmap_area containing pointers explicitly and reduces the min ref_count to 2 as vm_struct still contains a reference to the vmalloc'ed object. The kmemleak add_scan_area() function has been improved to allow a SIZE_MAX argument covering the rest of the object (for simpler calling sites). Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-28shmem: fix splicing from a hole while it's punchedHugh Dickins1-9/+15
commit b1a366500bd537b50c3aad26dc7df083ec03a448 upstream. shmem_fault() is the actual culprit in trinity's hole-punch starvation, and the most significant cause of such problems: since a page faulted is one that then appears page_mapped(), needing unmap_mapping_range() and i_mmap_mutex to be unmapped again. But it is not the only way in which a page can be brought into a hole in the radix_tree while that hole is being punched; and Vlastimil's testing implies that if enough other processors are busy filling in the hole, then shmem_undo_range() can be kept from completing indefinitely. shmem_file_splice_read() is the main other user of SGP_CACHE, which can instantiate shmem pagecache pages in the read-only case (without holding i_mutex, so perhaps concurrently with a hole-punch). Probably it's silly not to use SGP_READ already (using the ZERO_PAGE for holes): which ought to be safe, but might bring surprises - not a change to be rushed. shmem_read_mapping_page_gfp() is an internal interface used by drivers/gpu/drm GEM (and next by uprobes): it should be okay. And shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when called internally by the kernel (perhaps for a stacking filesystem, which might rely on holes to be reserved): it's unclear whether it could be provoked to keep hole-punch busy or not. We could apply the same umbrella as now used in shmem_fault() to shmem_file_splice_read() and the others; but it looks ugly, and use over a range raises questions - should it actually be per page? can these get starved themselves? The origin of this part of the problem is my v3.1 commit d0823576bf4b ("mm: pincer in truncate_inode_pages_range"), once it was duplicated into shmem.c. It seemed like a nice idea at the time, to ensure (barring RCU lookup fuzziness) that there's an instant when the entire hole is empty; but the indefinitely repeated scans to ensure that make it vulnerable. Revert that "enhancement" to hole-punch from shmem_undo_range(), but retain the unproblematic rescanning when it's truncating; add a couple of comments there. Remove the "indices[0] >= end" test: that is now handled satisfactorily by the inner loop, and mem_cgroup_uncharge_start()/end() are too light to be worth avoiding here. But if we do not always loop indefinitely, we do need to handle the case of swap swizzled back to page before shmem_free_swap() gets it: add a retry for that case, as suggested by Konstantin Khlebnikov; and for the case of page swizzled back to swap, as suggested by Johannes Weiner. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Suggested-by: Vlastimil Babka <vbabka@suse.cz> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lukas Czerner <lczerner@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: <stable@vger.kernel.org> [3.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-28shmem: fix faulting into a hole, not taking i_mutexHugh Dickins1-26/+52
commit 8e205f779d1443a94b5ae81aa359cb535dd3021e upstream. Commit f00cdc6df7d7 ("shmem: fix faulting into a hole while it's punched") was buggy: Sasha sent a lockdep report to remind us that grabbing i_mutex in the fault path is a no-no (write syscall may already hold i_mutex while faulting user buffer). We tried a completely different approach (see following patch) but that proved inadequate: good enough for a rational workload, but not good enough against trinity - which forks off so many mappings of the object that contention on i_mmap_mutex while hole-puncher holds i_mutex builds into serious starvation when concurrent faults force the puncher to fall back to single-page unmap_mapping_range() searches of the i_mmap tree. So return to the original umbrella approach, but keep away from i_mutex this time. We really don't want to bloat every shmem inode with a new mutex or completion, just to protect this unlikely case from trinity. So extend the original with wait_queue_head on stack at the hole-punch end, and wait_queue item on the stack at the fault end. This involves further use of i_lock to guard against the races: lockdep has been happy so far, and I see fs/inode.c:unlock_new_inode() holds i_lock around wake_up_bit(), which is comparable to what we do here. i_lock is more convenient, but we could switch to shmem's info->lock. This issue has been tagged with CVE-2014-4171, which will require commit f00cdc6df7d7 and this and the following patch to be backported: we suggest to 3.1+, though in fact the trinity forkbomb effect might go back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might not, since much has changed, with i_mmap_mutex a spinlock before 3.0. Anyone running trinity on 3.0 and earlier? I don't think we need care. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lukas Czerner <lczerner@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: <stable@vger.kernel.org> [3.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-28shmem: fix faulting into a hole while it's punchedHugh Dickins1-4/+52
commit f00cdc6df7d7cfcabb5b740911e6788cb0802bdb upstream. Trinity finds that mmap access to a hole while it's punched from shmem can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE) from completing, until the reader chooses to stop; with the puncher's hold on i_mutex locking out all other writers until it can complete. It appears that the tmpfs fault path is too light in comparison with its hole-punching path, lacking an i_data_sem to obstruct it; but we don't want to slow down the common case. Extend shmem_fallocate()'s existing range notification mechanism, so shmem_fault() can refrain from faulting pages into the hole while it's punched, waiting instead on i_mutex (when safe to sleep; or repeatedly faulting when not). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-22Don't trigger congestion wait on dirty-but-not-writeout pagesLinus Torvalds1-6/+5
commit b738d764652dc5aab1c8939f637112981fce9e0e upstream. shrink_inactive_list() used to wait 0.1s to avoid congestion when all the pages that were isolated from the inactive list were dirty but not under active writeback. That makes no real sense, and apparently causes major interactivity issues under some loads since 3.11. The ostensible reason for it was to wait for kswapd to start writing pages, but that seems questionable as well, since the congestion wait code seems to trigger for kswapd itself as well. Also, the logic behind delaying anything when we haven't actually started writeback is not clear - it only delays actually starting that writeback. We'll still trigger the congestion waiting if (a) the process is kswapd, and we hit pages flagged for immediate reclaim (b) the process is not kswapd, and the zone backing dev writeback is actually congested. This probably needs to be revisited, but as it is this fixes a reported regression. [mhocko@suse.cz: backport to 3.12 stable tree] Fixes: e2be15f6c3ee ('mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered') Reported-by: Felipe Contreras <felipe.contreras@gmail.com> Pinpointed-by: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-18cpuset,mempolicy: fix sleeping function called from invalid contextGu Zheng1-2/+0
commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream. When runing with the kernel(3.15-rc7+), the follow bug occurs: [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586 [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python [ 9969.441175] INFO: lockdep is turned off. [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85 [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012 [ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18 [ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c [ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000 [ 9969.974071] Call Trace: [ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66 [ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130 [ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0 [ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210 [ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140 [ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150 [ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150 [ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150 [ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40 [ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10 [ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50 [ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40 [ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380 [ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0 [ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20 [ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90 [ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b The cause is that cpuset_mems_allowed() try to take mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in __mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock protection region to protect the access to cpuset only in current_cpuset_is_being_rebound(). So that we can avoid this bug. This patch is a temporary solution that just addresses the bug mentioned above, can not fix the long-standing issue about cpuset.mems rebinding on fork(): "When the forker's task_struct is duplicated (which includes ->mems_allowed) and it races with an update to cpuset_being_rebound in update_tasks_nodemask() then the task's mems_allowed doesn't get updated. And the child task's mems_allowed can be wrong if the cpuset's nodemask changes before the child has been added to the cgroup's tasklist." Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-18mm: fix crashes from mbind() merging vmasHugh Dickins1-26/+20
commit d05f0cdcbe6388723f1900c549b4850360545201 upstream. In v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") introduced vma merging to mbind(), but it should have also changed the convention of passing start vma from queue_pages_range() (formerly check_range()) to new_vma_page(): vma merging may have already freed that structure, resulting in BUG at mm/mempolicy.c:1738 and probably worse crashes. Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem") Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-18mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDERMichal Nazarewicz1-2/+14
commit dc78327c0ea7da5186d8cbc1647bd6088c5c9fa5 upstream. With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE, the following is triggered at early boot: SMP: Total of 8 processors activated. devtmpfs: initialized Unable to handle kernel NULL pointer dereference at virtual address 00000008 pgd = fffffe0000050000 [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407 Internal error: Oops: 96000006 [#1] SMP Modules linked in: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44 task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000 PC is at __list_add+0x10/0xd4 LR is at free_one_page+0x270/0x638 ... Call trace: __list_add+0x10/0xd4 free_one_page+0x26c/0x638 __free_pages_ok.part.52+0x84/0xbc __free_pages+0x74/0xbc init_cma_reserved_pageblock+0xe8/0x104 cma_init_reserved_areas+0x190/0x1e4 do_one_initcall+0xc4/0x154 kernel_init_freeable+0x204/0x2a8 kernel_init+0xc/0xd4 This happens because init_cma_reserved_pageblock() calls __free_one_page() with pageblock_order as page order but it is bigger than MAX_ORDER. This in turn causes accesses past zone->free_list[]. Fix the problem by changing init_cma_reserved_pageblock() such that it splits pageblock into individual MAX_ORDER pages if pageblock is bigger than a MAX_ORDER page. In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all architectures expect for ia64, powerpc and tile at the moment, the “pageblock_order > MAX_ORDER” condition will be optimised out since both sides of the operator are constants. In cases where pageblock size is variable, the performance degradation should not be significant anyway since init_cma_reserved_pageblock() is called only at boot time at most MAX_CMA_AREAS times which by default is eight. Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Reported-by: Mark Salter <msalter@redhat.com> Tested-by: Mark Salter <msalter@redhat.com> Tested-by: Christopher Covington <cov@codeaurora.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-18mm, pcp: allow restoring percpu_pagelist_fraction defaultDavid Rientjes1-12/+28
commit 7cd2b0a34ab8e4db971920eef8982f985441adfb upstream. Oleg reports a division by zero error on zero-length write() to the percpu_pagelist_fraction sysctl: divide error: 0000 [#1] SMP DEBUG_PAGEALLOC CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000 RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120 RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246 RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010 RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50 R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060 R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800 FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0 Call Trace: proc_sys_call_handler+0xb3/0xc0 proc_sys_write+0x14/0x20 vfs_write+0xba/0x1e0 SyS_write+0x46/0xb0 tracesys+0xe1/0xe6 However, if the percpu_pagelist_fraction sysctl is set by the user, it is also impossible to restore it to the kernel default since the user cannot write 0 to the sysctl. This patch allows the user to write 0 to restore the default behavior. It still requires a fraction equal to or larger than 8, however, as stated by the documentation for sanity. If a value in the range [1, 7] is written, the sysctl will return EINVAL. This successfully solves the divide by zero issue at the same time. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Oleg Drokin <green@linuxhacker.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02ext4: fix data integrity sync in ordered modeNamjae Jeon1-5/+6
commit 1c8349a17137b93f0a83f276c764a6df1b9a116e upstream. When we perform a data integrity sync we tag all the dirty pages with PAGECACHE_TAG_TOWRITE at start of ext4_da_writepages. Later we check for this tag in write_cache_pages_da and creates a struct mpage_da_data containing contiguously indexed pages tagged with this tag and sync these pages with a call to mpage_da_map_and_submit. This process is done in while loop until all the PAGECACHE_TAG_TOWRITE pages are synced. We also do journal start and stop in each iteration. journal_stop could initiate journal commit which would call ext4_writepage which in turn will call ext4_bio_write_page even for delayed OR unwritten buffers. When ext4_bio_write_page is called for such buffers, even though it does not sync them but it clears the PAGECACHE_TAG_TOWRITE of the corresponding page and hence these pages are also not synced by the currently running data integrity sync. We will end up with dirty pages although sync is completed. This could cause a potential data loss when the sync call is followed by a truncate_pagecache call, which is exactly the case in collapse_range. (It will cause generic/127 failure in xfstests) To avoid this issue, we can use set_page_writeback_keepwrite instead of set_page_writeback, which doesn't clear TOWRITE tag. Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02mm: vmscan: clear kswapd's special reclaim powers before exitingJohannes Weiner1-0/+3
commit 71abdc15adf8c702a1dd535f8e30df50758848d2 upstream. When kswapd exits, it can end up taking locks that were previously held by allocating tasks while they waited for reclaim. Lockdep currently warns about this: On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote: > inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage. > kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes: > (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130 > {RECLAIM_FS-ON-W} state was registered at: > mark_held_locks+0xb9/0x140 > lockdep_trace_alloc+0x7a/0xe0 > kmem_cache_alloc_trace+0x37/0x240 > flex_array_alloc+0x99/0x1a0 > cgroup_attach_task+0x63/0x430 > attach_task_by_pid+0x210/0x280 > cgroup_procs_write+0x16/0x20 > cgroup_file_write+0x120/0x2c0 > vfs_write+0xc0/0x1f0 > SyS_write+0x4c/0xa0 > tracesys+0xdd/0xe2 > irq event stamp: 49 > hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70 > hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0 > softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0 > softirqs last disabled at (0): (null) > > other info that might help us debug this: > Possible unsafe locking scenario: > > CPU0 > ---- > lock(&sig->group_rwsem); > <Interrupt> > lock(&sig->group_rwsem); > > *** DEADLOCK *** > > no locks held by kswapd2/1151. > > stack backtrace: > CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4 > Call Trace: > dump_stack+0x19/0x1b > print_usage_bug+0x1f7/0x208 > mark_lock+0x21d/0x2a0 > __lock_acquire+0x52a/0xb60 > lock_acquire+0xa2/0x140 > down_read+0x51/0xa0 > exit_signals+0x24/0x130 > do_exit+0xb5/0xa50 > kthread+0xdb/0x100 > ret_from_fork+0x7c/0xb0 This is because the kswapd thread is still marked as a reclaimer at the time of exit. But because it is exiting, nobody is actually waiting on it to make reclaim progress anymore, and it's nothing but a regular thread at this point. Be tidy and strip it of all its powers (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state) before returning from the thread function. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02mm: fix sleeping function warning from __put_anon_vmaHugh Dickins1-3/+5
commit 7f39dda9d86fb4f4f17af0de170decf125726f8c upstream. Trinity reports BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:47 in_atomic(): 0, irqs_disabled(): 0, pid: 5787, name: trinity-c27 __might_sleep < down_write < __put_anon_vma < page_get_anon_vma < migrate_pages < compact_zone < compact_zone_order < try_to_compact_pages .. Right, since conversion to mutex then rwsem, we should not put_anon_vma() from inside an rcu_read_lock()ed section: fix the two places that did so. And add might_sleep() to anon_vma_free(), as suggested by Peter Zijlstra. Fixes: 88c22088bf23 ("mm: optimize page_lock_anon_vma() fast-path") Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02mm/memory-failure.c: support use of a dedicated thread to handle ↵Naoya Horiguchi1-13/+43
SIGBUS(BUS_MCEERR_AO) commit 3ba08129e38437561df44c36b7ea9081185d5333 upstream. Currently memory error handler handles action optional errors in the deferred manner by default. And if a recovery aware application wants to handle it immediately, it can do it by setting PF_MCE_EARLY flag. However, such signal can be sent only to the main thread, so it's problematic if the application wants to have a dedicated thread to handler such signals. So this patch adds dedicated thread support to memory error handler. We have PF_MCE_EARLY flags for each thread separately, so with this patch AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main thread. If you want to implement a dedicated thread, you call prctl() to set PF_MCE_EARLY on the thread. Memory error handler collects processes to be killed, so this patch lets it check PF_MCE_EARLY flag on each thread in the collecting routines. No behavioral change for all non-early kill cases. Tony said: : The old behavior was crazy - someone with a multithreaded process might : well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then : that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if : that thread wasn't the main thread for the process. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: Kamil Iskra <iskra@mcs.anl.gov> Cc: Andi Kleen <andi@firstfloor.org> Cc: Borislav Petkov <bp@suse.de> Cc: Chen Gong <gong.chen@linux.jf.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02mm/memory-failure.c: don't let collect_procs() skip over processes for ↵Tony Luck1-9/+12
MF_ACTION_REQUIRED commit 74614de17db6fb472370c426d4f934d8d616edf2 upstream. When Linux sees an "action optional" machine check (where h/w has reported an error that is not in the current execution path) we generally do not want to signal a process, since most processes do not have a SIGBUS handler - we'd just prematurely terminate the process for a problem that they might never actually see. task_early_kill() decides whether to consider a process - and it checks whether this specific process has been marked for early signals with "prctl", or if the system administrator has requested early signals for all processes using /proc/sys/vm/memory_failure_early_kill. But for MF_ACTION_REQUIRED case we must not defer. The error is in the execution path of the current thread so we must send the SIGBUS immediatley. Fix by passing a flag argument through collect_procs*() to task_early_kill() so it knows whether we can defer or must take action. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Borislav Petkov <bp@suse.de> Cc: Chen Gong <gong.chen@linux.jf.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02mm/memory-failure.c-failure: send right signal code to correct threadTony Luck1-2/+2
commit a70ffcac741d31a406c1d2b832ae43d658e7e1cf upstream. When a thread in a multi-threaded application hits a machine check because of an uncorrectable error in memory - we want to send the SIGBUS with si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that if the active thread is not the primary thread in the process. collect_procs() just finds primary threads and this test: if ((flags & MF_ACTION_REQUIRED) && t == current) { will see that the thread we found isn't the current thread and so send a si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active thread at this time). We can fix this by checking whether "current" shares the same mm with the process that collect_procs() said owned the page. If so, we send the SIGBUS to current (with code BUS_MCEERR_AR). Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Otto Bruggeman <otto.g.bruggeman@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Borislav Petkov <bp@suse.de> Cc: Chen Gong <gong.chen@linux.jf.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02mm: page_alloc: use word-based accesses for get/set pageblock bitmapsMel Gorman1-20/+32
commit e58469bafd0524e848c3733bc3918d854595e20f upstream. The test_bit operations in get/set pageblock flags are expensive. This patch reads the bitmap on a word basis and use shifts and masks to isolate the bits of interest. Similarly masks are used to set a local copy of the bitmap and then use cmpxchg to update the bitmap if there have been no other changes made in parallel. In a test running dd onto tmpfs the overhead of the pageblock-related functions went from 1.27% in profiles to 0.5%. In addition to the performance benefits, this patch closes races that are possible between: a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype() reads part of the bits before and other part of the bits after set_pageblock_migratetype() has updated them. b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic read-modify-update set bit operation in set_pageblock_skip() will cause lost updates to some bits changed in the set_pageblock_migratetype(). Joonsoo Kim first reported the case a) via code inspection. Vlastimil Babka's testing with a debug patch showed that either a) or b) occurs roughly once per mmtests' stress-highalloc benchmark (although not necessarily in the same pageblock). Furthermore during development of unrelated compaction patches, it was observed that frequent calls to {start,undo}_isolate_page_range() the race occurs several thousands of times and has resulted in NULL pointer dereferences in move_freepages() and free_one_page() in places where free_list[migratetype] is manipulated by e.g. list_move(). Further debugging confirmed that migratetype had invalid value of 6, causing out of bounds access to the free_list array. That confirmed that the race exist, although it may be extremely rare, and currently only fatal where page isolation is performed due to memory hot remove. Races on pageblocks being updated by set_pageblock_migratetype(), where both old and new migratetype are lower MIGRATE_RESERVE, currently cannot result in an invalid value being observed, although theoretically they may still lead to unexpected creation or destruction of MIGRATE_RESERVE pageblocks. Furthermore, things could get suddenly worse when memory isolation is used more, or when new migratetypes are added. After this patch, the race has no longer been observed in testing. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ↵Mel Gorman1-6/+37
ZONE_NORMAL commit 675becce15f320337499bc1a9356260409a5ba29 upstream. throttle_direct_reclaim() is meant to trigger during swap-over-network during which the min watermark is treated as a pfmemalloc reserve. It throttes on the first node in the zonelist but this is flawed. The user-visible impact is that a process running on CPU whose local memory node has no ZONE_NORMAL will stall for prolonged periods of time, possibly indefintely. This is due to throttle_direct_reclaim thinking the pfmemalloc reserves are depleted when in fact they don't exist on that node. On a NUMA machine running a 32-bit kernel (I know) allocation requests from CPUs on node 1 would detect no pfmemalloc reserves and the process gets throttled. This patch adjusts throttling of direct reclaim to throttle based on the first node in the zonelist that has a usable ZONE_NORMAL or lower zone. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02hugetlb: restrict hugepage_migration_support() to x86_64Naoya Horiguchi1-0/+3
commit c177c81e09e517bbf75b67762cdab1b83aba6976 upstream. Currently hugepage migration is available for all archs which support pmd-level hugepage, but testing is done only for x86_64 and there're bugs for other archs. So to avoid breaking such archs, this patch limits the availability strictly to x86_64 until developers of other archs get interested in enabling this feature. Simply disabling hugepage migration on non-x86_64 archs is not enough to fix the reported problem where sys_move_pages() hits the BUG_ON() in follow_page(FOLL_GET), so let's fix this by checking if hugepage migration is supported in vma_migratable(). Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Acked-by: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02mm: add !pte_present() check on existing hugetlb_entry callbacksNaoya Horiguchi1-1/+5
commit d4c54919ed86302094c0ca7d48a8cbd4ee753e92 upstream. The age table walker doesn't check non-present hugetlb entry in common path, so hugetlb_entry() callbacks must check it. The reason for this behavior is that some callers want to handle it in its own way. [ I think that reason is bogus, btw - it should just do what the regular code does, which is to call the "pte_hole()" function for such hugetlb entries - Linus] However, some callers don't check it now, which causes unpredictable result, for example when we have a race between migrating hugepage and reading /proc/pid/numa_maps. This patch fixes it by adding !pte_present checks on buggy callbacks. This bug exists for years and got visible by introducing hugepage migration. ChangeLog v2: - fix if condition (check !pte_present() instead of pte_present()) Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Backported to 3.15. Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-20mm: rmap: fix use-after-free in __put_anon_vmaAndrey Ryabinin1-2/+1
commit 624483f3ea82598ab0f62f1bdb9177f531ab1892 upstream. While working address sanitizer for kernel I've discovered use-after-free bug in __put_anon_vma. For the last anon_vma, anon_vma->root freed before child anon_vma. Later in anon_vma_free(anon_vma) we are referencing to already freed anon_vma->root to check rwsem. This fixes it by freeing the child anon_vma before freeing anon_vma->root. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-20mm/memory-failure.c: fix memory leak by race between poison and unpoisonNaoya Horiguchi1-0/+2
commit 3e030ecc0fc7de10fd0da10c1c19939872a31717 upstream. When a memory error happens on an in-use page or (free and in-use) hugepage, the victim page is isolated with its refcount set to one. When you try to unpoison it later, unpoison_memory() calls put_page() for it twice in order to bring the page back to free page pool (buddy or free hugepage list). However, if another memory error occurs on the page which we are unpoisoning, memory_failure() returns without releasing the refcount which was incremented in the same call at first, which results in memory leak and unconsistent num_poisoned_pages statistics. This patch fixes it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-20mm/compaction: make isolate_freepages start at pageblock boundaryVlastimil Babka1-10/+12
commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream. The compaction freepage scanner implementation in isolate_freepages() starts by taking the current cc->free_pfn value as the first pfn. In a for loop, it scans from this first pfn to the end of the pageblock, and then subtracts pageblock_nr_pages from the first pfn to obtain the first pfn for the next for loop iteration. This means that when cc->free_pfn starts at offset X rather than being aligned on pageblock boundary, the scanner will start at offset X in all scanned pageblock, ignoring potentially many free pages. Currently this can happen when a) zone's end pfn is not pageblock aligned, or b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE enabled and a hole spanning the beginning of a pageblock This patch fixes the problem by aligning the initial pfn in isolate_freepages() to pageblock boundary. This also permits replacing the end-of-pageblock alignment within the for loop with a simple pageblock_nr_pages increment. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Heesub Shin <heesub.shin@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-20mm: compaction: detect when scanners meet in isolate_freepagesVlastimil Babka1-4/+15
commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream. Compaction of a zone is finished when the migrate scanner (which begins at the zone's lowest pfn) meets the free page scanner (which begins at the zone's highest pfn). This is detected in compact_zone() and in the case of direct compaction, the compact_blockskip_flush flag is set so that kswapd later resets the cached scanner pfn's, and a new compaction may again start at the zone's borders. The meeting of the scanners can happen during either scanner's activity. However, it may currently fail to be detected when it occurs in the free page scanner, due to two problems. First, isolate_freepages() keeps free_pfn at the highest block where it isolated pages from, for the purposes of not missing the pages that are returned back to allocator when migration fails. Second, failing to isolate enough free pages due to scanners meeting results in -ENOMEM being returned by migrate_pages(), which makes compact_zone() bail out immediately without calling compact_finished() that would detect scanners meeting. This failure to detect scanners meeting might result in repeated attempts at compaction of a zone that keep starting from the cached pfn's close to the meeting point, and quickly failing through the -ENOMEM path, without the cached pfns being reset, over and over. This has been observed (through additional tracepoints) in the third phase of the mmtests stress-highalloc benchmark, where the allocator runs on an otherwise idle system. The problem was observed in the DMA32 zone, which was used as a fallback to the preferred Normal zone, but on the 4GB system it was actually the largest zone. The problem is even amplified for such fallback zone - the deferred compaction logic, which could (after being fixed by a previous patch) reset the cached scanner pfn's, is only applied to the preferred zone and not for the fallbacks. The problem in the third phase of the benchmark was further amplified by commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") which resulted in a non-deterministic regression of the allocation success rate from ~85% to ~65%. This occurs in about half of benchmark runs, making bisection problematic. It is unlikely that the commit itself is buggy, but it should put more pressure on the DMA32 zone during phases 1 and 2, which may leave it more fragmented in phase 3 and expose the bugs that this patch fixes. The fix is to make scanners meeting in isolate_freepage() stay that way, and to check in compact_zone() for scanners meeting when migrate_pages() returns -ENOMEM. The result is that compact_finished() also detects scanners meeting and sets the compact_blockskip_flush flag to make kswapd reset the scanner pfn's. The results in stress-highalloc benchmark show that the "regression" by commit 81c0a2bb515f in phase 3 no longer occurs, and phase 1 and 2 allocation success rates are also significantly improved. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-20mm: compaction: reset cached scanner pfn's before reading themVlastimil Babka1-8/+8
commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream. Compaction caches pfn's for its migrate and free scanners to avoid scanning the whole zone each time. In compact_zone(), the cached values are read to set up initial values for the scanners. There are several situations when these cached pfn's are reset to the first and last pfn of the zone, respectively. One of these situations is when a compaction has been deferred for a zone and is now being restarted during a direct compaction, which is also done in compact_zone(). However, compact_zone() currently reads the cached pfn's *before* resetting them. This means the reset doesn't affect the compaction that performs it, and with good chance also subsequent compactions, as update_pageblock_skip() is likely to be called and update the cached pfn's to those being processed. Another chance for a successful reset is when a direct compaction detects that migration and free scanners meet (which has its own problems addressed by another patch) and sets update_pageblock_skip flag which kswapd uses to do the reset because it goes to sleep. This is clearly a bug that results in non-deterministic behavior, so this patch moves the cached pfn reset to be performed *before* the values are read. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-09percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree()Jianyu Zhan1-1/+1
commit 5a838c3b60e3a36ade764cf7751b8f17d7c9c2da upstream. pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) + BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long) It hardly could be ever bigger than PAGE_SIZE even for large-scale machine, but for consistency with its couterpart pcpu_mem_zalloc(), use pcpu_mem_free() instead. Commit b4916cb17c26 ("percpu: make pcpu_free_chunk() use pcpu_mem_free() instead of kfree()") addressed this problem, but missed this one. tj: commit message updated Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 099a19d91ca4 ("percpu: allow limited allocation before slab is online) Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-09mm/page-writeback.c: fix divide by zero in pos_ratio_polynomRik van Riel1-3/+3
commit d5c9fde3dae750889168807038243ff36431d276 upstream. It is possible for "limit - setpoint + 1" to equal zero, after getting truncated to a 32 bit variable, and resulting in a divide by zero error. Using the fully 64 bit divide functions avoids this problem. It also will cause pos_ratio_polynom() to return the correct value when (setpoint - limit) exceeds 2^32. Also uninline pos_ratio_polynom, at Andrew's request. Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-06hwpoison, hugetlb: lock_page/unlock_page does not match for handling a free ↵Chen Yucong1-7/+8
hugepage commit b985194c8c0a130ed155b71662e39f7eaea4876f upstream. For handling a free hugepage in memory failure, the race will happen if another thread hwpoisoned this hugepage concurrently. So we need to check PageHWPoison instead of !PageHWPoison. If hwpoison_filter(p) returns true or a race happens, then we need to unlock_page(hpage). Signed-off-by: Chen Yucong <slaoub@gmail.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-06mm, thp: close race between mremap() and split_huge_page()Kirill A. Shutemov1-1/+8
commit dd18dbc2d42af75fffa60c77e0f02220bc329829 upstream. It's critical for split_huge_page() (and migration) to catch and freeze all PMDs on rmap walk. It gets tricky if there's concurrent fork() or mremap() since usually we copy/move page table entries on dup_mm() or move_page_tables() without rmap lock taken. To get it work we rely on rmap walk order to not miss any entry. We expect to see destination VMA after source one to work correctly. But after switching rmap implementation to interval tree it's not always possible to preserve expected walk order. It works fine for dup_mm() since new VMA has the same vma_start_pgoff() / vma_last_pgoff() and explicitly insert dst VMA after src one with vma_interval_tree_insert_after(). But on move_vma() destination VMA can be merged into adjacent one and as result shifted left in interval tree. Fortunately, we can detect the situation and prevent race with rmap walk by moving page table entries under rmap lock. See commit 38a76013ad80. Problem is that we miss the lock when we move transhuge PMD. Most likely this bug caused the crash[1]. [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473 Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michel Lespinasse <walken@google.com> Cc: Dave Jones <davej@redhat.com> Cc: David Miller <davem@davemloft.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-06mm: make fixup_user_fault() check the vma access rights tooLinus Torvalds1-0/+5
commit 1b17844b29ae042576bea588164f2f1e9590a8bc upstream. fixup_user_fault() is used by the futex code when the direct user access fails, and the futex code wants it to either map in the page in a usable form or return an error. It relied on handle_mm_fault() to map the page, and correctly checked the error return from that, but while that does map the page, it doesn't actually guarantee that the page will be mapped with sufficient permissions to be then accessed. So do the appropriate tests of the vma access rights by hand. [ Side note: arguably handle_mm_fault() could just do that itself, but we have traditionally done it in the caller, because some callers - notably get_user_pages() - have been able to access pages even when they are mapped with PROT_NONE. Maybe we should re-visit that design decision, but in the meantime this is the minimal patch. ] Found by Dave Jones running his trinity tool. Reported-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-04mm, oom: prefer thread group leaders for display purposesDavid Rientjes2-11/+20
commit d49ad9355420c743c736bfd1dee9eaa5b1a7722a upstream. When two threads have the same badness score, it's preferable to kill the thread group leader so that the actual process name is printed to the kernel log rather than the thread group name which may be shared amongst several processes. This was the behavior when select_bad_process() used to do for_each_process(), but it now iterates threads instead and leads to ambiguity. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz>