summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2012-08-01mm, fadvise: don't return -EINVAL when filesystem cannot implement fadvise()KOSAKI Motohiro1-11/+7
Eric Wong reported his test suite failex when /tmp is tmpfs. https://lkml.org/lkml/2012/2/24/479 Currentlt the input check of POSIX_FADV_WILLNEED has two problems. - requires a_ops->readpage. But in fact, force_page_cache_readahead() requires that the target filesystem has either ->readpage or ->readpages. - returns -EINVAL when the filesystem doesn't have ->readpage. But posix says that fadvise is merely a hint. Thus fadvise() should return 0 if filesystem has no means of implementing fadvise(). The userland application should not know nor care whcih type of filesystem backs the TMPDIR directory, as Eric pointed out. There is nothing which userspace can do to solve this error. So change the return value to 0 when filesytem doesn't support readahead. [akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: Eric Wong <normalperson@yhbt.net> Tested-by: Eric Wong <normalperson@yhbt.net> Reviewed-by: Wanlong Gao <gaowanlong@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01memcg: make mem_cgroup_force_empty_list() return boolKAMEZAWA Hiroyuki1-12/+7
mem_cgroup_force_empty_list() just returns 0 or -EBUSY and -EBUSY indicates 'you need to retry'. Make mem_cgroup_force_empty_list() return a bool to simplify the logic. [akpm@linux-foundation.org: rework mem_cgroup_force_empty_list()'s comment] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01memcg: mem_cgroup_move_parent() doesn't need gfp_maskKAMEZAWA Hiroyuki1-3/+2
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01memcg: clean up force_empty_list() return value checkKamezawa Hiroyuki1-5/+0
After bf544fdc241da8 "memcg: move charges to root cgroup if use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()" mem_cgroup_move_parent() returns only -EBUSY or -EINVAL. So we can remove the -ENOMEM and -EINTR checks. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01memcg: remove check for signal_pending() during rmdir()Kamezawa Hiroyuki1-3/+0
After bf544fdc241da8 "memcg: move charges to root cgroup if use_hierarchy=0 in mem_cgroup_move_hugetlb_parent()", no memory reclaim will occur when removing a memory cgroup. If -EINTR is returned here, cgroup will show a warning. We don't need to handle any user interruption signal. Remove this. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm/memblock.c:memblock_double_array(): cosmetic cleanupsAndrew Morton1-17/+18
This function is an 80-column eyesore, quite unnecessarily. Clean that up, and use standard comment layout style. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Greg Pearson <greg.pearson@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm, oom: do not schedule if current has been killedDavid Rientjes1-6/+5
The oom killer currently schedules away from current in an uninterruptible sleep if it does not have access to memory reserves. It's possible that current was killed because it shares memory with the oom killed thread or because it was killed by the user in the interim, however. This patch only schedules away from current if it does not have a pending kill, i.e. if it does not share memory with the oom killed thread. It's possible that it will immediately retry its memory allocation and fail, but it will immediately be given access to memory reserves if it calls the oom killer again. This prevents the delay of memory freeing when threads that share memory with the oom killed thread get unnecessarily scheduled. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb/cgroup: remove exclude and wakeup rmdir calls from migrateAneesh Kumar K.V1-2/+4
We already hold the hugetlb_lock. That should prevent a parallel cgroup rmdir from touching page's hugetlb cgroup. So remove the exclude and wakeup calls. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb/cgroup: assign the page hugetlb cgroup when we move the page to ↵Aneesh Kumar K.V2-14/+13
active list. A page's hugetlb cgroup assignment and movement to the active list should occur with hugetlb_lock held. Otherwise when we remove the hugetlb cgroup we will iterate the active list and find pages with NULL hugetlb cgroup values. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb: move all the in use pages to active listAneesh Kumar K.V1-1/+10
When we fail to allocate pages from the reserve pool, hugetlb tries to allocate huge pages using alloc_buddy_huge_page. Add these to the active list. We also need to add the huge page we allocate when we soft offline the oldpage to active list. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb/cgroup: migrate hugetlb cgroup info from oldpage to new page during ↵Aneesh Kumar K.V2-0/+25
migration With HugeTLB pages, hugetlb cgroup is uncharged in compound page destructor. Since we are holding a hugepage reference, we can be sure that old page won't get uncharged till the last put_page(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb/cgroup: add hugetlb cgroup control filesAneesh Kumar K.V2-0/+137
Add the control files for hugetlb controller [akpm@linux-foundation.org: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g] [akpm@linux-foundation.org: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb/cgroup: add support for cgroup removalAneesh Kumar K.V1-2/+68
Add support for cgroup removal. If we don't have parent cgroup, the charges are moved to root cgroup. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb/cgroup: add charge/uncharge routines for hugetlb cgroupAneesh Kumar K.V2-1/+95
Add the charge and uncharge routines for hugetlb cgroup. We do cgroup charging in page alloc and uncharge in compound page destructor. Assigning page's hugetlb cgroup is protected by hugetlb_lock. [liwp@linux.vnet.ibm.com: add huge_page_order check to avoid incorrect uncharge] Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Wanpeng Li <liwp.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb/cgroup: add the cgroup pointer to page lruAneesh Kumar K.V1-0/+4
Add the hugetlb cgroup pointer to 3rd page lru.next. This limit the usage to hugetlb cgroup to only hugepages with 3 or more normal pages. I guess that is an acceptable limitation. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm/hugetlb: add new HugeTLB cgroupAneesh Kumar K.V2-0/+121
Implement a new controller that allows us to control HugeTLB allocations. The extension allows to limit the HugeTLB usage per control group and enforces the controller limit during page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit at page fault time implies that, the application will get SIGBUS signal if it tries to access HugeTLB pages beyond its limit. This requires the application to know beforehand how much HugeTLB pages it would require for its use. The charge/uncharge calls will be added to HugeTLB code in later patch. Support for cgroup removal will be added in later patches. [akpm@linux-foundation.org: s/CONFIG_CGROUP_HUGETLB_RES_CTLR/CONFIG_MEMCG_HUGETLB/g] [akpm@linux-foundation.org: s/CONFIG_MEMCG_HUGETLB/CONFIG_CGROUP_HUGETLB/g] Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb: make some static variables globalAneesh Kumar K.V1-5/+2
We will use them later in hugetlb_cgroup.c Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb: add a list for tracking in-use HugeTLB pagesAneesh Kumar K.V1-5/+7
hugepage_activelist will be used to track currently used HugeTLB pages. We need to find the in-use HugeTLB pages to support HugeTLB cgroup removal. On cgroup removal we update the page's HugeTLB cgroup to point to parent cgroup. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb: simplify migrate_huge_page()Aneesh Kumar K.V2-55/+25
Since we migrate only one hugepage, don't use linked list for passing the page around. Directly pass the page that need to be migrated as argument. This also removes the usage of page->lru in the migrate path. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <dhillf@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb: use mmu_gather instead of a temporary linked list for accumulating ↵Aneesh Kumar K.V2-26/+40
pages Use a mmu_gather instead of a temporary linked list for accumulating pages when we unmap a hugepage range Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb: add an inline helper for finding hstate indexAneesh Kumar K.V1-9/+11
Add an inline helper and use it in the code. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb: don't use ERR_PTR with VM_FAULT* valuesAneesh Kumar K.V1-5/+13
The current use of VM_FAULT_* codes with ERR_PTR requires us to ensure VM_FAULT_* values will not exceed MAX_ERRNO value. Decouple the VM_FAULT_* values from MAX_ERRNO. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01hugetlb: rename max_hstate to hugetlb_max_hstateAneesh Kumar K.V1-7/+7
This patchset implements a cgroup resource controller for HugeTLB pages. The controller allows to limit the HugeTLB usage per control group and enforces the controller limit during page fault. Since HugeTLB doesn't support page reclaim, enforcing the limit at page fault time implies that, the application will get SIGBUS signal if it tries to access HugeTLB pages beyond its limit. This requires the application to know beforehand how much HugeTLB pages it would require for its use. The goal is to control how many HugeTLB pages a group of task can allocate. It can be looked at as an extension of the existing quota interface which limits the number of HugeTLB pages per hugetlbfs superblock. HPC job scheduler requires jobs to specify their resource requirements in the job file. Once their requirements can be met, job schedulers like (SLURM) will schedule the job. We need to make sure that the jobs won't consume more resources than requested. If they do we should either error out or kill the application. This patch: Rename max_hstate to hugetlb_max_hstate. We will be using this from other subsystems like hugetlb controller in later patches. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hillf Danton <dhillf@gmail.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threadsWanpeng Li1-0/+20
Since per-BDI flusher threads were introduced in 2.6, the pdflush mechanism is not used any more. But the old interface exported through /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless. For back-compatibility, printk warning information and return 2 to notify the users that the interface is removed. Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm/buddy: cleanup on should_fail_alloc_pageGavin Shan1-7/+7
Currently, function should_fail() has "bool" for its return value, so it's reasonable to change the return value of function should_fail_alloc_page() into "bool" as well. The patch does cleanup on function should_fail_alloc_page() to have "bool" for its return value. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm: account the total_vm in the vm_stat_account()Huang Shijie2-5/+2
vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now. But we can also account for total_vm in the vm_stat_account() which makes the code tidy. Even for mprotect_fixup(), we can get the right result in the end. Signed-off-by: Huang Shijie <shijie8@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01swap: allow swap readahead to be mergedChristian Ehrhardt1-0/+5
Swap readahead works fine, but the I/O to disk is almost always done in page size requests, despite the fact that readahead submits 1<<page-cluster pages at a time. On older kernels the old per device plugging behavior might have captured this and merged the requests, but currently all comes down to much more I/Os than required. On a single device this might not be an issue, but as soon as a server runs on shared san resources savin I/Os not only improves swapin throughput but also provides a lower resource utilization. With a load running KVM in a lot of memory overcommitment (the hot memory is 1.5 times the host memory) swapping throughput improves significantly and the lead feels more responsive as well as achieves more throughput. In a test setup with 16 swap disks running blocktrace on one of those disks shows the improved merging: Prior: Reads Queued: 560,888, 2,243MiB Writes Queued: 226,242, 904,968KiB Read Dispatches: 544,701, 2,243MiB Write Dispatches: 159,318, 904,968KiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 544,716, 2,243MiB Writes Completed: 159,321, 904,980KiB Read Merges: 16,187, 64,748KiB Write Merges: 61,744, 246,976KiB IO unplugs: 149,614 Timer unplugs: 2,940 With the patch: Reads Queued: 734,315, 2,937MiB Writes Queued: 300,188, 1,200MiB Read Dispatches: 214,972, 2,937MiB Write Dispatches: 215,176, 1,200MiB Reads Requeued: 0 Writes Requeued: 0 Reads Completed: 214,971, 2,937MiB Writes Completed: 215,177, 1,200MiB Read Merges: 519,343, 2,077MiB Write Merges: 73,325, 293,300KiB IO unplugs: 337,130 Timer unplugs: 11,184 I got ~10% to ~40% more throughput in my cases and at the same time much lower cpu consumption when broken down per transferred kilobyte (the majority of that due to saved interrupts and better cache handling). In a shared SAN others might get an additional benefit as well, because this now causes less protocol overhead. Signed-off-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01memcg: remove MEM_CGROUP_CHARGE_TYPE_FORCEKamezawa Hiroyuki1-1/+0
There are no users since commit b24028572fb69 ("memcg: remove PCG_CACHE"). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01memcg: rename MEM_CGROUP_CHARGE_TYPE_MAPPED as MEM_CGROUP_CHARGE_TYPE_ANONKamezawa Hiroyuki1-8/+8
Now, in memcg, 2 "MAPPED" enum/macro are found MEM_CGROUP_CHARGE_TYPE_MAPPED MEM_CGROUP_STAT_FILE_MAPPED Thier names looks similar to each other but the former is used for accounting anonymous memory. rename it as TYPE_ANON. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01memcg: rename MEM_CGROUP_STAT_SWAPOUT as MEM_CGROUP_STAT_SWAPKamezawa Hiroyuki1-5/+5
MEM_CGROUP_STAT_SWAPOUT represents the usage of swap rather than the number of swap-out events. Rename it to be MEM_CGROUP_STAT_SWAP. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm: make vb_alloc() more foolproofJan Kara1-0/+8
If someone calls vb_alloc() (or vm_map_ram() for that matter) to allocate 0 bytes (0 pages), get_order() returns BITS_PER_LONG - PAGE_CACHE_SHIFT and interesting stuff happens. So make debugging such problems easier and warn about 0-size allocation. [akpm@linux-foundation.org: use WARN_ON-return-value feature] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01vmalloc: walk vmap_areas by sorted list instead of rb_next()Hong zhi guo1-4/+4
There's a walk by repeating rb_next to find a suitable hole. Could be simply replaced by walk on the sorted vmap_area_list. More simpler and efficient. Mutation of the list and tree only happens in pair within __insert_vmap_area and __free_vmap_area, under protection of vmap_area_lock. The patch code is also under vmap_area_lock, so the list walk is safe, and consistent with the tree walk. Tested on SMP by repeating batch of vmalloc anf vfree for random sizes and rounds for hours. Signed-off-by: Hong Zhiguo <honkiko@gmail.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31Merge tag 'writeback-proportions' of ↵Linus Torvalds2-44/+69
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback updates from Wu Fengguang: "Use time based periods to age the writeback proportions, which can adapt equally well to fast/slow devices." Fix up trivial conflict in comment in fs/sync.c * tag 'writeback-proportions' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Fix some comment errors block: Convert BDI proportion calculations to flexible proportions lib: Fix possible deadlock in flexible proportion code lib: Proportions with flexible period
2012-07-31Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds1-3/+3
Merge Andrew's first set of patches: "Non-MM patches: - lots of misc bits - tree-wide have_clk() cleanups - quite a lot of printk tweaks. I draw your attention to "printk: convert the format for KERN_<LEVEL> to a 2 byte pattern" which looks a bit scary. But afaict it's solid. - backlight updates - lib/ feature work (notably the addition and use of memweight()) - checkpatch updates - rtc updates - nilfs updates - fatfs updates (partial, still waiting for acks) - kdump, proc, fork, IPC, sysctl, taskstats, pps, etc - new fault-injection feature work" * Merge emailed patches from Andrew Morton <akpm@linux-foundation.org>: (128 commits) drivers/misc/lkdtm.c: fix missing allocation failure check lib/scatterlist: do not re-write gfp_flags in __sg_alloc_table() fault-injection: add tool to run command with failslab or fail_page_alloc fault-injection: add selftests for cpu and memory hotplug powerpc: pSeries reconfig notifier error injection module memory: memory notifier error injection module PM: PM notifier error injection module cpu: rewrite cpu-notifier-error-inject module fault-injection: notifier error injection c/r: fcntl: add F_GETOWNER_UIDS option resource: make sure requested range is included in the root range include/linux/aio.h: cpp->C conversions fs: cachefiles: add support for large files in filesystem caching pps: return PTR_ERR on error in device_create taskstats: check nla_reserve() return sysctl: suppress kmemleak messages ipc: use Kconfig options for __ARCH_WANT_[COMPAT_]IPC_PARSE_VERSION ipc: compat: use signed size_t types for msgsnd and msgrcv ipc: allow compat IPC version field parsing if !ARCH_WANT_OLD_COMPAT_IPC ipc: add COMPAT_SHMLBA support ...
2012-07-31mm: fix wrong argument of migrate_huge_pages() in soft_offline_huge_page()Joonsoo Kim1-3/+3
Commit a6bc32b89922 ("mm: compaction: introduce sync-light migration for use by compaction") changed the declaration of migrate_pages() and migrate_huge_pages(). But it missed changing the argument of migrate_huge_pages() in soft_offline_huge_page(). In this case, we should call migrate_huge_pages() with MIGRATE_SYNC. Additionally, there is a mismatch between type the of argument and the function declaration for migrate_pages(). Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-30Merge branch 'slab/next' of ↵Linus Torvalds7-599/+559
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB changes from Pekka Enberg: "Most of the changes included are from Christoph Lameter's "common slab" patch series that unifies common parts of SLUB, SLAB, and SLOB allocators. The unification is needed for Glauber Costa's "kmem memcg" work that will hopefully appear for v3.7. The rest of the changes are fixes and speedups by various people." * 'slab/next' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: (32 commits) mm: Fix build warning in kmem_cache_create() slob: Fix early boot kernel crash mm, slub: ensure irqs are enabled for kmemcheck mm, sl[aou]b: Move kmem_cache_create mutex handling to common code mm, sl[aou]b: Use a common mutex definition mm, sl[aou]b: Common definition for boot state of the slab allocators mm, sl[aou]b: Extract common code for kmem_cache_create() slub: remove invalid reference to list iterator variable mm: Fix signal SIGFPE in slabinfo.c. slab: move FULL state transition to an initcall slab: Fix a typo in commit 8c138b "slab: Get rid of obj_size macro" mm, slab: Build fix for recent kmem_cache changes slab: rename gfpflags to allocflags slub: refactoring unfreeze_partials() slub: use __cmpxchg_double_slab() at interrupt disabled place slab/mempolicy: always use local policy from interrupt context slab: Get rid of obj_size macro mm, sl[aou]b: Extract common fields from struct kmem_cache slab: Remove some accessors slab: Use page struct fields instead of casting ...
2012-07-30Merge branch 'for-linus-for-3.6-rc1' of ↵Linus Torvalds1-10/+18
git://git.linaro.org/people/mszyprowski/linux-dma-mapping Pull DMA-mapping updates from Marek Szyprowski: "Those patches are continuation of my earlier work. They contains extensions to DMA-mapping framework to remove limitation of the current ARM implementation (like limited total size of DMA coherent/write combine buffers), improve performance of buffer sharing between devices (attributes to skip cpu cache operations or creation of additional kernel mapping for some specific use cases) as well as some unification of the common code for dma_mmap_attrs() and dma_mmap_coherent() functions. All extensions have been implemented and tested for ARM architecture." * 'for-linus-for-3.6-rc1' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: ARM: dma-mapping: add support for DMA_ATTR_SKIP_CPU_SYNC attribute common: DMA-mapping: add DMA_ATTR_SKIP_CPU_SYNC attribute ARM: dma-mapping: add support for dma_get_sgtable() common: dma-mapping: introduce dma_get_sgtable() function ARM: dma-mapping: add support for DMA_ATTR_NO_KERNEL_MAPPING attribute common: DMA-mapping: add DMA_ATTR_NO_KERNEL_MAPPING attribute common: dma-mapping: add support for generic dma_mmap_* calls ARM: dma-mapping: fix error path for memory allocation failure ARM: dma-mapping: add more sanity checks in arm_dma_mmap() ARM: dma-mapping: remove custom consistent dma region mm: vmalloc: use const void * for caller argument scatterlist: add sg_alloc_table_from_pages function
2012-07-30ARM: dma-mapping: remove custom consistent dma regionMarek Szyprowski1-1/+9
This patch changes dma-mapping subsystem to use generic vmalloc areas for all consistent dma allocations. This increases the total size limit of the consistent allocations and removes platform hacks and a lot of duplicated code. Atomic allocations are served from special pool preallocated on boot, because vmalloc areas cannot be reliably created in atomic context. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com> Reviewed-by: Minchan Kim <minchan@kernel.org>
2012-07-30mm: vmalloc: use const void * for caller argumentMarek Szyprowski1-9/+9
'const void *' is a safer type for caller function type. This patch updates all references to caller function type. Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com> Reviewed-by: Minchan Kim <minchan@kernel.org>
2012-07-30mm: Fix build warning in kmem_cache_create()Shuah Khan1-0/+2
The label oops is used in CONFIG_DEBUG_VM ifdef block and is defined outside ifdef CONFIG_DEBUG_VM block. This results in the following build warning when built with CONFIG_DEBUG_VM disabled. Fix to move label oops definition to inside a CONFIG_DEBUG_VM block. mm/slab_common.c: In function ‘kmem_cache_create’: mm/slab_common.c:101:1: warning: label ‘oops’ defined but not used [-Wunused-label] Signed-off-by: Shuah Khan <shuah.khan@hp.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-27Merge branch 'kmap_atomic' of git://github.com/congwang/linuxLinus Torvalds1-6/+2
Pull final kmap_atomic cleanups from Cong Wang: "This should be the final round of cleanup, as the definitions of enum km_type finally get removed from the whole tree. The patches have been in linux-next for a long time." * 'kmap_atomic' of git://github.com/congwang/linux: pipe: remove KM_USER0 from comments vmalloc: remove KM_USER0 from comments feature-removal-schedule.txt: remove kmap_atomic(page, km_type) tile: remove km_type definitions um: remove km_type definitions asm-generic: remove km_type definitions avr32: remove km_type definitions frv: remove km_type definitions powerpc: remove km_type definitions arm: remove km_type definitions highmem: remove the deprecated form of kmap_atomic tile: remove usage of enum km_type frv: remove the second parameter of kmap_atomic_primary() jbd2: remove the second argument of kmap_atomic
2012-07-27Merge branch 'x86-mm-for-linus' of ↵Linus Torvalds1-0/+9
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/mm changes from Peter Anvin: "The big change here is the patchset by Alex Shi to use INVLPG to flush only the affected pages when we only need to flush a small page range. It also removes the special INVALIDATE_TLB_VECTOR interrupts (32 vectors!) and replace it with an ordinary IPI function call." Fix up trivial conflicts in arch/x86/include/asm/apic.h (added code next to changed line) * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/tlb: Fix build warning and crash when building for !SMP x86/tlb: do flush_tlb_kernel_range by 'invlpg' x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR x86/tlb: enable tlb flush range support for x86 mm/mmu_gather: enable tlb flush range in generic mmu_gather x86/tlb: add tlb_flushall_shift knob into debugfs x86/tlb: add tlb_flushall_shift for specific CPU x86/tlb: fall back to flush all when meet a THP large page x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range x86/tlb_info: get last level TLB entry number of CPU x86: Add read_mostly declaration/definition to variables from smp.h x86: Define early read-mostly per-cpu macros
2012-07-25Merge branch 'for-linus' of ↵Linus Torvalds1-2/+3
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "Trivial updates all over the place as usual." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits) Fix typo in include/linux/clk.h . pci: hotplug: Fix typo in pci iommu: Fix typo in iommu video: Fix typo in drivers/video Documentation: Add newline at end-of-file to files lacking one arm,unicore32: Remove obsolete "select MISC_DEVICES" module.c: spelling s/postition/position/g cpufreq: Fix typo in cpufreq driver trivial: typo in comment in mksysmap mach-omap2: Fix typo in debug message and comment scsi: aha152x: Fix sparse warning and make printing pointer address more portable. Change email address for Steve Glendinning Btrfs: fix typo in convert_extent_bit via: Remove bogus if check netprio_cgroup.c: fix comment typo backlight: fix memory leak on obscure error path Documentation: asus-laptop.txt references an obsolete Kconfig item Documentation: ManagementStyle: fixed typo mm/vmscan: cleanup comment error in balance_pgdat mm: cleanup on the comments of zone_reclaim_stat ...
2012-07-25Merge tag 'stable/for-linus-3.6-rc0-tag' of ↵Linus Torvalds1-60/+90
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm Pull frontswap updates from Konrad Rzeszutek Wilk: "Cleanups in code and documentation. Little bit of refactoring for cleaner look." * tag 'stable/for-linus-3.6-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm: mm/frontswap: cleanup doc and comment error mm: frontswap: remove unneeded headers mm: frontswap: split out function to clear a page out mm: frontswap: remove unnecessary check during initialization mm: frontswap: make all branches of if statement in put page consistent mm: frontswap: split frontswap_shrink further to simplify locking mm: frontswap: split out __frontswap_unuse_pages mm: frontswap: split out __frontswap_curr_pages mm: frontswap: trivial coding convention issues mm: frontswap: remove casting from function calls through ops structure
2012-07-24vmalloc: remove KM_USER0 from commentsCong Wang1-6/+2
Signed-off-by: Cong Wang <amwang@redhat.com>
2012-07-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tileLinus Torvalds1-3/+5
Pull arch/tile updates from Chris Metcalf: "These changes provide support for PCIe root complex and USB host mode for tilegx's on-chip I/Os. In addition, this pull provides the required underpinning for the on-chip networking support that was pulled into 3.5. The changes have all been through LKML (with several rounds for PCIe RC) and on linux-next." * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: tile: updates to pci root complex from community feedback bounce: allow use of bounce pool via config option usb: add host support for the tilegx architecture arch/tile: provide kernel support for the tilegx USB shim tile pci: enable IOMMU to support DMA for legacy devices arch/tile: enable ZONE_DMA for tilegx tilegx pci: support I/O to arbitrarily-cached pages tile: remove unused header arch/tile: tilegx PCI root complex support arch/tile: provide kernel support for the tilegx TRIO shim arch/tile: break out the "csum a long" function to <asm/checksum.h> arch/tile: provide kernel support for the tilegx mPIPE shim arch/tile: common DMA code for the GXIO IORPC subsystem arch/tile: support MMIO-based readb/writeb etc. arch/tile: introduce GXIO IORPC framework for tilegx
2012-07-23Merge branch 'for-linus-2' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull the big VFS changes from Al Viro: "This one is *big* and changes quite a few things around VFS. What's in there: - the first of two really major architecture changes - death to open intents. The former is finally there; it was very long in making, but with Miklos getting through really hard and messy final push in fs/namei.c, we finally have it. Unlike his variant, this one doesn't introduce struct opendata; what we have instead is ->atomic_open() taking preallocated struct file * and passing everything via its fields. Instead of returning struct file *, it returns -E... on error, 0 on success and 1 in "deal with it yourself" case (e.g. symlink found on server, etc.). See comments before fs/namei.c:atomic_open(). That made a lot of goodies finally possible and quite a few are in that pile: ->lookup(), ->d_revalidate() and ->create() do not get struct nameidata * anymore; ->lookup() and ->d_revalidate() get lookup flags instead, ->create() gets "do we want it exclusive" flag. With the introduction of new helper (kern_path_locked()) we are rid of all struct nameidata instances outside of fs/namei.c; it's still visible in namei.h, but not for long. Come the next cycle, declaration will move either to fs/internal.h or to fs/namei.c itself. [me, miklos, hch] - The second major change: behaviour of final fput(). Now we have __fput() done without any locks held by caller *and* not from deep in call stack. That obviously lifts a lot of constraints on the locking in there. Moreover, it's legal now to call fput() from atomic contexts (which has immediately simplified life for aio.c). We also don't need anti-recursion logics in __scm_destroy() anymore. There is a price, though - the damn thing has become partially asynchronous. For fput() from normal process we are guaranteed that pending __fput() will be done before the caller returns to userland, exits or gets stopped for ptrace. For kernel threads and atomic contexts it's done via schedule_work(), so theoretically we might need a way to make sure it's finished; so far only one such place had been found, but there might be more. There's flush_delayed_fput() (do all pending __fput()) and there's __fput_sync() (fput() analog doing __fput() immediately). I hope we won't need them often; see warnings in fs/file_table.c for details. [me, based on task_work series from Oleg merged last cycle] - sync series from Jan - large part of "death to sync_supers()" work from Artem; the only bits missing here are exofs and ext4 ones. As far as I understand, those are going via the exofs and ext4 trees resp.; once they are in, we can put ->write_super() to the rest, along with the thread calling it. - preparatory bits from unionmount series (from dhowells). - assorted cleanups and fixes all over the place, as usual. This is not the last pile for this cycle; there's at least jlayton's ESTALE work and fsfreeze series (the latter - in dire need of fixes, so I'm not sure it'll make the cut this cycle). I'll probably throw symlink/hardlink restrictions stuff from Kees into the next pile, too. Plus there's a lot of misc patches I hadn't thrown into that one - it's large enough as it is..." * 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits) ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file() btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file() switch dentry_open() to struct path, make it grab references itself spufs: shift dget/mntget towards dentry_open() zoran: don't bother with struct file * in zoran_map ecryptfs: don't reinvent the wheels, please - use struct completion don't expose I_NEW inodes via dentry->d_inode tidy up namei.c a bit unobfuscate follow_up() a bit ext3: pass custom EOF to generic_file_llseek_size() ext4: use core vfs llseek code for dir seeks vfs: allow custom EOF in generic_file_llseek code vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes vfs: Remove unnecessary flushing of block devices vfs: Make sys_sync writeout also block device inodes vfs: Create function for iterating over block devices vfs: Reorder operations during sys_sync quota: Move quota syncing to ->sync_fs method quota: Split dquot_quota_sync() to writeback and cache flushing part vfs: Move noop_backing_dev_info check from sync into writeback ...
2012-07-23mm/frontswap: cleanup doc and comment errorWanpeng Li1-1/+1
Signed-off-by: Wanpeng Li <liwp.linux@gmail.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-23mm: frontswap: remove unneeded headersSasha Levin1-4/+0
Signed-off-by: Sasha Levin <levinsasha928@gmail.com> [v1: Rebased with tracing removed] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2012-07-23Merge branch 'x86-mce-for-linus' of ↵Linus Torvalds1-6/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/mce changes from Ingo Molnar: "This tree improves the AMD thresholding bank code and includes a memory fault signal handling fixlet." * 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults x86, MCE, AMD: Update copyrights and boilerplate x86, MCE, AMD: Give proper names to the thresholding banks x86, MCE, AMD: Make error_count read only x86, MCE, AMD: Cleanup reading of error_count x86, MCE, AMD: Print decimal thresholding values x86, MCE, AMD: Move shared bank to node descriptor x86, MCE, AMD: Remove local_allocate_... wrapper x86, MCE, AMD: Remove shared banks sysfs linking x86, amd_nb: Export model 0x10 and later PCI id