summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2007-10-16Generic Virtual Memmap support for SPARSEMEMChristoph Lameter3-4/+199
SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all the arches. It would be great if it could be the default so that we can get rid of various forms of DISCONTIG and other variations on memory maps. So far what has hindered this are the additional lookups that SPARSEMEM introduces for virt_to_page and page_address. This goes so far that the code to do this has to be kept in a separate function and cannot be used inline. This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap is mapped into a virtually contigious area, only the active sections are physically backed. This allows virt_to_page page_address and cohorts become simple shift/add operations. No page flag fields, no table lookups, nothing involving memory is required. The two key operations pfn_to_page and page_to_page become: #define __pfn_to_page(pfn) (vmemmap + (pfn)) #define __page_to_pfn(page) ((page) - vmemmap) By having a virtual mapping for the memmap we allow simple access without wasting physical memory. As kernel memory is typically already mapped 1:1 this introduces no additional overhead. The virtual mapping must be big enough to allow a struct page to be allocated and mapped for all valid physical pages. This vill make a virtual memmap difficult to use on 32 bit platforms that support 36 address bits. However, if there is enough virtual space available and the arch already maps its 1-1 kernel space using TLBs (f.e. true of IA64 and x86_64) then this technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM. FLATMEM needs to read the contents of the mem_map variable to get the start of the memmap and then add the offset to the required entry. vmemmap is a constant to which we can simply add the offset. This patch has the potential to allow us to make SPARSMEM the default (and even the only) option for most systems. It should be optimal on UP, SMP and NUMA on most platforms. Then we may even be able to remove the other memory models: FLATMEM, DISCONTIG etc. [apw@shadowen.org: config cleanups, resplit code etc] [kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init] [apw@shadowen.org: vmemmap: remove excess debugging] [apw@shadowen.org: simplify initialisation code and reduce duplication] [apw@shadowen.org: pull out the vmemmap code into its own file] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16sparsemem: record when a section has a valid mem_mapAndy Whitcroft1-4/+5
We have flags to indicate whether a section actually has a valid mem_map associated with it. This is never set and we rely solely on the present bit to indicate a section is valid. By definition a section is not valid if it has no mem_map and there is a window during init where the present bit is set but there is no mem_map, during which pfn_valid() will return true incorrectly. Use the existing SECTION_HAS_MEM_MAP flag to indicate the presence of a valid mem_map. Switch valid_section{,_nr} and pfn_valid() to this bit. Add a new present_section{,_nr} and pfn_present() interfaces for those users who care to know that a section is going to be valid. [akpm@linux-foundation.org: coding-syle fixes] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <clameter@sgi.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16sparsemem: clean up spelling error in commentsAndy Whitcroft1-1/+1
SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all the arches. It would be great if it could be the default so that we can get rid of various forms of DISCONTIG and other variations on memory maps. So far what has hindered this are the additional lookups that SPARSEMEM introduces for virt_to_page and page_address. This goes so far that the code to do this has to be kept in a separate function and cannot be used inline. This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap is mapped into a virtually contigious area, only the active sections are physically backed. This allows virt_to_page page_address and cohorts become simple shift/add operations. No page flag fields, no table lookups, nothing involving memory is required. The two key operations pfn_to_page and page_to_page become: #define __pfn_to_page(pfn) (vmemmap + (pfn)) #define __page_to_pfn(page) ((page) - vmemmap) By having a virtual mapping for the memmap we allow simple access without wasting physical memory. As kernel memory is typically already mapped 1:1 this introduces no additional overhead. The virtual mapping must be big enough to allow a struct page to be allocated and mapped for all valid physical pages. This vill make a virtual memmap difficult to use on 32 bit platforms that support 36 address bits. However, if there is enough virtual space available and the arch already maps its 1-1 kernel space using TLBs (f.e. true of IA64 and x86_64) then this technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM. FLATMEM needs to read the contents of the mem_map variable to get the start of the memmap and then add the offset to the required entry. vmemmap is a constant to which we can simply add the offset. This patch has the potential to allow us to make SPARSMEM the default (and even the only) option for most systems. It should be optimal on UP, SMP and NUMA on most platforms. Then we may even be able to remove the other memory models: FLATMEM, DISCONTIG etc. The current aim is to bring a common virtually mapped mem_map to all architectures. This should facilitate the removal of the bespoke implementations from the architectures. This also brings performance improvements for most architecture making sparsmem vmemmap the more desirable memory model. The ultimate aim of this work is to expand sparsemem support to encompass all the features of the other memory models. This could allow us to drop support for and remove the other models in the longer term. Below are some comparitive kernbench numbers for various architectures, comparing default memory model against SPARSEMEM VMEMMAP. All but ia64 show marginal improvement; we expect the ia64 figures to be sorted out when the larger mapping support returns. x86-64 non-NUMA Base VMEMAP % change (-ve good) User 85.07 84.84 -0.26 System 34.32 33.84 -1.39 Total 119.38 118.68 -0.59 ia64 Base VMEMAP % change (-ve good) User 1016.41 1016.93 0.05 System 50.83 51.02 0.36 Total 1067.25 1067.95 0.07 x86-64 NUMA Base VMEMAP % change (-ve good) User 30.77 431.73 0.22 System 45.39 43.98 -3.11 Total 476.17 475.71 -0.10 ppc64 Base VMEMAP % change (-ve good) User 488.77 488.35 -0.09 System 56.92 56.37 -0.97 Total 545.69 544.72 -0.18 Below are some AIM bencharks on IA64 and x86-64 (thank Bob). The seems pretty much flat as you would expect. ia64 results 2 cpu non-numa 4Gb SCSI disk Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" extreme Jun 1 07:17:24 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 98.9 100 58.9 1.3 1.6482 101 5547.1 95 106.0 79.4 0.9154 201 6377.7 95 183.4 158.3 0.5288 301 6932.2 95 252.7 237.3 0.3838 401 7075.8 93 329.8 316.7 0.2941 501 7235.6 94 403.0 396.2 0.2407 600 7387.5 94 472.7 475.0 0.2052 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" vmemmap Jun 1 09:59:04 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 99.1 100 58.8 1.2 1.6509 101 5480.9 95 107.2 79.2 0.9044 201 6490.3 95 180.2 157.8 0.5382 301 6886.6 94 254.4 236.8 0.3813 401 7078.2 94 329.7 316.0 0.2942 501 7250.3 95 402.2 395.4 0.2412 600 7399.1 94 471.9 473.9 0.2055 open power 710 2 cpu, 4 Gb, SCSI and configured physically Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" extreme May 29 15:42:53 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 25.7 100 226.3 4.3 0.4286 101 1096.0 97 536.4 199.8 0.1809 201 1236.4 96 946.1 389.1 0.1025 301 1280.5 96 1368.0 582.3 0.0709 401 1270.2 95 1837.4 771.0 0.0528 501 1251.4 96 2330.1 955.9 0.0416 601 1252.6 96 2792.4 1139.2 0.0347 701 1245.2 96 3276.5 1334.6 0.0296 918 1229.5 96 4345.4 1728.7 0.0223 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" vmemmap May 30 07:28:26 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 25.6 100 226.9 4.3 0.4275 101 1049.3 97 560.2 198.1 0.1731 201 1199.1 97 975.6 390.7 0.0994 301 1261.7 96 1388.5 591.5 0.0699 401 1256.1 96 1858.1 771.9 0.0522 501 1220.1 96 2389.7 955.3 0.0406 601 1224.6 96 2856.3 1133.4 0.0340 701 1252.0 96 3258.7 1314.1 0.0298 915 1232.8 96 4319.7 1704.0 0.0225 amd64 2 2-core, 4Gb and SATA Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" extreme Jun 2 03:59:48 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 13.0 100 446.4 2.1 0.2173 101 533.4 97 1102.0 110.2 0.0880 201 578.3 97 2022.8 220.8 0.0480 301 583.8 97 3000.6 332.3 0.0323 401 580.5 97 4020.1 442.2 0.0241 501 574.8 98 5072.8 558.8 0.0191 600 566.5 98 6163.8 671.0 0.0157 Benchmark Version Machine Run Date AIM Multiuser Benchmark - Suite VII "1.1" vmemmap Jun 3 04:19:31 2007 Tasks Jobs/Min JTI Real CPU Jobs/sec/task 1 13.0 100 447.8 2.0 0.2166 101 536.5 97 1095.6 109.7 0.0885 201 567.7 97 2060.5 219.3 0.0471 301 582.1 96 3009.4 330.2 0.0322 401 578.2 96 4036.4 442.4 0.0240 501 585.1 98 4983.2 555.1 0.0195 600 565.5 98 6175.2 660.6 0.0157 This patch: Fix some spelling errors. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-14mm/migrate.c __user annotationAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-10Drop 'size' argument from bio_endio and bi_end_ioNeilBrown2-30/+7
As bi_end_io is only called once when the reqeust is complete, the 'size' argument is now redundant. Remove it. Now there is no need for bio_endio to subtract the size completed from bi_size. So don't do that either. While we are at it, change bi_end_io to return void. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-10Fix warnings with !CONFIG_BLOCKJens Axboe1-0/+1
Hide everything in blkdev.h with CONFIG_BLOCK isn't set, and fixup the (few) files that fail to build because they were relying on blkdev.h pulling in extra includes for them. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-10-08fix page release issue in filemap_faultYan Zheng1-0/+1
find_lock_page increases page's usage count, we should decrease it before return VM_FAULT_SIGBUS Signed-off-by: Yan Zheng<yanzheng@21cn.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-08fix VM_CAN_NONLINEAR check in sys_remap_file_pagesYan Zheng1-1/+1
The test for VM_CAN_NONLINEAR always fails Signed-off-by: Yan Zheng<yanzheng@21cn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-08mm: set_page_dirty_balance() vs ->page_mkwrite()Peter Zijlstra2-4/+9
All the current page_mkwrite() implementations also set the page dirty. Which results in the set_page_dirty_balance() call to _not_ call balance, because the page is already found dirty. This allows us to dirty a _lot_ of pages without ever hitting balance_dirty_pages(). Not good (tm). Force a balance call if ->page_mkwrite() was successful. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-06xen: disable split pte locks for nowJeremy Fitzhardinge1-0/+1
When pinning and unpinning pagetables, we must protect them against being used by other CPUs, lest they see the pagetable in an intermediate read-only-but-not-pinned state. When using split pte locks, doing this properly would require taking all the pte locks for the pagetable while pinning, but this may overflow the PREEMPT_BITS part of the preempt counter if the process has mapped more than about 512M of memory. However, failing to take the pte locks causes write-protect faults when the pageout code is trying to clear the Access bit on a pte which is part of a freshy created and still being pinned process after fork. This is a short-term fix until the problem is solved properly. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andi Kleen <ak@suse.de> Cc: Keir Fraser <keir@xensource.com> Cc: Jan Beulich <jbeulich@novell.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-04Fix sys_remap_file_pages BUG at highmem.c:15!Hugh Dickins1-8/+6
Gurudas Pai reports kernel BUG at arch/i386/mm/highmem.c:15! below sys_remap_file_pages, while running Oracle database test on x86 in 6GB RAM: kunmap thinks we're in_interrupt because the preempt count has wrapped. That's because __do_fault expected to unmap page_table, but one of its two callers do_nonlinear_fault already unmapped it: let do_linear_fault unmap it first too, and then there's no need to pass the page_table arg down. Why have we been so slow to notice this? Probably through forgetting that the mapping_cap_account_dirty test means that sys_remap_file_pages nowadays only goes the full nonlinear vma route on a few memory-backed filesystems like ramfs, tmpfs and hugetlbfs. [ It also depends on CONFIG_HIGHPTE, so it becomes even harder to trigger in practice. Many who have need of large memory have probably migrated to x86-64.. Problem introduced by commit d0217ac04ca6591841e5665f518e38064f4e65bd ("mm: fault feedback #1") -- Linus ] Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: gurudas pai <gurudas.pai@oracle.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-01hugetlb: fix clear_user_highpage argumentsRalf Baechle1-1/+1
The virtual address space argument of clear_user_highpage is supposed to be the virtual address where the page being cleared will eventually be mapped. This allows architectures with virtually indexed caches a few clever tricks. That sort of trick falls over in painful ways if the virtual address argument is wrong. Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-19Fix NUMA Memory Policy Reference CountingLee Schermerhorn2-10/+73
This patch proposes fixes to the reference counting of memory policy in the page allocation paths and in show_numa_map(). Extracted from my "Memory Policy Cleanups and Enhancements" series as stand-alone. Shared policy lookup [shmem] has always added a reference to the policy, but this was never unrefed after page allocation or after formatting the numa map data. Default system policy should not require additional ref counting, nor should the current task's task policy. However, show_numa_map() calls get_vma_policy() to examine what may be [likely is] another task's policy. The latter case needs protection against freeing of the policy. This patch adds a reference count to a mempolicy returned by get_vma_policy() when the policy is a vma policy or another task's mempolicy. Again, shared policy is already reference counted on lookup. A matching "unref" [__mpol_free()] is performed in alloc_page_vma() for shared and vma policies, and in show_numa_map() for shared and another task's mempolicy. We can call __mpol_free() directly, saving an admittedly inexpensive inline NULL test, because we know we have a non-NULL policy. Handling policy ref counts for hugepages is a bit trickier. huge_zonelist() returns a zone list that might come from a shared or vma 'BIND policy. In this case, we should hold the reference until after the huge page allocation in dequeue_hugepage(). The patch modifies huge_zonelist() to return a pointer to the mempolicy if it needs to be unref'd after allocation. Kernel Build [16cpu, 32GB, ia64] - average of 10 runs: w/o patch w/ refcount patch Avg Std Devn Avg Std Devn Real: 100.59 0.38 100.63 0.43 User: 1209.60 0.37 1209.91 0.31 System: 81.52 0.42 81.64 0.34 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-12SLUB: accurately compare debug flags during slab cache mergeChristoph Lameter1-15/+23
This was posted on Aug 28 and fixes an issue that could cause troubles when slab caches >=128k are created. http://marc.info/?l=linux-mm&m=118798149918424&w=2 Currently we simply add the debug flags unconditional when checking for a matching slab. This creates issues for sysfs processing when slabs exist that are exempt from debugging due to their huge size or because only a subset of slabs was selected for debugging. We need to only add the flags if kmem_cache_open() would also add them. Create a function to calculate the flags that would be set if the cache would be opened and use that function to determine the flags before looking for a compatible slab. [akpm@linux-foundation.org: fixlets] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Chuck Ebbert <cebbert@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-31Page migration: Do not accept invalid nodes in the target nodesetChristoph Lameter1-0/+5
Page migration currently does not check if the target of the move contains nodes that that are invalid (if root attempts to migrate pages) and may try to allocate from invalid nodes if these are specified leading to oopses. Return -EINVAL if an offline node is specified. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-31slub: do not fail if we cannot register a slab with sysfsChristoph Lameter1-2/+6
Do not BUG() if we cannot register a slab with sysfs. Just print an error. The only consequence of not registering is that the slab cache is not visible via /sys/slab. A BUG() may not be visible that early during boot and we have had multiple issues here already. Signed-off-by: Christoph Lameter <clameter@sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-31fix rcu_read_lock() in page migratonKAMEZAWA Hiroyuki1-2/+9
In migration fallback path, write_page() or lock_page() will be called. This causes sleep with holding rcu_read_lock(). For avoding that, just do rcu_lock if the page is Anon.(this is enough.) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-31process_zones(): fix recovery codeAndrew Morton1-0/+2
Don't try to free memory which we didn't allocate. Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-23Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLEMel Gorman2-1/+14
The NUMA layer only supports NUMA policies for the highest zone. When ZONE_MOVABLE is configured with kernelcore=, the the highest zone becomes ZONE_MOVABLE. The result is that policies are only applied to allocations like anonymous pages and page cache allocated from ZONE_MOVABLE when the zone is used. This patch applies policies to the two highest zones when the highest zone is ZONE_MOVABLE. As ZONE_MOVABLE consists of pages from the highest "real" zone, it's always functionally equivalent. The patch has been tested on a variety of machines both NUMA and non-NUMA covering x86, x86_64 and ppc64. No abnormal results were seen in kernbench, tbench, dbench or hackbench. It passes regression tests from the numactl package with and without kernelcore= once numactl tests are patched to wait for vmstat counters to update. akpm: this is the nasty hack to fix NUMA mempolicies in the presence of ZONE_MOVABLE and kernelcore= in 2.6.23. Christoph says "For .24 either merge the mobility or get the other solution that Mel is working on. That solution would only use a single zonelist per node and filter on the fly. That may help performance and also help to make memory policies work better." Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-23SLUB: do not fail on broken memory configurationsChristoph Lameter1-1/+8
Print a big fat warning and do what is necessary to continue if a node is marked as up (meaning either node is online (upstream) or node has memory (Andrew's tree)) but allocations from the node do not succeed. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-23SLUB: use atomic_long_read for atomic_long variablesChristoph Lameter1-3/+3
SLUB is using atomic_read() for variables declared atomic_long_t. Switch to atomic_long_read(). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-23Fix VM_FAULT flags conversion for hugetlbAdam Litke1-1/+1
It seems a simple mistake was made when converting follow_hugetlb_page() over to the VM_FAULT flags bitmasks (in "mm: fault feedback #2", commit 83c54070ee1a2d05c89793884bea1a03f2851ed4). By using the wrong bitmask, hugetlb_fault() failures are not being recognized. This results in an infinite loop whenever follow_hugetlb_page is involved in a failed fault. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-23slab: skip calling cache_free_alien() when the platform is not numa capableSiddha, Suresh B1-2/+12
Skip calling cache_free_alien() when the platform is not numa capable. This will avoid cache misses that happen while accessing slabp (which is per page memory reference) to get nodeid. Instead use a global variable to skip the call, which is mostly likely to be present in the cache. This gives a 0.8% performance boost with the database oltp workload on a quad-core SMP platform and by any means the number is not small :) Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-23fix NULL pointer dereference in __vm_enough_memory()Alan Cox2-4/+4
The new exec code inserts an accounted vma into an mm struct which is not current->mm. The existing memory check code has a hard coded assumption that this does not happen as does the security code. As the correct mm is known we pass the mm to the security method and the helper function. A new security test is added for the case where we need to pass the mm and the existing one is modified to pass current->mm to avoid the need to change large amounts of code. (Thanks to Tobias for fixing rejects and testing) Signed-off-by: Alan Cox <alan@redhat.com> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Cc: James Morris <jmorris@redhat.com> Cc: Tobias Diedrich <ranma+kernel@tdiedrich.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-23synchronous lumpy reclaim: wait for page writeback when directly reclaiming ↵Andy Whitcroft1-8/+60
contiguous areas Lumpy reclaim works by selecting a lead page from the LRU list and then selecting pages for reclaim from the order-aligned area of pages. In the situation were all pages in that region are inactive and not referenced by any process over time, it works well. In the situation where there is even light load on the system, the pages may not free quickly. Out of a area of 1024 pages, maybe only 950 of them are freed when the allocation attempt occurs because lumpy reclaim returned early. This patch alters the behaviour of direct reclaim for large contiguous blocks. The first attempt to call shrink_page_list() is asynchronous but if it fails, the pages are submitted a second time and the calling process waits for the IO to complete. This may stall allocators waiting for contiguous memory but that should be expected behaviour for high-order users. It is preferable behaviour to potentially queueing unnecessary areas for IO. Note that kswapd will not stall in this fashion. [apw@shadowen.org: update to version 2] [apw@shadowen.org: update to version 3] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-23synchronous lumpy reclaim: ensure we count pages transitioning inactive via ↵Andy Whitcroft1-0/+1
clear_active_flags As pointed out by Mel when reclaim is applied at higher orders a significant amount of IO may be started. As this takes finite time to drain reclaim will consider more areas than ultimatly needed to satisfy the request. This leads to more reclaim than strictly required and reduced success rates. I was able to confirm Mel's test results on systems locally. These show that even under light load the success rates drop off far more than expected. Testing with a modified version of his patch (which follows) I was able to allocate almost all of ZONE_MOVABLE with a near idle system. I ran 5 test passes sequentially following system boot (the system has 29 hugepages in ZONE_MOVABLE): 2.6.23-rc1 11 8 6 7 7 sync_lumpy 28 28 29 29 26 These show that although hugely better than the near 0% success normally expected we can only allocate about a 1/4 of the zone. Using synchronous reclaim for these allocations we get close to 100% as expected. I have also run our standard high order tests and these show no regressions in allocation success rates at rest, and some significant improvements under load. This patch: We are transitioning pages from active to inactive in clear_active_flags, those need counting as PGDEACTIVATE vm events. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-23sparsemem: ensure we initialise the node mapping for SPARSEMEM_STATICAndy Whitcroft1-4/+10
Booting SPARSEMEM on NUMA systems trips a BUG in page_alloc.c: Initializing HighMem for node 0 (00038000:00100000) Initializing HighMem for node 1 (00100000:001ffe00) ------------[ cut here ]------------ kernel BUG at /home/apw/git/linux-2.6/mm/page_alloc.c:456! [...] This occurs because the section to node id mapping is not being setup correctly during init under SPARSEMEM_STATIC, leading to an attempt to free pages from all nodes into the zones on node 0. When the zone_table[] was removed in the following commit, a new section to node mapping table was introduced: commit 89689ae7f95995723fbcd5c116c47933a3bb8b13 [PATCH] Get rid of zone_table[] That conversion inadvertantly only initialised the node mapping in SPARSEMEM_EXTREME. Ensure we initialise the node mapping in SPARSEMEM_STATIC. [akpm@linux-foundation.org: make the stubs static inline] Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-12Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds1-20/+0
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: BLOCK: Hide the contents of linux/bio.h if CONFIG_BLOCK=n sysace: HDIO_GETGEO has it's own method for ages drivers/block/cpqarray.c: better error handling and kmalloc + memset conversion to k[cz]alloc drivers/block/cciss.c: kmalloc + memset conversion to kzalloc Clean up duplicate includes in drivers/block/ Fix remap handling by blktrace [PATCH] remove mm/filemap.c:file_send_actor()
2007-08-12readahead: docbook fixStephen Hemminger1-1/+1
Minor docbook error since argument name in comment doesn't match function Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-12[PATCH] remove mm/filemap.c:file_send_actor()Adrian Bunk1-20/+0
This patch removes the no longer used file_send_actor(). Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-08-10SLUB: Fix dynamic dma kmalloc cache creationChristoph Lameter1-14/+45
The dynamic dma kmalloc creation can run into trouble if a GFP_ATOMIC allocation is the first one performed for a certain size of dma kmalloc slab. - Move the adding of the slab to sysfs into a workqueue (sysfs does GFP_KERNEL allocations) - Do not call kmem_cache_destroy() (uses slub_lock) - Only acquire the slub_lock once and--if we cannot wait--do a trylock. This introduces a slight risk of the first kmalloc(x, GFP_DMA|GFP_ATOMIC) for a range of sizes failing due to another process holding the slub_lock. However, we only need to acquire the spinlock once in order to establish each power of two DMA kmalloc cache. The possible conflict is with the slub_lock taken during slab management actions (create / remove slab cache). It is rather typical that a driver will first fill its buffers using GFP_KERNEL allocations which will wait until the slub_lock can be acquired. Drivers will also create its slab caches first outside of an atomic context before starting to use atomic kmalloc from an interrupt context. If there are any failures then they will occur early after boot or when loading of multiple drivers concurrently. Drivers can already accomodate failures of GFP_ATOMIC for other reasons. Retries will then create the slab. Signed-off-by: Christoph Lameter <clameter@sgi.com>
2007-08-10SLUB: Remove checks for MAX_PARTIAL from kmem_cache_shrinkChristoph Lameter1-7/+2
The MAX_PARTIAL checks were supposed to be an optimization. However, slab shrinking is a manually triggered process either through running slabinfo or by the kernel calling kmem_cache_shrink. If one really wants to shrink a slab then all operations should be done regardless of the size of the partial list. This also fixes an issue that could surface if the number of partial slabs was initially above MAX_PARTIAL in kmem_cache_shrink and later drops below MAX_PARTIAL through the elimination of empty slabs on the partial list (rare). In that case a few slabs may be left off the partial list (and only be put back when they are empty). Signed-off-by: Christoph Lameter <clameter@sgi.com>
2007-08-01fix filemap.c kernel-docRandy Dunlap1-1/+1
Fix kernel-doc warning: Warning(linux-2.6.23-rc1-mm1//mm/filemap.c:864): No description found for parameter 'ra' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-01oom: print points as unsigned longDavid Rientjes1-1/+1
In badness(), the automatic variable 'points' is unsigned long. Print it as such. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-01Do not trigger OOM-killer for high-order allocation failuresMel Gorman1-0/+4
out_of_memory() may be called when an allocation is failing and the direct reclaim is not making any progress. This does not take into account the requested order of the allocation. If the request if for an order larger than PAGE_ALLOC_COSTLY_ORDER, it is reasonable to fail the allocation because the kernel makes no guarantees about those allocations succeeding. This false OOM situation can occur if a user is trying to grow the hugepage pool in a script like; #!/bin/bash REQUIRED=$1 echo 1 > /proc/sys/vm/hugepages_treat_as_movable echo $REQUIRED > /proc/sys/vm/nr_hugepages ACTUAL=`cat /proc/sys/vm/nr_hugepages` while [ $REQUIRED -ne $ACTUAL ]; do echo Huge page pool at $ACTUAL growing to $REQUIRED echo $REQUIRED > /proc/sys/vm/nr_hugepages ACTUAL=`cat /proc/sys/vm/nr_hugepages` sleep 1 done This is a reasonable scenario when ZONE_MOVABLE is in use but triggers OOM easily on 2.6.23-rc1. This patch will fail an allocation for an order above PAGE_ALLOC_COSTLY_ORDER instead of killing processes and retrying. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Andy Whitcroft <apw@shadowen.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-30slub: fix bug in slub debug supportPeter Zijlstra1-1/+1
We ClearSlabDebug() before the last SlabDebug() check. Clear it later. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2007-07-30slub: add lock debugging checkPeter Zijlstra1-0/+1
Ingo noticed that the SLUB code does include the lock debugging free check. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2007-07-30Remove fs.h from mm.hAlexey Dobriyan3-0/+36
Remove fs.h from mm.h. For this, 1) Uninline vma_wants_writenotify(). It's pretty huge anyway. 2) Add back fs.h or less bloated headers (err.h) to files that need it. As result, on x86_64 allyesconfig, fs.h dependencies cut down from 3929 files rebuilt down to 3444 (-12.3%). Cross-compile tested without regressions on my two usual configs and (sigh): alpha arm-mx1ads mips-bigsur powerpc-ebony alpha-allnoconfig arm-neponset mips-capcella powerpc-g5 alpha-defconfig arm-netwinder mips-cobalt powerpc-holly alpha-up arm-netx mips-db1000 powerpc-iseries arm arm-ns9xxx mips-db1100 powerpc-linkstation arm-assabet arm-omap_h2_1610 mips-db1200 powerpc-lite5200 arm-at91rm9200dk arm-onearm mips-db1500 powerpc-maple arm-at91rm9200ek arm-picotux200 mips-db1550 powerpc-mpc7448_hpc2 arm-at91sam9260ek arm-pleb mips-ddb5477 powerpc-mpc8272_ads arm-at91sam9261ek arm-pnx4008 mips-decstation powerpc-mpc8313_rdb arm-at91sam9263ek arm-pxa255-idp mips-e55 powerpc-mpc832x_mds arm-at91sam9rlek arm-realview mips-emma2rh powerpc-mpc832x_rdb arm-ateb9200 arm-realview-smp mips-excite powerpc-mpc834x_itx arm-badge4 arm-rpc mips-fulong powerpc-mpc834x_itxgp arm-carmeva arm-s3c2410 mips-ip22 powerpc-mpc834x_mds arm-cerfcube arm-shannon mips-ip27 powerpc-mpc836x_mds arm-clps7500 arm-shark mips-ip32 powerpc-mpc8540_ads arm-collie arm-simpad mips-jazz powerpc-mpc8544_ds arm-corgi arm-spitz mips-jmr3927 powerpc-mpc8560_ads arm-csb337 arm-trizeps4 mips-malta powerpc-mpc8568mds arm-csb637 arm-versatile mips-mipssim powerpc-mpc85xx_cds arm-ebsa110 i386 mips-mpc30x powerpc-mpc8641_hpcn arm-edb7211 i386-allnoconfig mips-msp71xx powerpc-mpc866_ads arm-em_x270 i386-defconfig mips-ocelot powerpc-mpc885_ads arm-ep93xx i386-up mips-pb1100 powerpc-pasemi arm-footbridge ia64 mips-pb1500 powerpc-pmac32 arm-fortunet ia64-allnoconfig mips-pb1550 powerpc-ppc64 arm-h3600 ia64-bigsur mips-pnx8550-jbs powerpc-prpmc2800 arm-h7201 ia64-defconfig mips-pnx8550-stb810 powerpc-ps3 arm-h7202 ia64-gensparse mips-qemu powerpc-pseries arm-hackkit ia64-sim mips-rbhma4200 powerpc-up arm-integrator ia64-sn2 mips-rbhma4500 s390 arm-iop13xx ia64-tiger mips-rm200 s390-allnoconfig arm-iop32x ia64-up mips-sb1250-swarm s390-defconfig arm-iop33x ia64-zx1 mips-sead s390-up arm-ixp2000 m68k mips-tb0219 sparc arm-ixp23xx m68k-amiga mips-tb0226 sparc-allnoconfig arm-ixp4xx m68k-apollo mips-tb0287 sparc-defconfig arm-jornada720 m68k-atari mips-workpad sparc-up arm-kafa m68k-bvme6000 mips-wrppmc sparc64 arm-kb9202 m68k-hp300 mips-yosemite sparc64-allnoconfig arm-ks8695 m68k-mac parisc sparc64-defconfig arm-lart m68k-mvme147 parisc-allnoconfig sparc64-up arm-lpd270 m68k-mvme16x parisc-defconfig um-x86_64 arm-lpd7a400 m68k-q40 parisc-up x86_64 arm-lpd7a404 m68k-sun3 powerpc x86_64-allnoconfig arm-lubbock m68k-sun3x powerpc-cell x86_64-defconfig arm-lusl7200 mips powerpc-celleb x86_64-up arm-mainstone mips-atlas powerpc-chrp32 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-30Introduce CONFIG_SUSPEND for suspend-to-Ram and standbyRafael J. Wysocki1-2/+2
Introduce CONFIG_SUSPEND representing the ability to enter system sleep states, such as the ACPI S3 state, and allow the user to choose SUSPEND and HIBERNATION independently of each other. Make HOTPLUG_CPU be selected automatically if SUSPEND or HIBERNATION has been chosen and the kernel is intended for SMP systems. Also, introduce CONFIG_PM_SLEEP which is automatically selected if CONFIG_SUSPEND or CONFIG_HIBERNATION is set and use it to select the code needed for both suspend and hibernation. The top-level power management headers and the ACPI code related to suspend and hibernation are modified to use the new definitions (the changes in drivers/acpi/sleep/main.c are, mostly, moving code to reduce the number of ifdefs). There are many other files in which CONFIG_PM can be replaced with CONFIG_PM_SLEEP or even with CONFIG_SUSPEND, but they can be updated in the future. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-30Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATIONRafael J. Wysocki2-5/+5
Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION to avoid confusion (among other things, with CONFIG_SUSPEND introduced in the next patch). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-26Allow nodes to exist that only contain ZONE_MOVABLEMel Gorman1-3/+3
With the introduction of kernelcore=, a configurable zone is created on request. In some cases, this value will be small enough that some nodes contain only ZONE_MOVABLE. On some NUMA configurations when this occurs, arch-independent zone-sizing will get the size of the memory holes within the node incorrect. The value of present_pages goes negative and the boot fails. This patch fixes the bug in the calculation of the size of the hole. The test case is to boot test a NUMA machine with a low value of kernelcore= before and after the patch is applied. While this bug exists in early kernel it cannot be triggered in practice. This patch has been boot-tested on a variety machines with and without kernelcore= set. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-26memory unplug: isolate_lru_page fixKAMEZAWA Hiroyuki1-2/+1
release_pages() in mm/swap.c changes page_count() to be 0 without removing PageLRU flag... This means isolate_lru_page() can see a page, PageLRU() && page_count(page)==0.. This is BUG. (get_page() will be called against count=0 page.) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-26memory unplug: migration by kernelKAMEZAWA Hiroyuki1-2/+19
In usual, migrate_pages(page,,) is called with holding mm->sem by system call. (mm here is a mm_struct which maps the migration target page.) This semaphore helps avoiding some race conditions. But, if we want to migrate a page by some kernel codes, we have to avoid some races. This patch adds check code for following race condition. 1. A page which page->mapping==NULL can be target of migration. Then, we have to check page->mapping before calling try_to_unmap(). 2. anon_vma can be freed while page is unmapped, but page->mapping remains as it was. We drop page->mapcount to be 0. Then we cannot trust page->mapping. So, use rcu_read_lock() to prevent anon_vma pointed by page->mapping from being freed during migration. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-24Merge branch 'request-queue-t' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds1-2/+2
* 'request-queue-t' of git://git.kernel.dk/linux-2.6-block: [BLOCK] Add request_queue_t and mark it deprecated [BLOCK] Get rid of request_queue_t typedef
2007-07-24slab: correctly handle __GFP_ZEROAndrew Morton1-1/+1
Use the correct local variable when calling into the page allocator. Local `flags' can have __GFP_ZERO set, which causes us to pass __GFP_ZERO into the page allocator, possibly from illegal contexts. The page allocator will later do prep_zero_page()->kmap_atomic(..., KM_USER0) from irq contexts and will then go BUG. Cc: Mike Galbraith <efault@gmx.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-24fix hugetlb page allocation leakKen Chen1-0/+1
dequeue_huge_page() has a serious memory leak upon hugetlb page allocation. The for loop continues on allocating hugetlb pages out of all allowable zone, where this function is supposedly only dequeue one and only one pages. Fixed it by breaking out of the for loop once a hugetlb page is found. Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-24[BLOCK] Get rid of request_queue_t typedefJens Axboe1-2/+2
Some of the code has been gradually transitioned to using the proper struct request_queue, but there's lots left. So do a full sweet of the kernel and get rid of this typedef and replace its uses with the proper type. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-07-22x86_64: fix section mismatch warning in init.cSam Ravnborg1-1/+1
Fix following warning: WARNING: vmlinux.o(.text+0x188ea): Section mismatch: reference to .init.text:__alloc_bootmem_core (between 'alloc_bootmem_high_node' and 'get_gate_vma') alloc_bootmem_high_node() is only used from __init scope so declare it __init. And in addition declare the weak variant __init too. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-22slob: reduce list scanningMatt Mackall1-5/+16
The version of SLOB in -mm always scans its free list from the beginning, which results in small allocations and free segments clustering at the beginning of the list over time. This causes the average search to scan over a large stretch at the beginning on each allocation. By starting each page search where the last one left off, we evenly distribute the allocations and greatly shorten the average search. Without this patch, kernel compiles on a 1.5G machine take a large amount of system time for list scanning. With this patch, compiles are within a few seconds of performance of a SLAB kernel with no notable change in system time. Signed-off-by: Matt Mackall <mpm@selenic.com> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-22remove handle_mm_fault exportChristoph Hellwig1-2/+0
Now that arch/powerpc/platforms/cell/spufs/fault.c is always built in the kernel there is no need to export handle_mm_fault anymore. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>