summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2006-03-22[PATCH] hugepage: Fix hugepage logic in free_pgtables() harderDavid Gibson1-1/+1
Turns out the hugepage logic in free_pgtables() was doubly broken. The loop coalescing multiple normal page VMAs into one call to free_pgd_range() had an off by one error, which could mean it would coalesce one hugepage VMA into the same bundle (checking 'vma' not 'next' in the loop). I transferred this bug into the new is_vm_hugetlb_page() based version. Here's the fix. This one didn't bite on powerpc previously for the same reason the is_hugepage_only_range() problem didn't: powerpc's hugetlb_free_pgd_range() is identical to free_pgd_range(). It didn't bite on ia64 because the hugepage region is distant enough from any other region that the separated PMD_SIZE distance test would always prevent coalescing the two together. No libhugetlbfs testsuite regressions (ppc64, POWER5). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: Fix hugepage logic in free_pgtables()David Gibson1-3/+2
free_pgtables() has special logic to call hugetlb_free_pgd_range() instead of the normal free_pgd_range() on hugepage VMAs. However, the test it uses to do so is incorrect: it calls is_hugepage_only_range on a hugepage sized range at the start of the vma. is_hugepage_only_range() will return true if the given range has any intersection with a hugepage address region, and in this case the given region need not be hugepage aligned. So, for example, this test can return true if called on, say, a 4k VMA immediately preceding a (nicely aligned) hugepage VMA. At present we get away with this because the powerpc version of hugetlb_free_pgd_range() is just a call to free_pgd_range(). On ia64 (the only other arch with a non-trivial is_hugepage_only_range()) we get away with it for a different reason; the hugepage area is not contiguous with the rest of the user address space, and VMAs are not permitted in between, so the test can't return a false positive there. Nonetheless this should be fixed. We do that in the patch below by replacing the is_hugepage_only_range() test with an explicit test of the VMA using is_vm_hugetlb_page(). This in turn changes behaviour for platforms where is_hugepage_only_range() returns false always (everything except powerpc and ia64). We address this by ensuring that hugetlb_free_pgd_range() is defined to be identical to free_pgd_range() (instead of a no-op) on everything except ia64. Even so, it will prevent some otherwise possible coalescing of calls down to free_pgd_range(). Since this only happens for hugepage VMAs, removing this small optimization seems unlikely to cause any trouble. This patch causes no regressions on the libhugetlbfs testsuite - ppc64 POWER5 (8-way), ppc64 G5 (2-way) and i386 Pentium M (UP). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: Make {alloc,free}_huge_page() localDavid Gibson1-12/+13
Originally, mm/hugetlb.c just handled the hugepage physical allocation path and its {alloc,free}_huge_page() functions were used from the arch specific hugepage code. These days those functions are only used with mm/hugetlb.c itself. Therefore, this patch makes them static and removes their prototypes from hugetlb.h. This requires a small rearrangement of code in mm/hugetlb.c to avoid a forward declaration. This patch causes no regressions on the libhugetlbfs testsuite (ppc64, POWER5). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: Strict page reservation for hugepage inodesDavid Gibson1-10/+126
These days, hugepages are demand-allocated at first fault time. There's a somewhat dubious (and racy) heuristic when making a new mmap() to check if there are enough available hugepages to fully satisfy that mapping. A particularly obvious case where the heuristic breaks down is where a process maps its hugepages not as a single chunk, but as a bunch of individually mmap()ed (or shmat()ed) blocks without touching and instantiating the pages in between allocations. In this case the size of each block is compared against the total number of available hugepages. It's thus easy for the process to become overcommitted, because each block mapping will succeed, although the total number of hugepages required by all blocks exceeds the number available. In particular, this defeats such a program which will detect a mapping failure and adjust its hugepage usage downward accordingly. The patch below addresses this problem, by strictly reserving a number of physical hugepages for hugepage inodes which have been mapped, but not instatiated. MAP_SHARED mappings are thus "safe" - they will fail on mmap(), not later with an OOM SIGKILL. MAP_PRIVATE mappings can still trigger an OOM. (Actually SHARED mappings can technically still OOM, but only if the sysadmin explicitly reduces the hugepage pool between mapping and instantiation) This patch appears to address the problem at hand - it allows DB2 to start correctly, for instance, which previously suffered the failure described above. This patch causes no regressions on the libhugetblfs testsuite, and makes a test (designed to catch this problem) pass which previously failed (ppc64, POWER5). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: serialize hugepage allocation and instantiationDavid Gibson1-7/+18
Currently, no lock or mutex is held between allocating a hugepage and inserting it into the pagetables / page cache. When we do go to insert the page into pagetables or page cache, we recheck and may free the newly allocated hugepage. However, since the number of hugepages in the system is strictly limited, and it's usualy to want to use all of them, this can still lead to spurious allocation failures. For example, suppose two processes are both mapping (MAP_SHARED) the same hugepage file, large enough to consume the entire available hugepage pool. If they race instantiating the last page in the mapping, they will both attempt to allocate the last available hugepage. One will fail, of course, returning OOM from the fault and thus causing the process to be killed, despite the fact that the entire mapping can, in fact, be instantiated. The patch fixes this race by the simple method of adding a (sleeping) mutex to serialize the hugepage fault path between allocation and insertion into pagetables and/or page cache. It would be possible to avoid the serialization by catching the allocation failures, waiting on some condition, then rechecking to see if someone else has instantiated the page for us. Given the likely frequency of hugepage instantiations, it seems very doubtful it's worth the extra complexity. This patch causes no regression on the libhugetlbfs testsuite, and one test, which can trigger this race now passes where it previously failed. Actually, the test still sometimes fails, though less often and only as a shmat() failure, rather processes getting OOM killed by the VM. The dodgy heuristic tests in fs/hugetlbfs/inode.c for whether there's enough hugepage space aren't protected by the new mutex, and would be ugly to do so, so there's still a race there. Another patch to replace those tests with something saner for this reason as well as others coming... Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage: Small fixes to hugepage clear/copy pathDavid Gibson1-7/+26
Move the loops used in mm/hugetlb.c to clear and copy hugepages to their own functions for clarity. As we do so, we add some checks of need_resched - we are, after all copying megabytes of memory here. We also add might_sleep() accordingly. We generally dropped locks around the clear and copy, already but not everyone has PREEMPT enabled, so we should still be checking explicitly. For this to work, we need to remove the clear_huge_page() from alloc_huge_page(), which is called with the page_table_lock held in the COW path. We move the clear_huge_page() to just after the alloc_huge_page() in the hugepage no-page path. In the COW path, the new page is about to be copied over, so clearing it was just a waste of time anyway. So as a side effect we also fix the fact that we held the page_table_lock for far too long in this path by calling alloc_huge_page() under it. It causes no regressions on the libhugetlbfs testsuite (ppc64, POWER5). Signed-off-by: David Gibson <dwg@au1.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] Enable mprotect on huge pagesZhang, Yanmin2-7/+34
2.6.16-rc3 uses hugetlb on-demand paging, but it doesn_t support hugetlb mprotect. From: David Gibson <david@gibson.dropbear.id.au> Remove a test from the mprotect() path which checks that the mprotect()ed range on a hugepage VMA is hugepage aligned (yes, really, the sense of is_aligned_hugepage_range() is the opposite of what you'd guess :-/). In fact, we don't need this test. If the given addresses match the beginning/end of a hugepage VMA they must already be suitably aligned. If they don't, then mprotect_fixup() will attempt to split the VMA. The very first test in split_vma() will check for a badly aligned address on a hugepage VMA and return -EINVAL if necessary. From: "Chen, Kenneth W" <kenneth.w.chen@intel.com> On i386 and x86-64, pte flag _PAGE_PSE collides with _PAGE_PROTNONE. The identify of hugetlb pte is lost when changing page protection via mprotect. A page fault occurs later will trigger a bug check in huge_pte_alloc(). The fix is to always make new pte a hugetlb pte and also to clean up legacy code where _PAGE_PRESENT is forced on in the pre-faulting day. Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: "David S. Miller" <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] readahead: fix initial window size calculationSteven Pratt1-3/+3
The current current get_init_ra_size is not optimal across different IO sizes and max_readahead values. Here is a quick summary of sizes computed under current design and under the attached patch. All of these assume 1st IO at offset 0, or 1st detected sequential IO. 32k max, 4k request old new ----------------- 8k 8k 16k 16k 32k 32k 128k max, 4k request old new ----------------- 32k 16k 64k 32k 128k 64k 128k 128k 128k max, 32k request old new ----------------- 32k 64k <----- 64k 128k 128k 128k 512k max, 4k request old new ----------------- 4k 32k <---- 16k 64k 64k 128k 128k 256k 512k 512k Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] readahead: ->prev_page can overrun the ahead windowOleg Nesterov1-6/+20
If get_next_ra_size() does not grow fast enough, ->prev_page can overrun the ahead window. This means the caller will read the pages from ->ahead_start + ->ahead_size to ->prev_page synchronously. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] shmem: inline to avoid warningHugh Dickins1-1/+1
shmem.c was named and shamed in Jesper's "Building 100 kernels" warnings: shmem_parse_mpol is only used when CONFIG_TMPFS parses mount options; and only called from that one site, so mark it inline like its non-NUMA stub. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] vmscan: emove obsolete checks from shrink_list() and fix unlikely in ↵Christoph Lameter1-11/+2
refill_inactive_zone() As suggested by Marcelo: 1. The optimization introduced recently for not calling page_referenced() during zone reclaim makes two additional checks in shrink_list unnecessary. 2. The if (unlikely(sc->may_swap)) in refill_inactive_zone is optimized for the zone_reclaim case. However, most peoples system only does swap. Undo that. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Marcelo Tosatti <marcelo.tosatti@cyclades.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: more CONFIG_DEBUG_VMNick Piggin2-14/+7
Put a few more checks under CONFIG_DEBUG_VM Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: prep_zero_page() in irq is a bugAndrew Morton1-0/+5
prep_zero_page() uses KM_USER0 and hence may not be used from IRQ context, at least for highmem pages. Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: cleanup prep_ stuffNick Piggin1-17/+18
Move the prep_ stuff into prep_new_page. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] remove set_page_count() outside mm/Nick Piggin3-11/+21
set_page_count usage outside mm/ is limited to setting the refcount to 1. Remove set_page_count from outside mm/, and replace those users with init_page_count() and set_page_refcounted(). This allows more debug checking, and tighter control on how code is allowed to play around with page->_count. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: nommu use compound pagesNick Piggin4-22/+10
Now that compound page handling is properly fixed in the VM, move nommu over to using compound pages rather than rolling their own refcounting. nommu vm page refcounting is broken anyway, but there is no need to have divergent code in the core VM now, nor when it gets fixed. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: David Howells <dhowells@redhat.com> (Needs testing, please). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: make __put_page internalNick Piggin3-0/+15
Remove __put_page from outside the core mm/. It is dangerous because it does not handle compound pages nicely, and misses 1->0 transitions. If a user later appears that really needs the extra speed we can reevaluate. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] remove VM_DONTCOPY bogositiesHugh Dickins1-9/+1
Now that it's madvisable, remove two pieces of VM_DONTCOPY bogosity: 1. There was and is no logical reason why VM_DONTCOPY should be in the list of flags which forbid vma merging (and those drivers which set it are also setting VM_IO, which itself forbids the merge). 2. It's hard to understand the purpose of the VM_HUGETLB, VM_DONTCOPY block in vm_stat_account: but never mind, it's under CONFIG_HUGETLB, which (unlike CONFIG_HUGETLB_PAGE or CONFIG_HUGETLBFS) has never been defined. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: shrink_inactive_lis() nr_scan accounting fixWu Fengguang1-4/+5
In shrink_inactive_list(), nr_scan is not accounted when nr_taken is 0. But 0 pages taken does not mean 0 pages scanned. Move the goto statement below the accounting code to fix it. Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: isolate_lru_pages() scan count fixWu Fengguang1-2/+2
In isolate_lru_pages(), *scanned reports one more scan because the scan counter is increased one more time on exit of the while-loop. Change the while-loop to for-loop to fix it. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] zone_reclaim: additional comments and cleanupChristoph Lameter1-4/+14
Add some comments to explain how zone reclaim works. And it fixes the following issues: - PF_SWAPWRITE needs to be set for RECLAIM_SWAP to be able to write out pages to swap. Currently RECLAIM_SWAP may not do that. - remove setting nr_reclaimed pages after slab reclaim since the slab shrinking code does not use that and the nr_reclaimed pages is just right for the intended follow up action. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] vmscan: rename functionsAndrew Morton1-15/+17
We have: try_to_free_pages ->shrink_caches(struct zone **zones, ..) ->shrink_zone(struct zone *, ...) ->shrink_cache(struct zone *, ...) ->shrink_list(struct list_head *, ...) ->refill_inactive_list((struct zone *, ...) which is fairly irrational. Rename things so that we have try_to_free_pages ->shrink_zones(struct zone **zones, ..) ->shrink_zone(struct zone *, ...) ->shrink_inactive_list(struct zone *, ...) ->shrink_page_list(struct list_head *, ...) ->shrink_active_list(struct zone *, ...) Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] vmscan return nr_reclaimedAndrew Morton1-39/+38
Change all the vmscan functions to retunr the number-of-reclaimed pages and remove scan_conrtol.nr_reclaimed. Saves ten-odd bytes of text and makes things clearer and more consistent. The patch also changes the behaviour of zone_reclaim() when it falls back to slab shrinking. Christoph says "Setting this to one means that we will rescan and shrink the slab for each allocation if we are out of zone memory and RECLAIM_SLAB is set. Plus if we do an order 0 allocation we do not go off node as intended. "We better set this to zero. This means the allocation will go offnode despite us having potentially freed lots of memory on the zone. Future allocations can then again be done from this zone." Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] vmscan: use unsigned longsAndrew Morton1-45/+59
Turn basically everything in vmscan.c into `unsigned long'. This is to avoid the possibility that some piece of code in there might decide to operate upon more than 4G (or even 2G) of pages in one hit. This might be silly, but we'll need it one day. Cc: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] vmscan: scan_control cleanupAndrew Morton1-46/+62
Initialise as much of scan_control as possible at the declaration site. This tidies things up a bit and assures us that all unmentioned fields are zeroed out. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] Thin out scan_control: remove nr_to_scan and priorityChristoph Lameter1-34/+25
Make nr_to_scan and priority a parameter instead of putting it into scan control. This allows various small optimizations and IMHO makes the code easier to read. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] slab: use on_each_cpu()Andrew Morton1-19/+2
Slab duplicates on_each_cpu(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] slab: Remove SLAB_NO_REAP optionChristoph Lameter1-11/+2
SLAB_NO_REAP is documented as an option that will cause this slab not to be reaped under memory pressure. However, that is not what happens. The only thing that SLAB_NO_REAP controls at the moment is the reclaim of the unused slab elements that were allocated in batch in cache_reap(). Cache_reap() is run every few seconds independently of memory pressure. Could we remove the whole thing? Its only used by three slabs anyways and I cannot find a reason for having this option. There is an additional problem with SLAB_NO_REAP. If set then the recovery of objects from alien caches is switched off. Objects not freed on the same node where they were initially allocated will only be reused if a certain amount of objects accumulates from one alien node (not very likely) or if the cache is explicitly shrunk. (Strangely __cache_shrink does not check for SLAB_NO_REAP) Getting rid of SLAB_NO_REAP fixes the problems with alien cache freeing. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] slab: fix kernel-doc warningsRandy Dunlap1-3/+12
Fix kernel-doc warnings in mm/slab.c. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: kill kmem_cache_t usagePekka Enberg4-8/+10
We have struct kmem_cache now so use it instead of the old typedef. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] slab: remove cachep->spinlockRavikiran G Thirumalai1-11/+9
Remove cachep->spinlock. Locking has moved to the kmem_list3 and most of the structures protected earlier by cachep->spinlock is now protected by the l3->list_lock. slab cache tunables like batchcount are accessed always with the cache_chain_mutex held. Patch tested on SMP and NUMA kernels with dbench processes running, constant onlining/offlining, and constant cache tuning, all at the same time. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <christoph@lameter.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] slab cleanupAndrew Morton1-292/+304
slab.c has become a bit revolting again. Try to repair it. - Coding style fixes - Don't do assignments-in-if-statements. - Don't typecast assignments to/from void* Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] slab: extract setup_cpu_cachePekka Enberg1-54/+55
Extract setup_cpu_cache() function from kmem_cache_create() to make the latter a little less complex. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] slab: object to index mapping cleanupPekka Enberg1-11/+23
Clean up the object to index mapping that has been spread around mm/slab.c. Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] hugepage allocator cleanupNick Piggin1-16/+8
Insert "fresh" huge pages into the hugepage allocator by the same means as they are freed back into it. This reduces code size and allows enqueue_huge_page to be inlined into the hugepage free fastpath. Eliminate occurances of hugepages on the free list with non-zero refcount. This can allow stricter refcount checks in future. Also required for lockless pagecache. Signed-off-by: Nick Piggin <npiggin@suse.de> "This patch also eliminates a leak "cleaned up" by re-clobbering the refcount on every allocation from the hugepage freelists. With respect to the lockless pagecache, the crucial aspect is to eliminate unconditional set_page_count() to 0 on pages with potentially nonzero refcounts, though closer inspection suggests the assignments removed are entirely spurious." Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: cleanup bootmemNick Piggin1-13/+7
The bootmem code added to page_alloc.c duplicated some page freeing code that it really doesn't need to because it is not so performance critical. While we're here, make prefetching work properly by actually prefetching the page we're about to use before prefetching ahead to the next one (ie. get the most important transaction started first). Also prefetch just a single page ahead rather than leaving a gap of 16. Jack Steiner reported no problems with SGI's ia64 simulator. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: split highorder pagesNick Piggin2-3/+23
Have an explicit mm call to split higher order pages into individual pages. Should help to avoid bugs and be more explicit about the code's intention. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Russell King <rmk@arm.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: simplify vmscan vs release refcountingNick Piggin1-14/+11
The VM has an interesting race where a page refcount can drop to zero, but it is still on the LRU lists for a short time. This was solved by testing a 0->1 refcount transition when picking up pages from the LRU, and dropping the refcount in that case. Instead, use atomic_add_unless to ensure we never pick up a 0 refcount page from the LRU, thus a 0 refcount page will never have its refcount elevated until it is allocated again. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: slab less atomicsNick Piggin1-3/+3
Atomic operation removal from slab Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: page_alloc less atomicsNick Piggin1-2/+2
More atomic operation removal from page allocator Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: less atomic opsNick Piggin1-2/+2
In the page release paths, we can be sure that nobody will mess with our page->flags because the refcount has dropped to 0. So no need for atomic operations here. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: PageActive no testsetNick Piggin2-4/+5
PG_active is protected by zone->lru_lock, it does not need TestSet/TestClear operations. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: PageLRU no testsetNick Piggin2-17/+19
PG_lru is protected by zone->lru_lock. It does not need TestSet/TestClear operations. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] mm: never ClearPageLRU released pagesNick Piggin2-32/+38
If vmscan finds a zero refcount page on the lru list, never ClearPageLRU it. This means the release code need not hold ->lru_lock to stabilise PageLRU, so that lock may be skipped entirely when releasing !PageLRU pages (because we know PageLRU won't have been temporarily cleared by vmscan, which was previously guaranteed by holding the lock to synchronise against vmscan). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22[PATCH] __get_page_state() cpumask cleanup and fixAndrew Morton1-11/+9
__get_page_state() has an open-coded for_each_cpu_mask() loop in it. Tidy that up, then notice that the code was buggy: while (cpu < NR_CPUS) { unsigned long *in, *out, off; if (!cpu_isset(cpu, *cpumask)) continue; an obvious infinite loop. I guess we just never call it with a holey cpu mask. Even after my cpumask size-reduction work, this patch increases code size :( Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17[PATCH] fix free swap cache latencyHugh Dickins1-2/+3
Lee Revell reported 28ms latency when process with lots of swapped memory exits. 2.6.15 introduced a latency regression when unmapping: in accounting the zap_work latency breaker, pte_none counted 1, pte_present PAGE_SIZE, but a swap entry counted nothing at all. We think of pages present as the slow case, but Lee's trace shows that free_swap_and_cache's radix tree lookup can make a lot of work - and we could have been doing it many thousands of times without a latency break. Move the zap_work update up to account swap entries like pages present. This does account non-linear pte_file entries, and unmap_mapping_range skipping over swap entries, by the same amount even though they're quick: but neither of those cases deserves complicating the code (and they're treated no worse than they were in 2.6.14). Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17[PATCH] fix race in pagevec_strip?Christoph Lameter1-1/+2
We can call try_to_release_page() with PagePrivate off and a valid page->mapping This may cause all sorts of trouble for the filesystem *_releasepage() handlers. XFS bombs out in that case. Lock the page before checking for page private. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-17[PATCH] page migration: Fail with error if swap not setupChristoph Lameter1-2/+12
Currently the migration of anonymous pages will silently fail if no swap is setup. This patch makes page migration functions check for available swap and fail with -ENODEV if no swap space is available. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-15[PATCH] Consistent capabilites associated with MPOL_MOVE_ALLChristoph Lameter1-4/+4
It seems that setting scheduling policy and priorities is also the kind of thing that might be performed in apps that also use the NUMA API, so it would seem consistent to use CAP_SYS_NICE for NUMA also. So use CAP_SYS_NICE for controlling migration permissions. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-15[PATCH] page migration: fail if page is in a vma flagged VM_LOCKEDChristoph Lameter1-6/+12
page migration currently simply retries a couple of times if try_to_unmap() fails without inspecting the return code. However, SWAP_FAIL indicates that the page is in a vma that has the VM_LOCKED flag set (if ignore_refs ==1). We can check for that return code and avoid retrying the migration. migrate_page_remove_references() now needs to return a reason why the failure occured. So switch migrate_page_remove_references to use -Exx style error messages. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>