From a2c43eed8334e878702fca713b212ae2a11d84b9 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:36 -0800 Subject: mm: try_to_free_swap replaces remove_exclusive_swap_page remove_exclusive_swap_page(): its problem is in living up to its name. It doesn't matter if someone else has a reference to the page (raised page_count); it doesn't matter if the page is mapped into userspace (raised page_mapcount - though that hints it may be worth keeping the swap): all that matters is that there be no more references to the swap (and no writeback in progress). swapoff (try_to_unuse) has been removing pages from swapcache for years, with no concern for page count or page mapcount, and we used to have a comment in lookup_swap_cache() recognizing that: if you go for a page of swapcache, you'll get the right page, but it could have been removed from swapcache by the time you get page lock. So, give up asking for exclusivity: get rid of remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and remove_exclusive_swap_page_count() which were spawned for the recent LRU work: replace them by the simpler try_to_free_swap() which just checks page_swapcount(). Similarly, remove the page_count limitation from free_swap_and_cache(), but assume that it's worth holding on to the swap if page is mapped and swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()? It would be consistent, but I think we probably have enough for now. Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Cc: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index d196f46c8808..c8601dd36603 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -759,7 +759,7 @@ cull_mlocked: activate_locked: /* Not a candidate for swapping, so reclaim swap space. */ if (PageSwapCache(page) && vm_swap_full()) - remove_exclusive_swap_page_ref(page); + try_to_free_swap(page); VM_BUG_ON(PageActive(page)); SetPageActive(page); pgactivate++; -- cgit v1.2.3 From 63d6c5ad7fc27455ce5cb4706884671fb7e0df08 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:38 -0800 Subject: mm: remove try_to_munlock from vmscan An unfortunate feature of the Unevictable LRU work was that reclaiming an anonymous page involved an extra scan through the anon_vma: to check that the page is evictable before allocating swap, because the swap could not be freed reliably soon afterwards. Now that try_to_free_swap() has replaced remove_exclusive_swap_page(), that's not an issue any more: remove the try_to_munlock() call from shrink_page_list(), leaving it to try_to_unmap() to discover if the page is one to be culled to the unevictable list - in which case then try_to_free_swap(). Update unevictable-lru.txt to remove comments on the try_to_munlock() in shrink_page_list(), and shorten some lines over 80 columns.
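The new helper itself lives in mm/swapfile.c, so its body never appears in this vmscan-limited log. Roughly, as described in the two changelogs above, it has the following shape; this is a sketch inferred from the description, not a verbatim copy of the committed code:

	int try_to_free_swap(struct page *page)
	{
		VM_BUG_ON(!PageLocked(page));

		if (!PageSwapCache(page))
			return 0;
		if (PageWriteback(page))	/* swap write still in flight */
			return 0;
		if (page_swapcount(page))	/* swap referenced beyond this cache entry */
			return 0;

		delete_from_swap_cache(page);
		SetPageDirty(page);		/* data now lives only in the page */
		return 1;
	}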
Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Acked-by: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/unevictable-lru.txt | 63 +++++++++++------------------------- mm/vmscan.c | 11 ++----- 2 files changed, 20 insertions(+), 54 deletions(-) (limited to 'mm/vmscan.c') diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt index 125eed560e5a..0706a7282a8c 100644 --- a/Documentation/vm/unevictable-lru.txt +++ b/Documentation/vm/unevictable-lru.txt @@ -137,13 +137,6 @@ shrink_page_list() where they will be detected when vmscan walks the reverse map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list() will cull the page at that point. -Note that for anonymous pages, shrink_page_list() attempts to add the page to -the swap cache before it tries to unmap the page. To avoid this unnecessary -consumption of swap space, shrink_page_list() calls try_to_munlock() to check -whether any VM_LOCKED vmas map the page without attempting to unmap the page. -If try_to_munlock() returns SWAP_MLOCK, shrink_page_list() will cull the page -without consuming swap space. try_to_munlock() will be described below. - To "cull" an unevictable page, vmscan simply puts the page back on the lru list using putback_lru_page()--the inverse operation to isolate_lru_page()-- after dropping the page lock. Because the condition which makes the page @@ -190,8 +183,8 @@ several places: in the VM_LOCKED flag being set for the vma. 3) in the fault path, if mlocked pages are "culled" in the fault path, and when a VM_LOCKED stack segment is expanded. -4) as mentioned above, in vmscan:shrink_page_list() with attempting to - reclaim a page in a VM_LOCKED vma--via try_to_unmap() or try_to_munlock(). +4) as mentioned above, in vmscan:shrink_page_list() when attempting to + reclaim a page in a VM_LOCKED vma via try_to_unmap(). Mlocked pages become unlocked and rescued from the unevictable list when: @@ -260,9 +253,9 @@ mlock_fixup() filters several classes of "special" vmas: 2) vmas mapping hugetlbfs page are already effectively pinned into memory. We don't need nor want to mlock() these pages. However, to preserve the - prior behavior of mlock()--before the unevictable/mlock changes--mlock_fixup() - will call make_pages_present() in the hugetlbfs vma range to allocate the - huge pages and populate the ptes. + prior behavior of mlock()--before the unevictable/mlock changes-- + mlock_fixup() will call make_pages_present() in the hugetlbfs vma range + to allocate the huge pages and populate the ptes. 3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of kernel pages, such as the vdso page, relay channel pages, etc. These pages @@ -322,7 +315,7 @@ __mlock_vma_pages_range()--the same function used to mlock a vma range-- passing a flag to indicate that munlock() is being performed. Because the vma access protections could have been changed to PROT_NONE after -faulting in and mlocking some pages, get_user_pages() was unreliable for visiting +faulting in and mlocking pages, get_user_pages() was unreliable for visiting these pages for munlocking. 
Because we don't want to leave pages mlocked(), get_user_pages() was enhanced to accept a flag to ignore the permissions when fetching the pages--all of which should be resident as a result of previous @@ -416,8 +409,8 @@ Mlocked Pages: munmap()/exit()/exec() System Call Handling When unmapping an mlocked region of memory, whether by an explicit call to munmap() or via an internal unmap from exit() or exec() processing, we must munlock the pages if we're removing the last VM_LOCKED vma that maps the pages. -Before the unevictable/mlock changes, mlocking did not mark the pages in any way, -so unmapping them required no processing. +Before the unevictable/mlock changes, mlocking did not mark the pages in any +way, so unmapping them required no processing. To munlock a range of memory under the unevictable/mlock infrastructure, the munmap() hander and task address space tear down function call @@ -517,12 +510,10 @@ couldn't be mlocked. Mlocked pages: try_to_munlock() Reverse Map Scan TODO/FIXME: a better name might be page_mlocked()--analogous to the -page_referenced() reverse map walker--especially if we continue to call this -from shrink_page_list(). See related TODO/FIXME below. +page_referenced() reverse map walker. -When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() System -Call Handling" above--tries to munlock a page, or when shrink_page_list() -encounters an anonymous page that is not yet in the swap cache, they need to +When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() +System Call Handling" above--tries to munlock a page, it needs to determine whether or not the page is mapped by any VM_LOCKED vma, without actually attempting to unmap all ptes from the page. For this purpose, the unevictable/mlock infrastructure introduced a variant of try_to_unmap() called @@ -535,10 +526,7 @@ for VM_LOCKED vmas. When such a vma is found for anonymous pages and file pages mapped in linear VMAs, as in the try_to_unmap() case, the functions attempt to acquire the associated mmap semphore, mlock the page via mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the -pre-clearing of the page's PG_mlocked done by munlock_vma_page() and informs -shrink_page_list() that the anonymous page should be culled rather than added -to the swap cache in preparation for a try_to_unmap() that will almost -certainly fail. +pre-clearing of the page's PG_mlocked done by munlock_vma_page. If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() @@ -557,10 +545,7 @@ However, the scan can terminate when it encounters a VM_LOCKED vma and can successfully acquire the vma's mmap semphore for read and mlock the page. Although try_to_munlock() can be called many [very many!] times when munlock()ing a large region or tearing down a large address space that has been -mlocked via mlockall(), overall this is a fairly rare event. In addition, -although shrink_page_list() calls try_to_munlock() for every anonymous page that -it handles that is not yet in the swap cache, on average anonymous pages will -have very short reverse map lists. +mlocked via mlockall(), overall this is a fairly rare event. Mlocked Page: Page Reclaim in shrink_*_list() @@ -588,8 +573,8 @@ Some examples of these unevictable pages on the LRU lists are: munlock_vma_page() was forced to let the page back on to the normal LRU list for vmscan to handle. 
-shrink_inactive_list() also culls any unevictable pages that it finds -on the inactive lists, again diverting them to the appropriate zone's unevictable +shrink_inactive_list() also culls any unevictable pages that it finds on +the inactive lists, again diverting them to the appropriate zone's unevictable lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from @@ -597,19 +582,7 @@ the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter, but will pass on to shrink_page_list(). shrink_page_list() again culls obviously unevictable pages that it could -encounter for similar reason to shrink_inactive_list(). As already discussed, -shrink_page_list() proactively looks for anonymous pages that should have -PG_mlocked set but don't--these would not be detected by page_evictable()--to -avoid adding them to the swap cache unnecessarily. File pages mapped into +encounter for similar reason to shrink_inactive_list(). Pages mapped into VM_LOCKED vmas but without PG_mlocked set will make it all the way to -try_to_unmap(). shrink_page_list() will divert them to the unevictable list when -try_to_unmap() returns SWAP_MLOCK, as discussed above. - -TODO/FIXME: If we can enhance the swap cache to reliably remove entries -with page_count(page) > 2, as long as all ptes are mapped to the page and -not the swap entry, we can probably remove the call to try_to_munlock() in -shrink_page_list() and just remove the page from the swap cache when -try_to_unmap() returns SWAP_MLOCK. Currently, remove_exclusive_swap_page() -doesn't seem to allow that. - - +try_to_unmap(). shrink_page_list() will divert them to the unevictable list +when try_to_unmap() returns SWAP_MLOCK, as discussed above. diff --git a/mm/vmscan.c b/mm/vmscan.c index c8601dd36603..74f875733e2b 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -625,15 +625,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (PageAnon(page) && !PageSwapCache(page)) { if (!(sc->gfp_mask & __GFP_IO)) goto keep_locked; - switch (try_to_munlock(page)) { - case SWAP_FAIL: /* shouldn't happen */ - case SWAP_AGAIN: - goto keep_locked; - case SWAP_MLOCK: - goto cull_mlocked; - case SWAP_SUCCESS: - ; /* fall thru'; add to swap cache */ - } if (!add_to_swap(page, GFP_ATOMIC)) goto activate_locked; may_enter_fs = 1; @@ -752,6 +743,8 @@ free_it: continue; cull_mlocked: + if (PageSwapCache(page)) + try_to_free_swap(page); unlock_page(page); putback_lru_page(page); continue; -- cgit v1.2.3 From ac47b003d03c2a4f28aef1d505b66d24ad191c4f Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:39 -0800 Subject: mm: remove gfp_mask from add_to_swap Remove gfp_mask argument from add_to_swap(): it's misleading because its only caller, shrink_page_list(), is not atomic at that point; and in due course (implementing discard) we'll sometimes want to allocate some memory with GFP_NOIO (as is used in swap_writepage) when allocating swap. No change to the gfp_mask passed down to add_to_swap_cache(): still use __GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before): though it's not obvious if that's the best combination to ask for here. 
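For readers who don't have the gfp flags memorized, the combination named above means roughly the following; these are the standard gfp semantics of the time, nothing introduced by this patch, and the SWAPCACHE_GFP name is purely illustrative:

	/* Illustrative only - the kernel passes these flags inline and
	 * defines no such macro. */
	#define SWAPCACHE_GFP	(__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)
	/*
	 * __GFP_HIGH       - high priority, may dip below the normal watermarks
	 * __GFP_NOMEMALLOC - ...but never into the emergency memalloc reserves
	 * __GFP_NOWARN     - no console warning if the allocation fails
	 * no __GFP_WAIT    - must not sleep or recurse into reclaim, so the
	 *                    radix-tree node allocation simply fails under pressure
	 */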
Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Cc: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 2 +- mm/swap_state.c | 4 ++-- mm/vmscan.c | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) (limited to 'mm/vmscan.c') diff --git a/include/linux/swap.h b/include/linux/swap.h index c3ecd478840e..c38bd157695b 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -278,7 +278,7 @@ extern void end_swap_bio_read(struct bio *bio, int err); extern struct address_space swapper_space; #define total_swapcache_pages swapper_space.nrpages extern void show_swap_cache_info(void); -extern int add_to_swap(struct page *, gfp_t); +extern int add_to_swap(struct page *); extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); extern void __delete_from_swap_cache(struct page *); extern void delete_from_swap_cache(struct page *); diff --git a/mm/swap_state.c b/mm/swap_state.c index bcb472769299..81c825f67a7f 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -128,7 +128,7 @@ void __delete_from_swap_cache(struct page *page) * Allocate swap space for the page and add the page to the * swap cache. Caller needs to hold the page lock. */ -int add_to_swap(struct page * page, gfp_t gfp_mask) +int add_to_swap(struct page *page) { swp_entry_t entry; int err; @@ -153,7 +153,7 @@ int add_to_swap(struct page * page, gfp_t gfp_mask) * Add it to the swap cache and mark it dirty */ err = add_to_swap_cache(page, entry, - gfp_mask|__GFP_NOMEMALLOC|__GFP_NOWARN); + __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN); switch (err) { case 0: /* Success */ diff --git a/mm/vmscan.c b/mm/vmscan.c index 74f875733e2b..cc7401571cba 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -625,7 +625,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (PageAnon(page) && !PageSwapCache(page)) { if (!(sc->gfp_mask & __GFP_IO)) goto keep_locked; - if (!add_to_swap(page, GFP_ATOMIC)) + if (!add_to_swap(page)) goto activate_locked; may_enter_fs = 1; } -- cgit v1.2.3 From 60371d971a3d01afd102f0bbf2681f32ecc31d78 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:40 -0800 Subject: mm: add add_to_swap stub If we add a failing stub for add_to_swap(), then we can remove the #ifdef CONFIG_SWAP from mm/vmscan.c. This was intended as a source cleanup, but looking more closely, it turns out that the !CONFIG_SWAP case was going to keep_locked for an anonymous page, whereas now it goes to the more suitable activate_locked, like the CONFIG_SWAP nr_swap_pages 0 case. 
Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Acked-by: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 5 +++++ mm/vmscan.c | 2 -- 2 files changed, 5 insertions(+), 2 deletions(-) (limited to 'mm/vmscan.c') diff --git a/include/linux/swap.h b/include/linux/swap.h index c38bd157695b..c0d23ac710d5 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -371,6 +371,11 @@ static inline struct page *lookup_swap_cache(swp_entry_t swp) return NULL; } +static inline int add_to_swap(struct page *page) +{ + return 0; +} + static inline int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask) { diff --git a/mm/vmscan.c b/mm/vmscan.c index cc7401571cba..f350523a8eee 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -617,7 +617,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, referenced && page_mapping_inuse(page)) goto activate_locked; -#ifdef CONFIG_SWAP /* * Anonymous process memory has backing store? * Try to allocate it some swap space here. @@ -629,7 +628,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, goto activate_locked; may_enter_fs = 1; } -#endif /* CONFIG_SWAP */ mapping = page_mapping(page); -- cgit v1.2.3 From b962716b459505a8d83aea313fea0abe76749f42 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:41 -0800 Subject: mm: optimize get_scan_ratio for no swap Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP. Yes, the gcc optimizer gives us that, when nr_swap_pages is #defined as 0L. Move usual declaration to swapfile.c: it never belonged in page_alloc.c. Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Acked-by: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/swap.h | 5 +++-- mm/page_alloc.c | 1 - mm/swapfile.c | 1 + mm/vmscan.c | 12 ++++++------ 4 files changed, 10 insertions(+), 9 deletions(-) (limited to 'mm/vmscan.c') diff --git a/include/linux/swap.h b/include/linux/swap.h index c0d23ac710d5..3a31cc25bd2c 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -163,7 +163,6 @@ struct swap_list_t { /* linux/mm/page_alloc.c */ extern unsigned long totalram_pages; extern unsigned long totalreserve_pages; -extern long nr_swap_pages; extern unsigned int nr_free_buffer_pages(void); extern unsigned int nr_free_pagecache_pages(void); @@ -291,6 +290,7 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t, struct vm_area_struct *vma, unsigned long addr); /* linux/mm/swapfile.c */ +extern long nr_swap_pages; extern long total_swap_pages; extern void si_swapinfo(struct sysinfo *); extern swp_entry_t get_swap_page(void); @@ -331,7 +331,8 @@ static inline void disable_swap_token(void) #else /* CONFIG_SWAP */ -#define total_swap_pages 0 +#define nr_swap_pages 0L +#define total_swap_pages 0L #define total_swapcache_pages 0UL #define si_swapinfo(val) \ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ce2a026219bc..65133436fe3d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -69,7 +69,6 @@ EXPORT_SYMBOL(node_states); unsigned long totalram_pages __read_mostly; unsigned long totalreserve_pages __read_mostly; -long nr_swap_pages; int percpu_pagelist_fraction; #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE diff --git a/mm/swapfile.c b/mm/swapfile.c index 9ce7f81c8abc..725e56c362de 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -35,6 +35,7 @@ static DEFINE_SPINLOCK(swap_lock); static 
unsigned int nr_swapfiles; +long nr_swap_pages; long total_swap_pages; static int swap_overflow; static int least_priority; diff --git a/mm/vmscan.c b/mm/vmscan.c index f350523a8eee..d500b906746c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1327,12 +1327,6 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc, unsigned long anon_prio, file_prio; unsigned long ap, fp; - anon = zone_page_state(zone, NR_ACTIVE_ANON) + - zone_page_state(zone, NR_INACTIVE_ANON); - file = zone_page_state(zone, NR_ACTIVE_FILE) + - zone_page_state(zone, NR_INACTIVE_FILE); - free = zone_page_state(zone, NR_FREE_PAGES); - /* If we have no swap space, do not bother scanning anon pages. */ if (nr_swap_pages <= 0) { percent[0] = 0; @@ -1340,6 +1334,12 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc, return; } + anon = zone_page_state(zone, NR_ACTIVE_ANON) + + zone_page_state(zone, NR_INACTIVE_ANON); + file = zone_page_state(zone, NR_ACTIVE_FILE) + + zone_page_state(zone, NR_INACTIVE_FILE); + free = zone_page_state(zone, NR_FREE_PAGES); + /* If we have very few page cache pages, force-scan anon pages. */ if (unlikely(file + free <= zone->pages_high)) { percent[0] = 100; -- cgit v1.2.3 From 077cbc5864cd9188fa4c4e181e48ff58317e6400 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:42 -0800 Subject: memcg: reclaim shouldn't change zone->recent_rotated statistics memcg reclaim shouldn't change zone->recent_rotated statistics. If memcg reclaim changes zone statistics, global reclaim can get a bit confused. Signed-off-by: KOSAKI Motohiro Acked-by: Rik van Riel Cc: Balbir Singh Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index d500b906746c..da7c3a2304a7 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1246,7 +1246,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, * This helps balance scan pressure between file and anonymous * pages in get_scan_ratio. */ - zone->recent_rotated[!!file] += pgmoved; + if (scan_global_lru(sc)) + zone->recent_rotated[!!file] += pgmoved; /* * Move the pages to the [file or anon] inactive list. -- cgit v1.2.3 From ff30153bf9647c8646538810d4c01015a5e44787 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:44 -0800 Subject: mm: make scan_all_zones_unevictable_pages() static sparse outputs the following warning. mm/vmscan.c:2549:6: warning: symbol 'scan_all_zones_unevictable_pages' was not declared. Should it be static? cleanup here. Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index da7c3a2304a7..062767d159da 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2506,7 +2506,7 @@ void scan_zone_unevictable_pages(struct zone *zone) * that has possibly/probably made some previously unevictable pages * evictable. */ -void scan_all_zones_unevictable_pages(void) +static void scan_all_zones_unevictable_pages(void) { struct zone *zone; -- cgit v1.2.3 From 14b90b22ec0f359ef4791033ab386b2b627bae07 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:45 -0800 Subject: mm: make scan_zone_unevictable_pages() static sparse outputs the following warning. mm/vmscan.c:2507:6: warning: symbol 'scan_zone_unevictable_pages' was not declared.
Should it be static? cleanup here. Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 062767d159da..1ef5a2eeb298 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2464,7 +2464,7 @@ void scan_mapping_unevictable_pages(struct address_space *mapping) * back onto @zone's unevictable list. */ #define SCAN_UNEVICTABLE_BATCH_SIZE 16UL /* arbitrary lock hold batch size */ -void scan_zone_unevictable_pages(struct zone *zone) +static void scan_zone_unevictable_pages(struct zone *zone) { struct list_head *l_unevictable = &zone->lru[LRU_UNEVICTABLE].list; unsigned long scan; -- cgit v1.2.3 From a79311c14eae4bb946a97af25f3e1b17d625985d Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Tue, 6 Jan 2009 14:40:01 -0800 Subject: vmscan: bail out of direct reclaim after swap_cluster_max pages When the VM is under pressure, it can happen that several direct reclaim processes are in the pageout code simultaneously. It also happens that the reclaiming processes run into mostly referenced, mapped and dirty pages in the first round. This results in multiple direct reclaim processes having a lower pageout priority, which corresponds to a higher target of pages to scan. This in turn can result in each direct reclaim process freeing many pages. Together, they can end up freeing way too many pages. This kicks useful data out of memory (in some cases more than half of all memory is swapped out). It also impacts performance by keeping tasks stuck in the pageout code for too long. A 30% improvement in hackbench has been observed with this patch. The fix is relatively simple: in shrink_zone() we can check how many pages we have already freed, direct reclaim tasks break out of the scanning loop if they have already freed enough pages and have reached a lower priority level. We do not break out of shrink_zone() when priority == DEF_PRIORITY, to ensure that equal pressure is applied to every zone in the common case. However, in order to do this we do need to know how many pages we already freed, so move nr_reclaimed into scan_control. akpm: a historical interlude... We tried this in 2004: :commit e468e46a9bea3297011d5918663ce6d19094cf87 :Author: akpm :Date: Thu Jun 24 15:53:52 2004 +0000 : :[PATCH] vmscan.c: dont reclaim too many pages : : The shrink_zone() logic can, under some circumstances, cause far too many : pages to be reclaimed. Say, we're scanning at high priority and suddenly hit : a large number of reclaimable pages on the LRU. : Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed. And we reverted it in 2006: :commit 210fe530305ee50cd889fe9250168228b2994f32 :Author: Andrew Morton :Date: Fri Jan 6 00:11:14 2006 -0800 : : [PATCH] vmscan: balancing fix : : Revert a patch which went into 2.6.8-rc1. The changelog for that patch was: : : The shrink_zone() logic can, under some circumstances, cause far too many : pages to be reclaimed. Say, we're scanning at high priority and suddenly : hit a large number of reclaimable pages on the LRU. : : Change things so we bale out when SWAP_CLUSTER_MAX pages have been : reclaimed. : : Problem is, this change caused significant imbalance in inter-zone scan : balancing by truncating scans of larger zones. : : Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. 
The zone : balancing algorithm would require that if we're scanning 100 pages of : ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will : cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are : reclaimed. Thus effectively causing smaller zones to be scanned relatively : harder than large ones. : : Now I need to remember what the workload was which caused me to write this : patch originally, then fix it up in a different way... And we haven't demonstrated that whatever problem caused that reversion is not being reintroduced by this change in 2008. Signed-off-by: Rik van Riel Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 62 ++++++++++++++++++++++++++++++++----------------------------- 1 file changed, 33 insertions(+), 29 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 1ef5a2eeb298..5faa7739487f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -52,6 +52,9 @@ struct scan_control { /* Incremented by the number of inactive pages that were scanned */ unsigned long nr_scanned; + /* Number of pages freed so far during a call to shrink_zones() */ + unsigned long nr_reclaimed; + /* This context's GFP mask */ gfp_t gfp_mask; @@ -1400,12 +1403,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc, /* * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. */ -static unsigned long shrink_zone(int priority, struct zone *zone, +static void shrink_zone(int priority, struct zone *zone, struct scan_control *sc) { unsigned long nr[NR_LRU_LISTS]; unsigned long nr_to_scan; - unsigned long nr_reclaimed = 0; unsigned long percent[2]; /* anon @ 0; file @ 1 */ enum lru_list l; @@ -1446,10 +1448,21 @@ static unsigned long shrink_zone(int priority, struct zone *zone, (unsigned long)sc->swap_cluster_max); nr[l] -= nr_to_scan; - nr_reclaimed += shrink_list(l, nr_to_scan, + sc->nr_reclaimed += shrink_list(l, nr_to_scan, zone, sc, priority); } } + /* + * On large memory systems, scan >> priority can become + * really large. This is fine for the starting priority; + * we want to put equal scanning pressure on each zone. + * However, if the VM has a harder time of freeing pages, + * with multiple processes reclaiming pages, the total + * freeing target can get unreasonably large. + */ + if (sc->nr_reclaimed > sc->swap_cluster_max && + priority < DEF_PRIORITY && !current_is_kswapd()) + break; } /* @@ -1462,7 +1475,6 @@ static unsigned long shrink_zone(int priority, struct zone *zone, shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0); throttle_vm_writeout(sc->gfp_mask); - return nr_reclaimed; } /* @@ -1476,16 +1488,13 @@ static unsigned long shrink_zone(int priority, struct zone *zone, * b) The zones may be over pages_high but they must go *over* pages_high to * satisfy the `incremental min' zone defense algorithm. * - * Returns the number of reclaimed pages. - * * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. 
*/ -static unsigned long shrink_zones(int priority, struct zonelist *zonelist, +static void shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) { enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); - unsigned long nr_reclaimed = 0; struct zoneref *z; struct zone *zone; @@ -1516,10 +1525,8 @@ static unsigned long shrink_zones(int priority, struct zonelist *zonelist, priority); } - nr_reclaimed += shrink_zone(priority, zone, sc); + shrink_zone(priority, zone, sc); } - - return nr_reclaimed; } /* @@ -1544,7 +1551,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, int priority; unsigned long ret = 0; unsigned long total_scanned = 0; - unsigned long nr_reclaimed = 0; struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long lru_pages = 0; struct zoneref *z; @@ -1572,7 +1578,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, sc->nr_scanned = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zonelist, sc); + shrink_zones(priority, zonelist, sc); /* * Don't shrink slabs when reclaiming memory from * over limit cgroups @@ -1580,13 +1586,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, if (scan_global_lru(sc)) { shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages); if (reclaim_state) { - nr_reclaimed += reclaim_state->reclaimed_slab; + sc->nr_reclaimed += reclaim_state->reclaimed_slab; reclaim_state->reclaimed_slab = 0; } } total_scanned += sc->nr_scanned; - if (nr_reclaimed >= sc->swap_cluster_max) { - ret = nr_reclaimed; + if (sc->nr_reclaimed >= sc->swap_cluster_max) { + ret = sc->nr_reclaimed; goto out; } @@ -1609,7 +1615,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, } /* top priority shrink_zones still had more to do? don't OOM, then */ if (!sc->all_unreclaimable && scan_global_lru(sc)) - ret = nr_reclaimed; + ret = sc->nr_reclaimed; out: /* * Now that we've scanned all the zones at this priority level, note @@ -1704,7 +1710,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order) int priority; int i; unsigned long total_scanned; - unsigned long nr_reclaimed; struct reclaim_state *reclaim_state = current->reclaim_state; struct scan_control sc = { .gfp_mask = GFP_KERNEL, @@ -1723,7 +1728,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order) loop_again: total_scanned = 0; - nr_reclaimed = 0; + sc.nr_reclaimed = 0; sc.may_writepage = !laptop_mode; count_vm_event(PAGEOUTRUN); @@ -1809,11 +1814,11 @@ loop_again: */ if (!zone_watermark_ok(zone, order, 8*zone->pages_high, end_zone, 0)) - nr_reclaimed += shrink_zone(priority, zone, &sc); + shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages); - nr_reclaimed += reclaim_state->reclaimed_slab; + sc.nr_reclaimed += reclaim_state->reclaimed_slab; total_scanned += sc.nr_scanned; if (zone_is_all_unreclaimable(zone)) continue; @@ -1827,7 +1832,7 @@ loop_again: * even in laptop mode */ if (total_scanned > SWAP_CLUSTER_MAX * 2 && - total_scanned > nr_reclaimed + nr_reclaimed / 2) + total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2) sc.may_writepage = 1; } if (all_zones_ok) @@ -1845,7 +1850,7 @@ loop_again: * matches the direct reclaim path behaviour in terms of impact * on zone->*_priority. 
*/ - if (nr_reclaimed >= SWAP_CLUSTER_MAX) + if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) break; } out: @@ -1867,7 +1872,7 @@ out: goto loop_again; } - return nr_reclaimed; + return sc.nr_reclaimed; } /* @@ -2219,7 +2224,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) struct task_struct *p = current; struct reclaim_state reclaim_state; int priority; - unsigned long nr_reclaimed = 0; struct scan_control sc = { .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE), .may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP), @@ -2252,9 +2256,9 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) priority = ZONE_RECLAIM_PRIORITY; do { note_zone_scanning_priority(zone, priority); - nr_reclaimed += shrink_zone(priority, zone, &sc); + shrink_zone(priority, zone, &sc); priority--; - } while (priority >= 0 && nr_reclaimed < nr_pages); + } while (priority >= 0 && sc.nr_reclaimed < nr_pages); } slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE); @@ -2278,13 +2282,13 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) * Update nr_reclaimed by the number of slab pages we * reclaimed from this zone. */ - nr_reclaimed += slab_reclaimable - + sc.nr_reclaimed += slab_reclaimable - zone_page_state(zone, NR_SLAB_RECLAIMABLE); } p->reclaim_state = NULL; current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE); - return nr_reclaimed >= nr_pages; + return sc.nr_reclaimed >= nr_pages; } int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) -- cgit v1.2.3 From 01dbe5c9b1004dab045cb7f38428258ca9cddc02 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:40:02 -0800 Subject: vmscan: improve reclaim throughput to bail out patch The vmscan bail-out patch moved the nr_reclaimed variable into struct scan_control. Unfortunately, the indirect access can easily cause cache misses. Under heavy memory pressure that's OK - cache misses are already plentiful and the extra ones are not observable. But when memory pressure is light, the performance degradation is observable. I compared the following three patterns (each measured 10 times):

hackbench 125 process 3000
hackbench 130 process 3000
hackbench 135 process 3000

             2.6.28-rc6                    bail-out
       125      130      135        125      130      135
==============================================================
    71.866   75.86    81.274     93.414   73.254  193.382
    74.145   78.295   77.27      74.897   75.021   80.17
    70.305   77.643   75.855     70.134   77.571   79.896
    74.288   73.986   75.955     77.222   78.48    80.619
    72.029   79.947   78.312     75.128   82.172   79.708
    71.499   77.615   77.042     74.177   76.532   77.306
    76.188   74.471   83.562     73.839   72.43    79.833
    73.236   75.606   78.743     76.001   76.557   82.726
    69.427   77.271   76.691     76.236   79.371  103.189
    72.473   76.978   80.643     69.128   78.932   75.736

avg 72.545   76.767   78.534     76.017   77.03    93.256
std  1.89     1.71     2.41       6.29     2.79    34.16
min 69.427   73.986   75.855     69.128   72.43    75.736
max 76.188   79.947   83.562     93.414   82.172  193.382

That is about a 4-5% degradation. This patch therefore introduces a temporary local variable.
result:

             2.6.28-rc6                   this patch
num    125      130      135        125      130      135
==============================================================
    71.866   75.86    81.274     67.302   68.269   77.161
    74.145   78.295   77.27      72.616   72.712   79.06
    70.305   77.643   75.855     72.475   75.712   77.735
    74.288   73.986   75.955     69.229   73.062   78.814
    72.029   79.947   78.312     71.551   74.392   78.564
    71.499   77.615   77.042     69.227   74.31    78.837
    76.188   74.471   83.562     70.759   75.256   76.6
    73.236   75.606   78.743     69.966   76.001   78.464
    69.427   77.271   76.691     69.068   75.218   80.321
    72.473   76.978   80.643     72.057   77.151   79.068

avg 72.545   76.767   78.534     70.425   74.2083  78.462
std  1.89     1.71     2.41       1.66     2.34     1.00
min 69.427   73.986   75.855     67.302   68.269   76.6
max 76.188   79.947   83.562     72.616   77.151   80.321

OK, the degradation has disappeared. Signed-off-by: KOSAKI Motohiro Acked-by: Rik van Riel Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 5faa7739487f..13f050d667e9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1410,6 +1410,8 @@ static void shrink_zone(int priority, struct zone *zone, unsigned long nr_to_scan; unsigned long percent[2]; /* anon @ 0; file @ 1 */ enum lru_list l; + unsigned long nr_reclaimed = sc->nr_reclaimed; + unsigned long swap_cluster_max = sc->swap_cluster_max; get_scan_ratio(zone, sc, percent); @@ -1425,7 +1427,7 @@ static void shrink_zone(int priority, struct zone *zone, } zone->lru[l].nr_scan += scan; nr[l] = zone->lru[l].nr_scan; - if (nr[l] >= sc->swap_cluster_max) + if (nr[l] >= swap_cluster_max) zone->lru[l].nr_scan = 0; else nr[l] = 0; @@ -1444,12 +1446,11 @@ static void shrink_zone(int priority, struct zone *zone, nr[LRU_INACTIVE_FILE]) { for_each_evictable_lru(l) { if (nr[l]) { - nr_to_scan = min(nr[l], - (unsigned long)sc->swap_cluster_max); + nr_to_scan = min(nr[l], swap_cluster_max); nr[l] -= nr_to_scan; - sc->nr_reclaimed += shrink_list(l, nr_to_scan, - zone, sc, priority); + nr_reclaimed += shrink_list(l, nr_to_scan, + zone, sc, priority); } } /* * On large memory systems, scan >> priority can become * really large. This is fine for the starting priority; * we want to put equal scanning pressure on each zone. * However, if the VM has a harder time of freeing pages, * with multiple processes reclaiming pages, the total * freeing target can get unreasonably large. */ - if (sc->nr_reclaimed > sc->swap_cluster_max && + if (nr_reclaimed > swap_cluster_max && priority < DEF_PRIORITY && !current_is_kswapd()) break; } + sc->nr_reclaimed = nr_reclaimed; + /* * Even if we did not try to evict anon pages at all, we want to * rebalance the anon lru active/inactive ratio. -- cgit v1.2.3 From 09f445e7f5107c91be12ed386350de6cd055e0a4 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:40:03 -0800 Subject: mm: kill zone_is_near_oom() zone_is_near_oom() is unused. Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 5 ----- 1 file changed, 5 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 13f050d667e9..466a36b3bada 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1167,11 +1167,6 @@ static inline void note_zone_scanning_priority(struct zone *zone, int priority) zone->prev_priority = priority; } -static inline int zone_is_near_oom(struct zone *zone) -{ - return zone->pages_scanned >= (zone_lru_pages(zone) * 3); -} - /* * This moves pages from the active list to the inactive list.
* -- cgit v1.2.3 From b555749aac87d7c2637f153e44bd77c7fdf4c65b Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Tue, 6 Jan 2009 14:40:13 -0800 Subject: vmscan: shrink_active_list(): reduce lru_lock hold time These three statements manipulate local variables and do not need the lock coverage. Cc: Johannes Weiner Cc: Lee Schermerhorn Cc: Rik van Riel Signed-off-by: Linus Torvalds --- mm/vmscan.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 466a36b3bada..5daf606e0a35 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1237,6 +1237,13 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, list_add(&page->lru, &l_inactive); } + /* + * Move the pages to the [file or anon] inactive list. + */ + pagevec_init(&pvec, 1); + pgmoved = 0; + lru = LRU_BASE + file * LRU_FILE; + spin_lock_irq(&zone->lru_lock); /* * Count referenced pages from currently used mappings as @@ -1247,13 +1254,6 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, if (scan_global_lru(sc)) zone->recent_rotated[!!file] += pgmoved; - /* - * Move the pages to the [file or anon] inactive list. - */ - pagevec_init(&pvec, 1); - - pgmoved = 0; - lru = LRU_BASE + file * LRU_FILE; while (!list_empty(&l_inactive)) { page = lru_to_page(&l_inactive); prefetchw_prev_lru_page(page, &l_inactive, flags); -- cgit v1.2.3 From 73ce02e96fe34a983199a9855b2ae738f960a6ee Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:40:33 -0800 Subject: mm: stop kswapd's infinite loop at high order allocation Wassim Dagash reported the following kswapd infinite loop problem. kswapd runs in some infinite loop trying to swap until order 10 of zone highmem is OK.... kswapd will continue to try to balance order 10 of zone highmem forever (or until someone releases a very large chunk of highmem). For non-order-0 allocations, the system may never be balanced due to fragmentation but kswapd should not infinitely loop as a result. Instead, recheck all watermarks at order-0 as they are the most important. If watermarks are ok, kswapd will go back to sleep. [akpm@linux-foundation.org: fix comment] Reported-by: wassim dagash Signed-off-by: KOSAKI Motohiro Reviewed-by: Nick Piggin Signed-off-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) (limited to 'mm/vmscan.c') diff --git a/mm/vmscan.c b/mm/vmscan.c index 5daf606e0a35..b07c48b09a93 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1867,6 +1867,23 @@ out: try_to_freeze(); + /* + * Fragmentation may mean that the system cannot be + * rebalanced for high-order allocations in all zones. + * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX, + * it means the zones have been fully scanned and are still + * not balanced. For high-order allocations, there is + * little point trying all over again as kswapd may + * infinite loop. + * + * Instead, recheck all watermarks at order-0 as they + * are the most important. If watermarks are ok, kswapd will go + * back to sleep. High-order users can still perform direct + * reclaim if they wish. + */ + if (sc.nr_reclaimed < SWAP_CLUSTER_MAX) + order = sc.order = 0; + goto loop_again; } -- cgit v1.2.3
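Taken together, the swap-allocation changes earlier in this series leave the anonymous-page branch of shrink_page_list() looking roughly as below. This is only a summary assembled from the hunks above (2.6.29-era code), not an additional patch:

	/*
	 * Anonymous process memory has backing store?
	 * Try to allocate it some swap space here.
	 */
	if (PageAnon(page) && !PageSwapCache(page)) {
		if (!(sc->gfp_mask & __GFP_IO))
			goto keep_locked;
		if (!add_to_swap(page))
			goto activate_locked;
		may_enter_fs = 1;
	}

With the !CONFIG_SWAP stub of add_to_swap() always returning 0, the same branch now falls through to activate_locked rather than keep_locked, which is the behavioural change noted in the add_to_swap stub commit.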