summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2012-12-11Merge tag 'mmc-updates-for-3.8-rc1' of ↵Linus Torvalds52-1722/+1781
git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc Pull MMC updates from Chris Ball: "MMC highlights for 3.8: Core: - Expose access to the eMMC RPMB ("Replay Protected Memory Block") area by extending the existing mmc_block ioctl. - Add SDIO powered-suspend DT properties to the core MMC DT binding. - Add no-1-8-v DT flag for boards where the SD controller reports that it supports 1.8V but the board itself has no way to switch to 1.8V. - More work on switching to 1.8V UHS support using a vqmmc regulator. - Fix up a case where the slot-gpio helper may fail to reset the host controller properly if a card was removed during a transfer. - Fix several cases where a broken device could cause an infinite loop while we wait for a register to update. Drivers: - at91-mci: Remove obsolete driver, atmel-mci handles these devices now. - sdhci-dove: Allow using GPIOs for card-detect notifications. - sdhci-esdhc: Fix for recovering from ADMA errors on broken silicon. - sdhci-s3c: Add pinctrl support. - wmt-sdmmc: New driver for WonderMedia SD/MMC controllers." * tag 'mmc-updates-for-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (65 commits) mmc: sdhci: implement the .card_event() method mmc: extend the slot-gpio card-detection to use host's .card_event() method mmc: add a card-event host operation mmc: sdhci-s3c: Fix compilation warning mmc: sdhci-pci: Enable SDHCI_CAN_DO_HISPD for Ricoh SDHCI controller mmc: sdhci-dove: allow GPIOs to be used for card detection on Dove mmc: sdhci-dove: use two-stage initialization for sdhci-pltfm mmc: sdhci-dove: use devm_clk_get() mmc: eSDHC: Recover from ADMA errors mmc: dw_mmc: remove duplicated buswidth code mmc: dw_mmc: relocate where dw_mci_setup_bus() is called from mmc: Limit MMC speed to 52MHz if not HS200 mmc: dw_mmc: use devres functions in dw_mmc mmc: sh_mmcif: remove unneeded clock connection ID mmc: sh_mobile_sdhi: remove unneeded clock connection ID mmc: sh_mobile_sdhi: fix clock frequency printing mmc: Remove redundant null check before kfree in bus.c mmc: Remove redundant null check before kfree in sdio_bus.c mmc: sdhci-imx-esdhc: use more devm_* functions mmc: dt: add no-1-8-v device tree flag ...
2012-12-11net: gro: avoid double copy in skb_gro_receive()Eric Dumazet1-1/+0
__copy_skb_header(nskb, p) already copied p->cb[], no need to copy it again. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11bridge: fix seq check in br_mdb_dump()Cong Wang3-4/+5
In case of rehashing, introduce a global variable 'br_mdb_rehash_seq' which gets increased every time when rehashing, and assign net->dev_base_seq + br_mdb_rehash_seq to cb->seq. In theory cb->seq could be wrapped to zero, but this is not easy to fix, as net->dev_base_seq is not visible inside br_mdb_rehash(). In practice, this is rare. Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Thomas Graf <tgraf@suug.ch> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11Btrfs: make ordered extent be flushed by multi-taskMiao Xie2-9/+37
Though the process of the ordered extents is a bit different with the delalloc inode flush, but we can see it as a subset of the delalloc inode flush, so we also handle them by flush workers. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: make ordered operations be handled by multi-taskMiao Xie3-21/+45
The process of the ordered operations is similar to the delalloc inode flush, so we handle them by flush workers. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: make delalloc inodes be flushed by multi-taskMiao Xie5-8/+103
This patch introduce a new worker pool named "flush_workers", and if we want to force all the inode with pending delalloc to the disks, we can queue those inodes into the work queue of the worker pool, in this way, those inodes will be flushed by multi-task. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: fill the global reserve when unpinning spaceJosef Bacik1-5/+24
Dave gave me an image of a very full file system that would abort the transaction because it ran out of space while committing the transaction. This is because we would think there was plenty of room to create a snapshot even though the global reserve was not full. This happens because we calculate the global reserve size before we unpin any space, so after we unpin the space we allow reservations to occur even though we haven't reserved all of the space for our global reserve. Fix this by adding to the global reserve while unpinning in order to make sure we always have enough space to do our work. With this patch we no longer end up with an aborted transaction, we return ENOSPC properly to the person trying to create the snapshot. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: cleanup unused argumentsLiu Bo1-7/+6
'disk_key' is not used at all. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: kill unnecessary arguments in del_ptrLiu Bo1-9/+7
The argument 'tree_mod_log' is not necessary since all of callers enable it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: reorder tree mod log operations in deleting a pointerLiu Bo1-4/+6
Since we don't use MOD_LOG_KEY_REMOVE_WHILE_MOVING to add nritems during rewinding, we should insert a MOD_LOG_KEY_REMOVE operation first. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritemsLiu Bo1-2/+2
Key MOD_LOG_KEY_REMOVE_WHILE_MOVING means that we're doing memmove inside an extent buffer node, and the node's number of items remains unchanged (unless we are inserting a single pointer, but we have MOD_LOG_KEY_ADD for that). So we don't need to increase node's number of items during rewinding, otherwise we may get an node larger than leafsize and cause general protection errors later. Here is the details, - If we do memory move for inserting a single pointer, we need to add node's nritems by one, and we honor MOD_LOG_KEY_ADD for adding. - If we do memory move for deleting a single pointer, we need to decrease node's nritems by one, and we honor MOD_LOG_KEY_REMOVE for deleting. - If we do memory move for balance left/right, we need to decrease node's nritems, and we honor MOD_LOG_KEY_REMOVE for balaning. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: fix unnecessary while loop when search the free space, cacheMiao Xie1-20/+10
When we find a bitmap free space entry, we may check the previous extent entry covers the offset or not. But if we find this entry is also a bitmap entry, we will continue to check the previous entry of the current one by a while loop. It is unnecessary because it is impossible that the extent entry which is in front of a bitmap entry can cover the offset of the entry after that bitmap entry. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: recheck bio against block device when we map the bioJosef Bacik1-28/+131
Alex reported a problem where we were writing between chunks on a rbd device. The thing is we do bio_add_page using logical offsets, but the physical offset may be different. So when we map the bio now check to see if the bio is still ok with the physical offset, and if it is not split the bio up and redo the bio_add_page with the physical sector. This fixes the problem for Alex and doesn't affect performance in the normal case. Thanks, Reported-and-tested-by: Alex Elder <elder@inktank.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: improve the noflush reservationMiao Xie8-86/+97
In some places(such as: evicting inode), we just can not flush the reserved space of delalloc, flushing the delayed directory index and delayed inode is OK, but we don't try to flush those things and just go back when there is no enough space to be reserved. This patch fixes this problem. We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL. If we can in the transaction, we should not flush anything, or the deadlock would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used, and we will flush all things. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: fix wrong comment in can_overcommit()Miao Xie1-3/+3
The comment is not coincident with the code. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11Btrfs: cleanup duplicated division functionsMiao Xie3-40/+46
div_factor{_fine} has been implemented for two times, cleanup it. And I move them into a independent file named math.h because they are common math functions. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11net: smc911x: use io{read,write}*_rep accessorsMatthew Leach2-12/+12
The {read,write}s{b,w,l} operations are not defined by all architectures and are being removed from the asm-generic/io.h interface. This patch replaces the usage of these string functions in the smc911x accessors with io{read,write}{8,16,32}_rep calls instead. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ben Herrenschmidt <benh@kernel.crashing.org> Cc: netdev@vger.kernel.org Signed-off-by: Matthew Leach <matthew@mattleach.net> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11net: remove obsolete simple_strto<foo>Abhijit Pawar3-3/+0
This patch removes the redundant occurences of simple_strto<foo> Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11tun: allow setting ethernet addresss while runningstephen hemminger1-0/+1
This is a pure software device, and ok with live address change. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11net: gro: dev_gro_receive() cleanupEric Dumazet1-26/+26
__napi_gro_receive() is inlined from two call sites for no good reason. Lets move the prep stuff in a function of its own, called only if/when needed. This saves 300 bytes on x86 : # size net/core/dev.o.after net/core/dev.o.before text data bss dec hex filename 51968 1238 1040 54246 d3e6 net/core/dev.o.before 51664 1238 1040 53942 d2b6 net/core/dev.o.after Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11net: fix a race in gro_cell_poll()Eric Dumazet1-5/+9
Dmitry Kravkov reported packet drops for GRE packets since GRO support was added. There is a race in gro_cell_poll() because we call napi_complete() without any synchronization with a concurrent gro_cells_receive() Once bug was triggered, we queued packets but did not schedule NAPI poll. We can fix this issue using the spinlock protected the napi_skbs queue, as we have to hold it to perform skb dequeue anyway. As we open-code skb_dequeue(), we no longer need to mask IRQS, as both producer and consumer run under BH context. Bug added in commit c9e6bc644e (net: add gro_cells infrastructure) Reported-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11bnx2x: use netdev_alloc_frag()Eric Dumazet2-13/+32
Using netdev_alloc_frag() instead of kmalloc() permits better GRO or TCP coalescing behavior, as skb_gro_receive() doesn't have to fallback to frag_list overhead. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Dmitry Kravkov <dmitry@broadcom.com> Cc: Eilon Greenstein <eilong@broadcom.com> Acked-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-12-11CIFS: Fix write after setting a read lock for read oplock filesPavel Shilovsky3-31/+65
If we have a read oplock and set a read lock in it, we can't write to the locked area - so, filemap_fdatawrite may fail with a no information for a userspace application even if we request a write to non-locked area. Fix this by populating the page cache without marking affected pages dirty after a successful write directly to the server. Also remove CONFIG_CIFS_SMB2 ifdefs because it's suitable for both CIFS and SMB2 protocols. Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: parse the device name into UNC and prepathJeff Layton1-7/+88
This should fix a regression that was introduced when the new mount option parser went in. Also, when the unc= and prefixpath= options are provided, check their values against the ones we parsed from the device string. If they differ, then throw a warning that tells the user that we're using the values from the unc= option for now, but that that will change in 3.10. Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: fix up handling of prefixpath= optionJeff Layton2-27/+12
Currently the code takes care to ensure that the prefixpath has a leading '/' delimiter. What if someone passes us a prefixpath with a leading '\\' instead? The code doesn't properly handle that currently AFAICS. Let's just change the code to skip over any leading delimiter character when copying the prepath. Then, fix up the users of the prepath option to prefix it with the correct delimiter when they use it. Also, there's no need to limit the length of the prefixpath to 1k. If the server can handle it, why bother forbidding it? Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: clean up handling of unc= optionJeff Layton1-27/+12
Make sure we free any existing memory allocated for vol->UNC, just in case someone passes in multiple unc= options. Get rid of the check for too long a UNC. The check for >300 bytes seems arbitrary. We later copy this into the tcon->treeName, for instance and it's a lot shorter than 300 bytes. Eliminate an extra kmalloc and copy as well. Just set the vol->UNC directly with the contents of match_strdup. Establish that the UNC should be stored with '\\' delimiters. Use convert_delimiter to change it in place in the vol->UNC. Finally, move the check for a malformed UNC into cifs_parse_mount_options so we can catch that situation earlier. Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11cifs: fix SID binary to string conversionJeff Layton2-9/+24
The authority fields are supposed to be represented by a single 48-bit value. It's also supposed to represent the value as hex if it's equal to or greater than 2^32. This is documented in MS-DTYP, section 2.4.2.1. Also, fix up the max string length to account for this fix. Acked-by: Pavel Shilovsky <piastry@etersoft.ru> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2012-12-11of: *node argument to of_parse_phandle_with_args should be constGuennadi Liakhovetski2-2/+2
The "struct device_node *" argument of of_parse_phandle_with_args() can be const. Making this change makes it explicit that the function will not modify a node. Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> [grant.likely: Resolved conflict with previous patch modifying of_parse_phandle()] Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-12-11of/i2c: add dummy inline functions for when CONFIG_OF_I2C(_MODULE) isn't definedGuennadi Liakhovetski1-0/+12
If CONFIG_OF_I2C and CONFIG_OF_I2C_MODULE are undefined no declaration of of_find_i2c_device_by_node and of_find_i2c_adapter_by_node will be available. Add dummy inline functions to avoid compilation breakage. Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-12-11NFSv4.1: Be conservative about the client highest slotidTrond Myklebust1-6/+16
If the server sends us a target that looks like an outlier, but is lower than the existing target, then respect it anyway. However defer actually updating the generation counter until we get a target that doesn't look like an outlier. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-11exofs: clean up the correct page collection on write errorIdan Kedar1-2/+4
if ore_write() fails, we would unlock the pages of pcol, which is now empty, rather than pcol_copy which owns the pages when ore_write() is called. this means that no pages will actually be unlocked (pcol.nr_pages == 0) and the writing process (more accurately, the syncing process) will hang waiting for a writeback notification that never comes. moreover, if ore_write() fails, pcol_free() is called for pcol, whereas pcol_copy is the object owning the ore_io_state, thus leaking the ore_io_state. [Boaz] I have simplified Idan's original patch a bit, everything else still holds Signed-off-by: Idan Kedar <idank@tonian.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-12-11NFSv4.1: Handle NFS4ERR_BADSLOT errors correctlyTrond Myklebust1-1/+12
Most (all) NFS4ERR_BADSLOT errors are due to the client failing to respect the server's sr_highest_slotid limit. This mainly happens due to reordered RPC requests. The way to handle it is simply to drop the slot that we're using, and retry using the new highest_slotid limits. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-11mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalableIngo Molnar9-39/+50
rmap_walk_anon() and try_to_unmap_anon() appears to be too careful about locking the anon vma: while it needs protection against anon vma list modifications, it does not need exclusive access to the list itself. Transforming this exclusive lock to a read-locked rwsem removes a global lock from the hot path of page-migration intense threaded workloads which can cause pathological performance like this: 96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch | --- perf_trace_sched_switch __schedule schedule schedule_preempt_disabled __mutex_lock_common.isra.6 __mutex_lock_slowpath mutex_lock | |--50.61%-- rmap_walk | move_to_new_page | migrate_pages | migrate_misplaced_page | __do_numa_page.isra.69 | handle_pte_fault | handle_mm_fault | __do_page_fault | do_page_fault | page_fault | __memset_sse2 | | | --100.00%-- worker_thread | | | --100.00%-- start_thread | --49.39%-- page_lock_anon_vma try_to_unmap_anon try_to_unmap migrate_pages migrate_misplaced_page __do_numa_page.isra.69 handle_pte_fault handle_mm_fault __do_page_fault do_page_fault page_fault __memset_sse2 | --100.00%-- worker_thread start_thread With this change applied the profile is now nicely flat and there's no anon-vma related scheduling/blocking. Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(), to make it clearer that it's an exclusive write-lock in that case - suggested by Rik van Riel. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm/rmap: Convert the struct anon_vma::mutex to an rwsemIngo Molnar4-25/+25
Convert the struct anon_vma::mutex to an rwsem, which will help in solving a page-migration scalability problem. (Addressed in a separate patch.) The conversion is simple and straightforward: in every case where we mutex_lock()ed we'll now down_write(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: migrate: Account a transhuge page properly when rate limitingMel Gorman1-4/+4
If there is excessive migration due to NUMA balancing it gets rate limited. It does this by counting the number of pages it has migrated recently but counts a transhuge page as 1 page. Account for it properly. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Account for failed allocations and isolations as migration failuresMel Gorman1-1/+4
Subject says it all. Allocation failures and a failure to isolate should be accounted as a migration failure. This is partially another difference between base page and transhuge page migration. A base page migration makes multiple attempts for these conditions before it would be accounted for as a failure. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Add THP migration for the NUMA working set scanning fault case ↵Mel Gorman2-7/+11
build fix Commit "Add THP migration for the NUMA working set scanning fault case" breaks the build because HPAGE_PMD_SHIFT and HPAGE_PMD_MASK defined to explode without CONFIG_TRANSPARENT_HUGEPAGE: mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put': mm/migrate.c:1549: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1564: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1566: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1573: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1606: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed mm/migrate.c:1648: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed CONFIG_NUMA_BALANCING allows compilation without enabling transparent hugepages, so define the dummy function for such a configuration and only define migrate_misplaced_transhuge_page_put() when transparent hugepages are enabled. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Add THP migration for the NUMA working set scanning fault case.Mel Gorman5-64/+255
Note: This is very heavily based on a patch from Peter Zijlstra with fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch put a lot of migration logic into mm/huge_memory.c where it does not belong. This version puts tries to share some of the migration logic with migrate_misplaced_page. However, it should be noted that now migrate.c is doing more with the pagetable manipulation than is preferred. The end result is barely recognisable so as before, the signed-offs had to be removed but will be re-added if the original authors are ok with it. Add THP migration for the NUMA working set scanning fault case. It uses the page lock to serialize. No migration pte dance is necessary because the pte is already unmapped when we decide to migrate. [dhillf@gmail.com: Fix memory leak on isolation failure] [dhillf@gmail.com: Fix transfer of last_nid information] Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: numa: Delay PTE scanning until a task is scheduled on a new nodeMel Gorman4-1/+34
Due to the fact that migrations are driven by the CPU a task is running on there is no point tracking NUMA faults until one task runs on a new node. This patch tracks the first node used by an address space. Until it changes, PTE scanning is disabled and no NUMA hinting faults are trapped. This should help workloads that are short-lived, do not care about NUMA placement or have bound themselves to a single node. This takes advantage of the logic in "mm: sched: numa: Implement slow start for working set sampling" to delay when the checks are made. This will take advantage of processes that set their CPU and node bindings early in their lifetime. It will also potentially allow any initial load balancing to take place. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: numa: Control enabling and disabling of NUMA balancing if ↵Mel Gorman2-1/+16
!SCHED_DEBUG The "mm: sched: numa: Control enabling and disabling of NUMA balancing" depends on scheduling debug being enabled but it's perfectly legimate to disable automatic NUMA balancing even without this option. This should take care of it. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: numa: Control enabling and disabling of NUMA balancingMel Gorman7-17/+101
This patch adds Kconfig options and kernel parameters to allow the enabling and disabling of automatic NUMA balancing. The existance of such a switch was and is very important when debugging problems related to transparent hugepages and we should have the same for automatic NUMA placement. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrateMel Gorman7-15/+44
The PTE scanning rate and fault rates are two of the biggest sources of system CPU overhead with automatic NUMA placement. Ideally a proper policy would detect if a workload was properly placed, schedule and adjust the PTE scanning rate accordingly. We do not track the necessary information to do that but we at least know if we migrated or not. This patch scans slower if a page was not migrated as the result of a NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is now higher than the previous default. Once every minute it will reset the scanner in case of phase changes. This is hilariously crude and the numbers are arbitrary. Workloads will converge quite slowly in comparison to what a proper policy should be able to do. On the plus side, we will chew up less CPU for workloads that have no need for automatic balancing. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Use a two-stage filter to restrict pages being migrated for ↵Mel Gorman1-1/+29
unlikely task<->node relationships Note: This two-stage filter was taken directly from the sched/numa patch "sched, numa, mm: Add the scanning page fault machinery" but is only a partial extraction. As the end result is not necessarily recognisable, the signed-offs-by had to be removed. Will be added back if requested. While it is desirable that all threads in a process run on its home node, this is not always possible or necessary. There may be more threads than exist within the node or the node might over-subscribed with unrelated processes. This can cause a situation whereby a page gets migrated off its home node because the threads clearing pte_numa were running off-node. This patch uses page->last_nid to build a two-stage filter before pages get migrated to avoid problems with short or unlikely task<->node relationships. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: migrate: Set last_nid on newly allocated pageHillf Danton1-0/+3
Pass last_nid from misplaced page to newly allocated migration target page. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: split_huge_page: Transfer last_nid on tail pageHillf Danton1-0/+1
Pass last_nid from head page to tail page. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Introduce last_nid to the page frameMel Gorman3-0/+36
This patch introduces a last_nid field to the page struct. This is used to build a two-stage filter in the next patch that is aimed at mitigating a problem whereby pages migrate to the wrong node when referenced by a process that was running off its home node. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11sched: numa: Slowly increase the scanning period as NUMA faults are handledMel Gorman1-1/+10
Currently the rate of scanning for an address space is controlled by the individual tasks. The next scan is simply determined by 2*p->numa_scan_period. The 2*p->numa_scan_period is arbitrary and never changes. At this point there is still no proper policy that decides if a task or process is properly placed. It just scans and assumes the next NUMA fault will place it properly. As it is assumed that pages will get properly placed over time, increase the scan window each time a fault is incurred. This is a big assumption as noted in the comments. It should be noted that changing to p->numa_scan_period will increase system CPU usage because now the scanning rate has effectively doubled. If that is a problem then the min_rate should be made 200ms instead of restoring the 2* logic. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Rate limit setting of pte_numa if node is saturatedMel Gorman3-0/+35
If there are a large number of NUMA hinting faults and all of them are resulting in migrations it may indicate that memory is just bouncing uselessly around. NUMA balancing cost is likely exceeding any benefit from locality. Rate limit the PTE updates if the node is migration rate-limited. As noted in the comments, this distorts the NUMA faulting statistics. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Rate limit the amount of memory that is migrated between nodesMel Gorman1-1/+29
NOTE: This is very heavily based on similar logic in autonuma. It should be signed off by Andrea but because there was no standalone patch and it's sufficiently different from what he did that the signed-off is omitted. Will be added back if requested. If a large number of pages are misplaced then the memory bus can be saturated just migrating pages between nodes. This patch rate-limits the amount of memory that can be migrating between nodes. Signed-off-by: Mel Gorman <mgorman@suse.de>
2012-12-11mm: numa: Structures for Migrate On Fault per NUMA migration rate limitingAndrea Arcangeli2-0/+18
This defines the per-node data used by Migrate On Fault in order to rate limit the migration. The rate limiting is applied independently to each destination node. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de>