Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc
Pull MMC updates from Chris Ball:
"MMC highlights for 3.8:
Core:
- Expose access to the eMMC RPMB ("Replay Protected Memory Block")
area by extending the existing mmc_block ioctl.
- Add SDIO powered-suspend DT properties to the core MMC DT binding.
- Add no-1-8-v DT flag for boards where the SD controller reports
that it supports 1.8V but the board itself has no way to switch to
1.8V.
- More work on switching to 1.8V UHS support using a vqmmc regulator.
- Fix up a case where the slot-gpio helper may fail to reset the host
controller properly if a card was removed during a transfer.
- Fix several cases where a broken device could cause an infinite
loop while we wait for a register to update.
Drivers:
- at91-mci: Remove obsolete driver, atmel-mci handles these devices
now.
- sdhci-dove: Allow using GPIOs for card-detect notifications.
- sdhci-esdhc: Fix for recovering from ADMA errors on broken silicon.
- sdhci-s3c: Add pinctrl support.
- wmt-sdmmc: New driver for WonderMedia SD/MMC controllers."
* tag 'mmc-updates-for-3.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/cjb/mmc: (65 commits)
mmc: sdhci: implement the .card_event() method
mmc: extend the slot-gpio card-detection to use host's .card_event() method
mmc: add a card-event host operation
mmc: sdhci-s3c: Fix compilation warning
mmc: sdhci-pci: Enable SDHCI_CAN_DO_HISPD for Ricoh SDHCI controller
mmc: sdhci-dove: allow GPIOs to be used for card detection on Dove
mmc: sdhci-dove: use two-stage initialization for sdhci-pltfm
mmc: sdhci-dove: use devm_clk_get()
mmc: eSDHC: Recover from ADMA errors
mmc: dw_mmc: remove duplicated buswidth code
mmc: dw_mmc: relocate where dw_mci_setup_bus() is called from
mmc: Limit MMC speed to 52MHz if not HS200
mmc: dw_mmc: use devres functions in dw_mmc
mmc: sh_mmcif: remove unneeded clock connection ID
mmc: sh_mobile_sdhi: remove unneeded clock connection ID
mmc: sh_mobile_sdhi: fix clock frequency printing
mmc: Remove redundant null check before kfree in bus.c
mmc: Remove redundant null check before kfree in sdio_bus.c
mmc: sdhci-imx-esdhc: use more devm_* functions
mmc: dt: add no-1-8-v device tree flag
...
|
|
__copy_skb_header(nskb, p) already copied p->cb[], no need to copy
it again.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In case of rehashing, introduce a global variable 'br_mdb_rehash_seq'
which gets increased every time when rehashing, and assign
net->dev_base_seq + br_mdb_rehash_seq to cb->seq.
In theory cb->seq could be wrapped to zero, but this is not
easy to fix, as net->dev_base_seq is not visible inside
br_mdb_rehash(). In practice, this is rare.
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Though the process of the ordered extents is a bit different with the delalloc inode
flush, but we can see it as a subset of the delalloc inode flush, so we also handle
them by flush workers.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The process of the ordered operations is similar to the delalloc inode flush, so
we handle them by flush workers.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
This patch introduce a new worker pool named "flush_workers", and if we
want to force all the inode with pending delalloc to the disks, we can
queue those inodes into the work queue of the worker pool, in this way,
those inodes will be flushed by multi-task.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Dave gave me an image of a very full file system that would abort the
transaction because it ran out of space while committing the transaction.
This is because we would think there was plenty of room to create a snapshot
even though the global reserve was not full. This happens because we
calculate the global reserve size before we unpin any space, so after we
unpin the space we allow reservations to occur even though we haven't
reserved all of the space for our global reserve. Fix this by adding to the
global reserve while unpinning in order to make sure we always have enough
space to do our work. With this patch we no longer end up with an aborted
transaction, we return ENOSPC properly to the person trying to create the
snapshot. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
'disk_key' is not used at all.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The argument 'tree_mod_log' is not necessary since all of callers enable it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Since we don't use MOD_LOG_KEY_REMOVE_WHILE_MOVING to add nritems
during rewinding, we should insert a MOD_LOG_KEY_REMOVE operation first.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Key MOD_LOG_KEY_REMOVE_WHILE_MOVING means that we're doing memmove inside
an extent buffer node, and the node's number of items remains unchanged
(unless we are inserting a single pointer, but we have MOD_LOG_KEY_ADD for that).
So we don't need to increase node's number of items during rewinding,
otherwise we may get an node larger than leafsize and cause general protection
errors later.
Here is the details,
- If we do memory move for inserting a single pointer, we need to
add node's nritems by one, and we honor MOD_LOG_KEY_ADD for adding.
- If we do memory move for deleting a single pointer, we need to
decrease node's nritems by one, and we honor MOD_LOG_KEY_REMOVE for
deleting.
- If we do memory move for balance left/right, we need to decrease
node's nritems, and we honor MOD_LOG_KEY_REMOVE for balaning.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
When we find a bitmap free space entry, we may check the previous extent
entry covers the offset or not. But if we find this entry is also a bitmap
entry, we will continue to check the previous entry of the current one by
a while loop. It is unnecessary because it is impossible that the extent
entry which is in front of a bitmap entry can cover the offset of the entry
after that bitmap entry.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
Alex reported a problem where we were writing between chunks on a rbd
device. The thing is we do bio_add_page using logical offsets, but the
physical offset may be different. So when we map the bio now check to see
if the bio is still ok with the physical offset, and if it is not split the
bio up and redo the bio_add_page with the physical sector. This fixes the
problem for Alex and doesn't affect performance in the normal case. Thanks,
Reported-and-tested-by: Alex Elder <elder@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The comment is not coincident with the code. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
div_factor{_fine} has been implemented for two times, cleanup it.
And I move them into a independent file named math.h because they are
common math functions.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
|
|
The {read,write}s{b,w,l} operations are not defined by all
architectures and are being removed from the asm-generic/io.h
interface.
This patch replaces the usage of these string functions in the smc911x
accessors with io{read,write}{8,16,32}_rep calls instead.
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Matthew Leach <matthew@mattleach.net>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch removes the redundant occurences of simple_strto<foo>
Signed-off-by: Abhijit Pawar <abhi.c.pawar@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is a pure software device, and ok with live address change.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
__napi_gro_receive() is inlined from two call sites for no good reason.
Lets move the prep stuff in a function of its own, called only if/when
needed. This saves 300 bytes on x86 :
# size net/core/dev.o.after net/core/dev.o.before
text data bss dec hex filename
51968 1238 1040 54246 d3e6 net/core/dev.o.before
51664 1238 1040 53942 d2b6 net/core/dev.o.after
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Dmitry Kravkov reported packet drops for GRE packets since GRO support
was added.
There is a race in gro_cell_poll() because we call napi_complete()
without any synchronization with a concurrent gro_cells_receive()
Once bug was triggered, we queued packets but did not schedule NAPI
poll.
We can fix this issue using the spinlock protected the napi_skbs queue,
as we have to hold it to perform skb dequeue anyway.
As we open-code skb_dequeue(), we no longer need to mask IRQS, as both
producer and consumer run under BH context.
Bug added in commit c9e6bc644e (net: add gro_cells infrastructure)
Reported-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Using netdev_alloc_frag() instead of kmalloc() permits better GRO or
TCP coalescing behavior, as skb_gro_receive() doesn't have to fallback
to frag_list overhead.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dmitry Kravkov <dmitry@broadcom.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
Acked-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If we have a read oplock and set a read lock in it, we can't write to the
locked area - so, filemap_fdatawrite may fail with a no information for a
userspace application even if we request a write to non-locked area. Fix
this by populating the page cache without marking affected pages dirty
after a successful write directly to the server.
Also remove CONFIG_CIFS_SMB2 ifdefs because it's suitable for both CIFS
and SMB2 protocols.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
This should fix a regression that was introduced when the new mount
option parser went in. Also, when the unc= and prefixpath= options
are provided, check their values against the ones we parsed from
the device string. If they differ, then throw a warning that tells
the user that we're using the values from the unc= option for now,
but that that will change in 3.10.
Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
Currently the code takes care to ensure that the prefixpath has a
leading '/' delimiter. What if someone passes us a prefixpath with a
leading '\\' instead? The code doesn't properly handle that currently
AFAICS.
Let's just change the code to skip over any leading delimiter character
when copying the prepath. Then, fix up the users of the prepath option
to prefix it with the correct delimiter when they use it.
Also, there's no need to limit the length of the prefixpath to 1k. If
the server can handle it, why bother forbidding it?
Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
Make sure we free any existing memory allocated for vol->UNC, just in
case someone passes in multiple unc= options.
Get rid of the check for too long a UNC. The check for >300 bytes seems
arbitrary. We later copy this into the tcon->treeName, for instance and
it's a lot shorter than 300 bytes.
Eliminate an extra kmalloc and copy as well. Just set the vol->UNC
directly with the contents of match_strdup.
Establish that the UNC should be stored with '\\' delimiters. Use
convert_delimiter to change it in place in the vol->UNC.
Finally, move the check for a malformed UNC into
cifs_parse_mount_options so we can catch that situation earlier.
Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
The authority fields are supposed to be represented by a single 48-bit
value. It's also supposed to represent the value as hex if it's equal to
or greater than 2^32. This is documented in MS-DTYP, section 2.4.2.1.
Also, fix up the max string length to account for this fix.
Acked-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
|
|
The "struct device_node *" argument of of_parse_phandle_with_args() can
be const. Making this change makes it explicit that the function will
not modify a node.
Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
[grant.likely: Resolved conflict with previous patch modifying of_parse_phandle()]
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
|
|
If CONFIG_OF_I2C and CONFIG_OF_I2C_MODULE are undefined no declaration of
of_find_i2c_device_by_node and of_find_i2c_adapter_by_node will be
available. Add dummy inline functions to avoid compilation breakage.
Signed-off-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
|
|
If the server sends us a target that looks like an outlier, but
is lower than the existing target, then respect it anyway.
However defer actually updating the generation counter until
we get a target that doesn't look like an outlier.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
if ore_write() fails, we would unlock the pages of pcol, which is now
empty, rather than pcol_copy which owns the pages when ore_write() is
called. this means that no pages will actually be unlocked
(pcol.nr_pages == 0) and the writing process (more accurately, the
syncing process) will hang waiting for a writeback notification that
never comes.
moreover, if ore_write() fails, pcol_free() is called for pcol, whereas
pcol_copy is the object owning the ore_io_state, thus leaking the
ore_io_state.
[Boaz]
I have simplified Idan's original patch a bit, everything else still
holds
Signed-off-by: Idan Kedar <idank@tonian.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
|
|
Most (all) NFS4ERR_BADSLOT errors are due to the client failing to
respect the server's sr_highest_slotid limit. This mainly happens
due to reordered RPC requests.
The way to handle it is simply to drop the slot that we're using,
and retry using the new highest_slotid limits.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
rmap_walk_anon() and try_to_unmap_anon() appears to be too
careful about locking the anon vma: while it needs protection
against anon vma list modifications, it does not need exclusive
access to the list itself.
Transforming this exclusive lock to a read-locked rwsem removes
a global lock from the hot path of page-migration intense
threaded workloads which can cause pathological performance like
this:
96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch
|
--- perf_trace_sched_switch
__schedule
schedule
schedule_preempt_disabled
__mutex_lock_common.isra.6
__mutex_lock_slowpath
mutex_lock
|
|--50.61%-- rmap_walk
| move_to_new_page
| migrate_pages
| migrate_misplaced_page
| __do_numa_page.isra.69
| handle_pte_fault
| handle_mm_fault
| __do_page_fault
| do_page_fault
| page_fault
| __memset_sse2
| |
| --100.00%-- worker_thread
| |
| --100.00%-- start_thread
|
--49.39%-- page_lock_anon_vma
try_to_unmap_anon
try_to_unmap
migrate_pages
migrate_misplaced_page
__do_numa_page.isra.69
handle_pte_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault
__memset_sse2
|
--100.00%-- worker_thread
start_thread
With this change applied the profile is now nicely flat
and there's no anon-vma related scheduling/blocking.
Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
to make it clearer that it's an exclusive write-lock in
that case - suggested by Rik van Riel.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Convert the struct anon_vma::mutex to an rwsem, which will help
in solving a page-migration scalability problem. (Addressed in
a separate patch.)
The conversion is simple and straightforward: in every case
where we mutex_lock()ed we'll now down_write().
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
If there is excessive migration due to NUMA balancing it gets rate
limited. It does this by counting the number of pages it has migrated
recently but counts a transhuge page as 1 page. Account for it properly.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Subject says it all. Allocation failures and a failure to isolate should
be accounted as a migration failure. This is partially another
difference between base page and transhuge page migration. A base page
migration makes multiple attempts for these conditions before it would
be accounted for as a failure.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
build fix
Commit "Add THP migration for the NUMA working set scanning fault case"
breaks the build because HPAGE_PMD_SHIFT and HPAGE_PMD_MASK defined to
explode without CONFIG_TRANSPARENT_HUGEPAGE:
mm/migrate.c: In function 'migrate_misplaced_transhuge_page_put':
mm/migrate.c:1549: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1564: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1566: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1573: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1606: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
mm/migrate.c:1648: error: call to '__build_bug_failed' declared with attribute error: BUILD_BUG failed
CONFIG_NUMA_BALANCING allows compilation without enabling transparent
hugepages, so define the dummy function for such a configuration and only
define migrate_misplaced_transhuge_page_put() when transparent hugepages
are enabled.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Note: This is very heavily based on a patch from Peter Zijlstra with
fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch
put a lot of migration logic into mm/huge_memory.c where it does
not belong. This version puts tries to share some of the migration
logic with migrate_misplaced_page. However, it should be noted
that now migrate.c is doing more with the pagetable manipulation
than is preferred. The end result is barely recognisable so as
before, the signed-offs had to be removed but will be re-added if
the original authors are ok with it.
Add THP migration for the NUMA working set scanning fault case.
It uses the page lock to serialize. No migration pte dance is
necessary because the pte is already unmapped when we decide
to migrate.
[dhillf@gmail.com: Fix memory leak on isolation failure]
[dhillf@gmail.com: Fix transfer of last_nid information]
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Due to the fact that migrations are driven by the CPU a task is running
on there is no point tracking NUMA faults until one task runs on a new
node. This patch tracks the first node used by an address space. Until
it changes, PTE scanning is disabled and no NUMA hinting faults are
trapped. This should help workloads that are short-lived, do not care
about NUMA placement or have bound themselves to a single node.
This takes advantage of the logic in "mm: sched: numa: Implement slow
start for working set sampling" to delay when the checks are made. This
will take advantage of processes that set their CPU and node bindings
early in their lifetime. It will also potentially allow any initial load
balancing to take place.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
!SCHED_DEBUG
The "mm: sched: numa: Control enabling and disabling of NUMA balancing"
depends on scheduling debug being enabled but it's perfectly legimate to
disable automatic NUMA balancing even without this option. This should
take care of it.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
This patch adds Kconfig options and kernel parameters to allow the
enabling and disabling of automatic NUMA balancing. The existance
of such a switch was and is very important when debugging problems
related to transparent hugepages and we should have the same for
automatic NUMA placement.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
The PTE scanning rate and fault rates are two of the biggest sources of
system CPU overhead with automatic NUMA placement. Ideally a proper policy
would detect if a workload was properly placed, schedule and adjust the
PTE scanning rate accordingly. We do not track the necessary information
to do that but we at least know if we migrated or not.
This patch scans slower if a page was not migrated as the result of a
NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
now higher than the previous default. Once every minute it will reset
the scanner in case of phase changes.
This is hilariously crude and the numbers are arbitrary. Workloads will
converge quite slowly in comparison to what a proper policy should be able
to do. On the plus side, we will chew up less CPU for workloads that have
no need for automatic balancing.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
unlikely task<->node relationships
Note: This two-stage filter was taken directly from the sched/numa patch
"sched, numa, mm: Add the scanning page fault machinery" but is
only a partial extraction. As the end result is not necessarily
recognisable, the signed-offs-by had to be removed. Will be added
back if requested.
While it is desirable that all threads in a process run on its home
node, this is not always possible or necessary. There may be more
threads than exist within the node or the node might over-subscribed
with unrelated processes.
This can cause a situation whereby a page gets migrated off its home
node because the threads clearing pte_numa were running off-node. This
patch uses page->last_nid to build a two-stage filter before pages get
migrated to avoid problems with short or unlikely task<->node
relationships.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Pass last_nid from misplaced page to newly allocated migration target page.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Pass last_nid from head page to tail page.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
This patch introduces a last_nid field to the page struct. This is used
to build a two-stage filter in the next patch that is aimed at
mitigating a problem whereby pages migrate to the wrong node when
referenced by a process that was running off its home node.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
Currently the rate of scanning for an address space is controlled
by the individual tasks. The next scan is simply determined by
2*p->numa_scan_period.
The 2*p->numa_scan_period is arbitrary and never changes. At this point
there is still no proper policy that decides if a task or process is
properly placed. It just scans and assumes the next NUMA fault will
place it properly. As it is assumed that pages will get properly placed
over time, increase the scan window each time a fault is incurred. This
is a big assumption as noted in the comments.
It should be noted that changing to p->numa_scan_period will increase
system CPU usage because now the scanning rate has effectively doubled.
If that is a problem then the min_rate should be made 200ms instead of
restoring the 2* logic.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
If there are a large number of NUMA hinting faults and all of them
are resulting in migrations it may indicate that memory is just
bouncing uselessly around. NUMA balancing cost is likely exceeding
any benefit from locality. Rate limit the PTE updates if the node
is migration rate-limited. As noted in the comments, this distorts
the NUMA faulting statistics.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
NOTE: This is very heavily based on similar logic in autonuma. It should
be signed off by Andrea but because there was no standalone
patch and it's sufficiently different from what he did that
the signed-off is omitted. Will be added back if requested.
If a large number of pages are misplaced then the memory bus can be
saturated just migrating pages between nodes. This patch rate-limits
the amount of memory that can be migrating between nodes.
Signed-off-by: Mel Gorman <mgorman@suse.de>
|
|
This defines the per-node data used by Migrate On Fault in order to
rate limit the migration. The rate limiting is applied independently
to each destination node.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
|