|
The irqsave variant of refcount_dec_and_lock() handles irqsave/restore when
taking/releasing the spin lock. With this variant, explicit calls to
local_irq_save()/local_irq_restore() are no longer required.
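For illustration, the conversion pattern looks roughly like this (the struct
and functions below are made up, not the code touched by this patch):
struct obj {
        refcount_t      ref;
        spinlock_t      lock;
};

/* before: open-coded irq handling around refcount_dec_and_lock() */
static void put_obj_old(struct obj *o)
{
        unsigned long flags;

        local_irq_save(flags);
        if (refcount_dec_and_lock(&o->ref, &o->lock)) {
                /* ... tear the object down ... */
                spin_unlock(&o->lock);
        }
        local_irq_restore(flags);
}

/* after: refcount_dec_and_lock_irqsave() does the irq dance internally */
static void put_obj_new(struct obj *o)
{
        unsigned long flags;

        if (refcount_dec_and_lock_irqsave(&o->ref, &o->lock, &flags)) {
                /* ... tear the object down ... */
                spin_unlock_irqrestore(&o->ref == o->ref ? &o->lock : &o->lock, flags);
        }
}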
[bigeasy@linutronix.de: s@atomic_dec_and_lock@refcount_dec_and_lock@g]
Link: http://lkml.kernel.org/r/20180703200141.28415-5-bigeasy@linutronix.de
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
refcount_t type and corresponding API should be used instead of atomic_t
when the variable is used as a reference counter. This permits avoiding
accidental refcounter overflows that might lead to use-after-free
situations.
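As a minimal sketch of what such a conversion looks like (hypothetical
object, not the code changed here):
/* atomic_t used as a reference counter */
atomic_set(&obj->count, 1);
atomic_inc(&obj->count);
if (atomic_dec_and_test(&obj->count))
        free_obj(obj);

/* refcount_t equivalent; it saturates instead of overflowing */
refcount_set(&obj->count, 1);
refcount_inc(&obj->count);
if (refcount_dec_and_test(&obj->count))
        free_obj(obj);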
Link: http://lkml.kernel.org/r/20180703200141.28415-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, percpu memory only exposes allocation and utilization
information via debugfs. This is mostly useful for understanding the
fragmentation and allocation information at a per-chunk level, with a few
global counters, and it is gated behind a config option.
BPF and cgroup, for example, have seen an increase in use, which in turn
increases the use of percpu memory. Let's make it easier for someone to
identify how much memory is being used.
This patch adds a "Percpu" stat to meminfo to make it easier to look up
how much percpu memory is in use. This number covers the cost of all
allocated backing pages, not just insight at the per-unit or per-chunk
level. Metadata is excluded. I think excluding metadata is fair because
the backing memory scales with the number of cpus and can quickly
outweigh the metadata. It also keeps this calculation light.
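Roughly, the accounting amounts to the following (illustrative sketch; the
exact helper and variable names in the patch may differ):
/* mm/percpu.c: backing pages used by all percpu allocations */
unsigned long pcpu_nr_pages(void)
{
        /* populated backing pages per unit times the number of units */
        return pcpu_nr_populated * pcpu_nr_units;
}

/* fs/proc/meminfo.c */
show_val_kb(m, "Percpu: ", pcpu_nr_pages());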
Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
For some workloads an intervention from the OOM killer can be painful.
Killing a random task can bring the workload into an inconsistent state.
Historically, there are two common solutions for this
problem:
1) enabling panic_on_oom,
2) using a userspace daemon to monitor OOMs and kill
all outstanding processes.
Both approaches have their downsides: rebooting on each OOM is an obvious
waste of capacity, and handling it all in userspace is tricky and requires
an agent that monitors every cgroup for OOMs.
In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
the necessity of enabling panic_on_oom. Also, it can simplify the cgroup
management for userspace applications.
This commit introduces a new knob for cgroup v2 memory controller:
memory.oom.group. The knob determines whether the cgroup should be
treated as an indivisible workload by the OOM killer. If set, all tasks
belonging to the cgroup or to its descendants (if the memory cgroup is not
a leaf cgroup) are killed together or not at all.
To determine which cgroup has to be killed, we traverse the cgroup
hierarchy from the victim task's cgroup up to the OOMing cgroup (or root),
looking for the highest-level cgroup with memory.oom.group set.
Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
an exception and are never killed.
This patch doesn't change the OOM victim selection algorithm.
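The traversal described above can be sketched like this (illustrative
helper; field and function names are simplified, locking omitted):
static struct mem_cgroup *oom_group_of(struct mem_cgroup *victim_memcg,
                                       struct mem_cgroup *oom_domain)
{
        struct mem_cgroup *group = NULL;
        struct mem_cgroup *memcg;

        /* walk upwards, remembering the highest-level memcg with
         * memory.oom.group set, stopping at the OOMing cgroup (or root) */
        for (memcg = victim_memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
                if (memcg->oom_group)
                        group = memcg;
                if (memcg == oom_domain)
                        break;
        }
        return group;
}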
Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "introduce memory.oom.group", v2.
This is a tiny implementation of cgroup-aware OOM killer, which adds an
ability to kill a cgroup as a single unit and so guarantee the integrity
of the workload.
Although it has only limited functionality in comparison to what now
resides in the mm tree (it doesn't change the victim task selection
algorithm, doesn't look at memory stats at the cgroup level, etc), it's
also much simpler and more straightforward. So, hopefully, we can avoid
having long debates here, as we had with the full implementation.
As it doesn't prevent any further development, and implements a useful and
complete feature, it looks like a sane way forward.
This patch (of 2):
oom_kill_process() consists of two logical parts: the first one is
responsible for considering task's children as a potential victim and
printing the debug information. The second half is responsible for
sending SIGKILL to all tasks sharing the mm struct with the given victim.
This commit splits oom_kill_process() with the intention to re-use the
second half: __oom_kill_process().
The cgroup-aware OOM killer will kill multiple tasks belonging to the
victim cgroup. We don't need to print the debug information for each
task, nor play with task selection (considering the task's children),
so we can't use the existing oom_kill_process().
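Schematically, the split looks roughly like this (heavily simplified
sketch, not the actual diff):
/* second half: kill the selected victim and all tasks sharing its mm */
static void __oom_kill_process(struct task_struct *victim)
{
        /* mark the victim, send SIGKILL, wake the oom reaper, then
         * SIGKILL every other task sharing victim->mm */
}

/* first half: task selection games and debug output */
static void oom_kill_process(struct oom_control *oc, const char *message)
{
        struct task_struct *victim = oc->chosen;

        /* possibly hand the kill over to one of the victim's children,
         * then dump the debug information (dump_header(), printk(message)) */
        __oom_kill_process(victim);
}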
Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com
Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, whenever a new node is created/re-used from the memhotplug
path, we call free_area_init_node()->free_area_init_core(). But there is
some code that we do not really need to run when we are coming from such
path.
free_area_init_core() performs the following actions:
1) Initializes pgdat internals, such as spinlock, waitqueues and more.
2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
when creating hash tables.
3) Account number of managed_pages per zone, subtracting dma_reserved and
memmap pages.
4) Initializes some fields of the zone structure data
5) Calls init_currently_empty_zone to initialize all the freelists
6) Calls memmap_init to initialize all pages belonging to certain zone
When called from memhotplug path, free_area_init_core() only performs
actions #1 and #4.
Action #2 is pointless as the zones do not have any pages: either the
node was freed, or we are re-using it, and either way all zones belonging
to this node should have 0 pages. For the same reason, action #3 always
results in managed_pages being 0.
Action #5 and #6 are performed later on when onlining the pages:
online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
online_pages()->move_pfn_range_to_zone()->memmap_init_zone()
This patch does two things:
First, it moves the node/zone initialization to their own functions, which
allows us to create a small version of free_area_init_core, where we only
perform:
1) Initialization of pgdat internals, such as spinlock, waitqueues and more
4) Initialization of some fields of the zone structure data
These two functions are: pgdat_init_internals() and zone_init_internals().
The second thing this patch does, is to introduce
free_area_init_core_hotplug(), the memhotplug version of
free_area_init_core():
Currently, we call free_area_init_node() from the memhotplug path. In
there, we set some of pgdat's fields and call calculate_node_totalpages().
calculate_node_totalpages() calculates the # of pages the node has.
Since the node is either new, or we are re-using it, the zones belonging
to this node should not have any pages, so there is no point in
calculating this now.
Actually, we re-set these values to 0 later on with the calls to:
reset_node_managed_pages()
reset_node_present_pages()
The # of pages per node and the # of pages per zone will be calculated when
onlining the pages:
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()
Also, since free_area_init_core/free_area_init_node will now only get
called during early init, let us replace __paginginit with __init, so
their code gets freed up.
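Put together, the hotplug-only variant reduces to roughly the following
shape (simplified sketch; the real signatures in the patch may differ
slightly):
/* memhotplug path: only actions #1 and #4 from the list above */
void __ref free_area_init_core_hotplug(int nid)
{
        pg_data_t *pgdat = NODE_DATA(nid);
        enum zone_type z;

        pgdat_init_internals(pgdat);            /* action #1 */
        for (z = 0; z < MAX_NR_ZONES; z++)      /* action #4 */
                zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
}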
[osalvador@techadventures.net: fix section usage]
Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
[osalvador@suse.de: v6]
Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Let us move the code enclosed by CONFIG_DEFERRED_STRUCT_PAGE_INIT ifdefs
into an inline function. Not having an ifdef in the function makes the
code more readable.
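The pattern is the usual one of replacing an in-function #ifdef with static
inline helpers; a sketch (the helper name is illustrative):
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
{
        pgdat->first_deferred_pfn = ULONG_MAX;
}
#else
static inline void pgdat_set_deferred_range(pg_data_t *pgdat) { }
#endif
The caller then simply calls pgdat_set_deferred_range(pgdat) with no #ifdef
in sight.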
Link: http://lkml.kernel.org/r/20180730101757.28058-4-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
__paginginit is the same thing as __meminit, except on platforms without
sparsemem, where it is defined as __init.
Remove __paginginit and use __meminit. Use __ref in the single function
that merges __meminit and __init sections: setup_usemap().
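For reference, the definition being removed is essentially the following
(as described above; the actual condition in the header may be spelled
slightly differently):
#ifdef CONFIG_SPARSEMEM
#define __paginginit __meminit
#else
#define __paginginit __init
#endif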
Link: http://lkml.kernel.org/r/20180801122348.21588-4-osalvador@techadventures.net
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to
have inline functions to access this field in order to avoid ifdefs in .c
files.
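A sketch of such accessors (along the lines of zone_to_nid()/zone_set_nid();
treat the exact names and placement as illustrative):
#ifdef CONFIG_NUMA
static inline int zone_to_nid(struct zone *zone)
{
        return zone->node;
}

static inline void zone_set_nid(struct zone *zone, int nid)
{
        zone->node = nid;
}
#else
static inline int zone_to_nid(struct zone *zone)
{
        return 0;
}

static inline void zone_set_nid(struct zone *zone, int nid) { }
#endif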
Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "Refactor free_area_init_core and add
free_area_init_core_hotplug", v6.
This patchset does three things:
1) Clean up/refactor free_area_init_core/free_area_init_node
by moving the ifdefery out of the functions.
2) Move the pgdat/zone initialization in free_area_init_core to its
own function.
3) Introduce free_area_init_core_hotplug, a small subset of
free_area_init_core, which is only called from the memhotplug code path. In this
way, we have:
free_area_init_core: called during early initialization
free_area_init_core_hotplug: called whenever a new node is allocated/re-used (memhotplug path)
This patch (of 5):
Moving the #ifdefs out of the function makes it easier to follow.
Link: http://lkml.kernel.org/r/20180730101757.28058-2-osalvador@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently cgroup-v1's memcg_stat_show traverses the memcg tree ~17 times
to collect the stats while cgroup-v2's memory_stat_show traverses the
memcg tree thrice. On a large machine, a couple thousand memcgs is very
normal, and if the churn is high and memcgs stick around for various
reasons, tens of thousands of nodes can exist in the memcg tree. This
patch refactors and shares the stat collection code between cgroup-v1 and
cgroup-v2 and reduces the tree traversal to just one.
I ran a simple benchmark which reads the root_mem_cgroup's stat file
1000 times in the presence of 2500 memcgs on cgroup-v1. The results are:
Without the patch:
$ time ./read-root-stat-1000-times
real 0m1.663s
user 0m0.000s
sys 0m1.660s
With the patch:
$ time ./read-root-stat-1000-times
real 0m0.468s
user 0m0.000s
sys 0m0.467s
Link: http://lkml.kernel.org/r/20180724224635.143944-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Bruce Merry <bmerry@ska.ac.za>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
page_freeze_refs/page_unfreeze_refs have already been replaced by
page_ref_freeze/page_ref_unfreeze, but the comments still refer to the
old names.
Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The Kconfig text for CONFIG_PAGE_POISONING doesn't mention that it has to
be enabled explicitly. This updates the documentation for that and adds a
note about CONFIG_PAGE_POISONING to the "page_poison" command line docs.
While here, change description of CONFIG_PAGE_POISONING_ZERO too, as it's
not "random" data, but rather the fixed debugging value that would be used
when not zeroing. Additionally removes a stray "bool" in the Kconfig.
Link: http://lkml.kernel.org/r/20180725223832.GA43733@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Rather than in vm_area_alloc(). To ensure that the various oddball
stack-based vmas are in a good state. Some of the callers were zeroing
them out, others were not.
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The kernel-doc for mempool_init function is missing the description of the
pool parameter. Add it.
Link: http://lkml.kernel.org/r/1532336274-26228-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Andrew has noticed some inconsistencies in oom_reap_task_mm. Notably
- Undocumented return value.
- comment "failed to reap part..." is misleading - sounds like it's
referring to something which happened in the past, is in fact
referring to something which might happen in the future.
- fails to call trace_finish_task_reaping() in one case
- code duplication.
- Increases mmap_sem hold time a little by moving
trace_finish_task_reaping() inside the locked region. So sue me ;)
- Sharing the finish: path means that the trace event won't
distinguish between the two sources of finishing.
Add a short explanation for the return value and fix the rest by
reorganizing the function a bit to have unified function exit paths.
Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The default page memory unit of OOM task dump events might not be
intuitive and potentially misleading for the non-initiated when debugging
OOM events: These are pages and not kBs. Add a small printk prior to the
task dump informing that the memory units are actually memory _pages_.
Also extends PID field to align on up to 7 characters.
Reference https://lkml.org/lkml/2018/7/3/1201
Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com
Signed-off-by: Rodrigo Freire <rfreire@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
oom_reaper used to rely on the oom_lock since e2fe14564d33 ("oom_reaper:
close race with exiting task"). We do not really need the lock anymore
though. 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run
concurrently") has removed serialization with the exit path based on the
mm reference count and so we do not really rely on the oom_lock anymore.
Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
to prevent races when the page allocator didn't manage to get the freed
(reaped) memory in __alloc_pages_may_oom but sees the flag later on and
moves on to another victim. Although this is possible in principle, let's
wait for it to actually happen in real life before we make the locking
more complex again.
Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and
oom_reap_task_mm). The reaper serializes with exit_mmap by mmap_sem +
MMF_OOM_SKIP flag. There is no synchronization with out_of_memory path
now.
[mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did]
Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start, and that is a problem for the
oom_reaper because it needs to guarantee forward progress, so it cannot
depend on any sleepable locks.
Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.
We can do much better though. Even if mmu notifiers use sleepable locks,
there is no reason to automatically assume those locks are held. Moreover,
the majority of notifiers only care about a portion of the address space,
and there is absolutely zero reason to fail when we are unmapping an
unrelated range. Some notifiers do really block and wait for HW, which is
harder to handle, and there we have to bail out.
This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false. This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continuing as long as we do not block down the call chain.
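The resulting pattern in a callback looks roughly like this (illustrative
driver context, not any specific notifier):
static int foo_invalidate_range_start(struct mmu_notifier *mn,
                                      struct mm_struct *mm,
                                      unsigned long start,
                                      unsigned long end,
                                      bool blockable)
{
        struct foo_notifier *foo = container_of(mn, struct foo_notifier, mn);

        if (blockable)
                mutex_lock(&foo->lock);
        else if (!mutex_trylock(&foo->lock))
                return -EAGAIN; /* let the oom_reaper back off and retry */

        /* ... invalidate mappings overlapping [start, end) ... */

        mutex_unlock(&foo->lock);
        return 0;
}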
I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that. The first
part can be done without a sleeping lock in most cases AFAICS.
The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.
The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom. This can be done e.g. after the test faults in all
the mmu notifier managed memory and sets the hard limit to something really
small. Then we are looking for a proper process tear down.
[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
Reported-by: David Rientjes <rientjes@google.com>
Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In this patch, locking-related code is shared between the huge/normal code
paths in put_swap_page() to reduce code duplication. The `free_entries == 0`
case is merged into the more general `free_entries != SWAPFILE_CLUSTER`
case, because the new locking method makes it easy.
The number of added lines is the same as the number of removed lines, but
the code size increases when CONFIG_TRANSPARENT_HUGEPAGE=n.
text data bss dec hex filename
base: 24123 2004 340 26467 6763 mm/swapfile.o
unified: 24485 2004 340 26829 68cd mm/swapfile.o
Digging one step deeper with `size -A mm/swapfile.o` for the base and
unified kernels and comparing the results yields:
-.text 17723 0
+.text 17835 0
-.orc_unwind_ip 1380 0
+.orc_unwind_ip 1480 0
-.orc_unwind 2070 0
+.orc_unwind 2220 0
-Total 26686
+Total 27048
The total difference is the same. The text segment difference is much
smaller: 112. More of the difference comes from the ORC unwinder sections:
(1480 + 2220) - (1380 + 2070) = 250. If the frame pointer unwinder is
used, this costs nothing.
Link: http://lkml.kernel.org/r/20180720071845.17920-9-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The part of __swap_entry_free() that runs with the lock held is separated
into a new function, __swap_entry_free_locked(), because we want to reuse
that piece of code in some other places.
This is just mechanical code refactoring; there is no functional change in
this function.
Link: http://lkml.kernel.org/r/20180720071845.17920-8-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
As suggested by Matthew Wilcox, it is better to use "int entry_size"
instead of "bool cluster" as the parameter to specify whether to operate on
huge or normal swap entries, because this improves the flexibility to
support other swap entry sizes. And Dave Hansen thinks that this
improves code readability too.
So in this patch, the "bool cluster" parameter of get_swap_pages() is
replaced by "int entry_size".
And the nr_swap_entries() trick is used to reduce the binary size when
!CONFIG_TRANSPARENT_HUGEPAGE.
text data bss dec hex filename
base 24215 2028 340 26583 67d7 mm/swapfile.o
head 24123 2004 340 26467 6763 mm/swapfile.o
Link: http://lkml.kernel.org/r/20180720071845.17920-7-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In this patch, the normal/huge code path in put_swap_page() and several
helper functions are unified to avoid duplicated code, bugs, etc. and
make it easier to review the code.
The removed lines outnumber the added lines, and the binary size is kept
exactly the same when CONFIG_TRANSPARENT_HUGEPAGE=n.
Link: http://lkml.kernel.org/r/20180720071845.17920-6-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
As suggested by Dave, we should unify the code path for normal and huge
swap support if possible to avoid duplicated code, bugs, etc. and make
it easier to review code.
In this patch, the normal/huge code path in
swap_page_trans_huge_swapped() is unified; the numbers of added and removed
lines are the same, and the binary size is kept almost the same when
CONFIG_TRANSPARENT_HUGEPAGE=n.
text data bss dec hex filename
base: 24179 2028 340 26547 67b3 mm/swapfile.o
unified: 24215 2028 340 26583 67d7 mm/swapfile.o
Link: http://lkml.kernel.org/r/20180720071845.17920-5-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In swap_page_trans_huge_swapped(), to identify whether there's any page
table mapping for a 4k sized swap entry, "si->swap_map[i] !=
SWAP_HAS_CACHE" is used. This works correctly now, because all users of
the function will only call it after checking SWAP_HAS_CACHE. But as
pointed out by Daniel, it is better to use "swap_count(map[i])" here,
because it works for the "map[i] == 0" case too.
This also makes the implementation more consistent between normal and
huge swap entries.
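In other words, the check changes roughly as follows (sketch):
/* before: only correct because callers already checked SWAP_HAS_CACHE */
if (si->swap_map[i] != SWAP_HAS_CACHE)
        ret = true;

/* after: swap_count() masks out SWAP_HAS_CACHE, so a completely free
 * entry (map[i] == 0) is handled correctly as well */
if (swap_count(si->swap_map[i]))
        ret = true;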
Link: http://lkml.kernel.org/r/20180720071845.17920-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-and-reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In mm/swapfile.c, THP (Transparent Huge Page) swap specific code is
enclosed by #ifdef CONFIG_THP_SWAP/#endif to avoid code bloat when THP
isn't enabled. But #ifdef/#endif in a .c file hurts code readability, so
Dave suggested using IS_ENABLED(CONFIG_THP_SWAP) instead and letting the
compiler do the dirty job for us. This also has the potential to remove
some duplicated code. From the output of `size`:
text data bss dec hex filename
THP=y: 26269 2076 340 28685 700d mm/swapfile.o
ifdef/endif: 24115 2028 340 26483 6773 mm/swapfile.o
IS_ENABLED: 24179 2028 340 26547 67b3 mm/swapfile.o
The IS_ENABLED()-based solution works quite well, almost as well as
#ifdef/#endif. And from the diffstat, the removed lines outnumber the
added lines.
One #ifdef for split_swap_cluster() is kept, because it is a public
function with a stub implementation for CONFIG_THP_SWAP=n in swap.h.
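As a simplified example of the pattern (not a literal hunk from the patch;
the helper name is made up):
static inline unsigned int entry_nr_pages(bool huge)
{
        /* the huge branch is compiled out when CONFIG_THP_SWAP=n,
         * without an #ifdef in the function body */
        if (IS_ENABLED(CONFIG_THP_SWAP) && huge)
                return SWAPFILE_CLUSTER;
        return 1;
}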
Link: http://lkml.kernel.org/r/20180720071845.17920-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "swap: THP optimizing refactoring", v4.
Now the THP (Transparent Huge Page) swap optimization is implemented in
the following way:
#ifdef CONFIG_THP_SWAP
huge_function(...)
{
}
#else
normal_function(...)
{
}
#endif
general_function(...)
{
        if (huge)
                return huge_function(...);
        else
                return normal_function(...);
}
As pointed out by Dave Hansen, this will:
1. Create a new, wholly untested code path for huge pages
2. Create two places to patch bugs
3. Not reuse code when possible
This patchset addresses these problems by merging the huge/normal code
paths/functions where possible.
One concern is that this may cause the code size to grow when
!CONFIG_TRANSPARENT_HUGEPAGE. The data shows that most of the refactoring
causes only a slight code size increase.
This patch (of 8):
To improve code readability.
Link: http://lkml.kernel.org/r/20180720071845.17920-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is a sad BUG introduced in the patch adding SHRINKER_REGISTERING.
The shrinker_idr business is only for memcg-aware shrinkers. Only
shrinkers of that type have an id, and they must be finally installed via
idr_replace() in this function. For !memcg-aware shrinkers we never
initialize the shrinker->id field.
But shrinkers of all types are passed to idr_replace(), so every
!memcg-aware shrinker with a random ID (most probably, its id is 0)
replaces the memcg-aware shrinker pointed to by that ID in the IDR.
This patch fixes the problem.
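The fix boils down to guarding the idr_replace() call on the shrinker type,
roughly (sketch, not the literal hunk):
/* only memcg-aware shrinkers have a valid ->id in shrinker_idr */
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
        idr_replace(&shrinker_idr, shrinker, shrinker->id);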
Link: http://lkml.kernel.org/r/8ff8a793-8211-713a-4ed9-d6e52390c2fc@virtuozzo.com
Fixes: 7e010df53c80 "mm: use special value SHRINKER_REGISTERING instead of list_empty() check"
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reported-by: <syzbot+d5f648a1bfe15678786b@syzkaller.appspotmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <syzkaller-bugs@googlegroups.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Variables align_start and align_end are being assigned but are never
used hence they are redundant and can be removed.
Cleans up clang warnings:
warning: variable 'align_start' set but not used [-Wunused-but-set-variable]
warning: variable 'align_size' set but not used [-Wunused-but-set-variable]
Link: http://lkml.kernel.org/r/20180714161124.3923-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When perf profiling a wide variety of different workloads, it was found
that vmacache_find() had higher than expected cost: up to 0.08% of cpu
utilization in some cases. This was found to rival other core VM
functions such as alloc_pages_vma() with thp enabled and default
mempolicy, and the conditionals in __get_vma_policy().
VMACACHE_HASH() determines which of the four per-task_struct slots a vma
is cached for a particular address. This currently depends on the pfn,
so pfn 5212 occupies a different vmacache slot than its neighboring pfn
5213.
vmacache_find() iterates through all four of current's vmacache slots
when looking up an address. Hashing based on pfn, an address has
~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or
about 25%, *if* the vma is cached.
This patch hashes an address by its pmd instead of pte to optimize for
workloads with good spatial locality. This results in a higher
probability of vmas being cached in the first slot that is checked:
normally ~70% on the same workloads instead of 25%.
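The change is essentially a one-liner in the slot hash (sketch of the
before/after):
/* before: neighboring pages hash to different slots */
#define VMACACHE_HASH(addr) ((addr >> PAGE_SHIFT) & VMACACHE_MASK)

/* after: hash on the pmd, so addresses with spatial locality share a slot */
#define VMACACHE_HASH(addr) ((addr >> PMD_SHIFT) & VMACACHE_MASK)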
[rientjes@google.com: various updates]
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807231532290.109445@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807091749150.114630@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Provide list_lru_shrink_walk_irq() and let it behave like
list_lru_walk_one() except that it locks the spinlock with
spin_lock_irq(). This is used by scan_shadow_nodes() because its lock
nests within the i_pages lock, which is acquired with interrupts disabled.
This change allows the use of proper locking primitives instead of a
hand-crafted local_irq_disable() plus spin_lock().
There is no EXPORT_SYMBOL provided because the current user is in-kernel
only.
Add list_lru_shrink_walk_irq() which acquires the spinlock with the
proper locking primitives.
Link: http://lkml.kernel.org/r/20180716111921.5365-5-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
__list_lru_walk_one() is invoked with struct list_lru *lru, int nid as
the first two arguments. Those two are only used to retrieve struct
list_lru_node. Since this is already done by the caller of the function
for the locking, we can pass struct list_lru_node* directly and avoid
the dance around it.
Link: http://lkml.kernel.org/r/20180716111921.5365-4-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Move the locking inside __list_lru_walk_one() to its caller. This is a
preparation step in order to introduce list_lru_walk_one_irq() which
does spin_lock_irq() instead of spin_lock() for the locking.
Link: http://lkml.kernel.org/r/20180716111921.5365-3-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "mm/list_lru: Add list_lru_shrink_walk_irq() and a user".
This series removes the local_irq_disable() around
list_lru_shrink_walk() (as used by mm/workingset) by adding
list_lru_shrink_walk_irq().
Vladimir Davydov preferred this over `irq' argument which I added to
struct list_lru.
The initial post (of this series) received a Reviewed-by tag by Vladimir
Davydov which I added to each patch of the series. The series applies
on top of akpm's tree which has Kirill's shrink_slab series and does not
clash with it (akpm asked me to wait a week or so and repost it then).
I tested the code paths by triggering the OOM-killer via memory over
commit and lockdep did not complain (nor did I see any warnings).
This patch (of 4):
list_lru_walk_node() invokes __list_lru_walk_one() with -1 as the
memcg_idx parameter. The same can be achieved by list_lru_walk_one() and
passing NULL as memcg argument which then gets converted into -1. This is
a preparation step for when the spin_lock() call is lifted to the caller
of __list_lru_walk_one(). Invoke list_lru_walk_one() instead of
__list_lru_walk_one() when possible.
Link: http://lkml.kernel.org/r/20180716111921.5365-2-bigeasy@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
CONFIG_THP_SWAP should depend on CONFIG_SWAP, because it's unreasonable
to optimize swapping for THP (Transparent Huge Page) without basic
swapping support.
In the original code, when CONFIG_SWAP=n and CONFIG_THP_SWAP=y,
split_swap_cluster() will not be built because it is in swapfile.c, but
it will be called from huge_memory.c. This doesn't trigger a build error
in practice because the call site is enclosed by PageSwapCache(), which
is defined to be constant 0 when CONFIG_SWAP=n. But this is fragile and
should be fixed.
The comments are fixed too to reflect the latest progress.
Link: http://lkml.kernel.org/r/20180713021228.439-1-ying.huang@intel.com
Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Rename new_sparse_init() to sparse_init(), which enables it. Delete the
old sparse_init() and all the code that became obsolete with it.
Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Tested-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
sparse_init() requires temporarily allocating two large buffers:
usemap_map and map_map. Baoquan He has identified that these buffers are
so large that Linux is not bootable on small-memory machines, such as a
kdump boot. The buffers are especially large when CONFIG_X86_5LEVEL is
set, as they are scaled to the maximum physical memory size.
Baoquan provided a fix, which reduces the sizes of these buffers, but it
is much better to get rid of them entirely.
Add a new way to initialize sparse memory: sparse_init_nid(), which only
operates within one memory node, and thus allocates memory either in one
large contiguous block or section by section. This eliminates the need
for the temporary buffers.
To simplify bisecting and review, the new interface is temporarily called
new_sparse_init(); it is going to be enabled, and the old code removed,
in the next patch.
Link: http://lkml.kernel.org/r/20180712203730.8703-5-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Now that both variants of sparse memory use the same buffers to populate
memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the
common place.
Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Tested-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Non-vmemmap sparse also allocates a large contiguous chunk of memory, and
if that fails, falls back to smaller allocations. Use the same functions
to allocate the buffer as vmemmap-sparse does.
Link: http://lkml.kernel.org/r/20180712203730.8703-3-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Patch series "sparse_init rewrite", v6.
In sparse_init() we allocate two large buffers to temporarily hold the
usemap and memmap for the whole machine. However, we can avoid doing that
if we change sparse_init() to operate on a per-node basis instead of doing
it for the whole machine beforehand.
As shown by Baoquan
http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com
the buffers are large enough to prevent small-memory systems from booting.
Another benefit of these changes is that they also obsolete
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.
This patch (of 5):
When struct pages are allocated for the sparse-vmemmap VA layout, we first
try to allocate one large buffer, and then, if that fails, allocate struct
pages for each section as we go.
The code that allocates the buffer uses global variables and is spread
across several call sites.
Clean up the code by introducing three functions to handle the global
buffer:
sparse_buffer_init() initializes the buffer
sparse_buffer_fini() frees the remaining part of the buffer
sparse_buffer_alloc() allocates from the buffer; if the buffer is empty,
it returns NULL
Define these functions in sparse.c instead of sparse-vmemmap.c because
later we will use them for non-vmemmap sparse allocations as well.
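Usage per node then follows this shape (sketch only; the function name is
illustrative and the fallback path is omitted):
static void __init sketch_populate_node(int nid, unsigned long nr_sections,
                                        unsigned long map_size)
{
        unsigned long i;
        void *map;

        /* one large per-node buffer, carved up section by section */
        sparse_buffer_init(map_size * nr_sections, nid);
        for (i = 0; i < nr_sections; i++) {
                map = sparse_buffer_alloc(map_size);
                if (!map) {
                        /* buffer exhausted: fall back to a per-section
                         * allocation (omitted in this sketch) */
                        continue;
                }
                /* ... set up this section's memmap using 'map' ... */
        }
        sparse_buffer_fini();   /* free the unused tail of the buffer */
}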
[akpm@linux-foundation.org: use PTR_ALIGN()]
[akpm@linux-foundation.org: s/BUG_ON/WARN_ON/]
Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When using 1GiB pages during early boot, use the new
memblock_virt_alloc_try_nid_raw() to allocate memory without zeroing it.
Zeroing out hundreds or thousands of GiB in a single core memset() call
is very slow, and can make early boot last upwards of 20-30 minutes on
multi TiB machines.
The memory does not need to be zero'd as the hugetlb pages are always
zero'd on page fault.
Tested: Booted with ~3800 1G pages, and it booted successfully in
roughly the same amount of time as with 0, as opposed to the 25+ minutes
it would take before.
Link: http://lkml.kernel.org/r/20180711213313.92481-1-cannonmatthews@google.com
Signed-off-by: Cannon Matthews <cannonmatthews@google.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
To improve the page allocator's performance for order-0 pages, each CPU
has a Per-CPU-Pageset (PCP) per zone. Whenever an order-0 page is needed,
the PCP will be checked first before asking Buddy for pages. When the PCP
is used up, a batch of pages will be fetched from Buddy to improve
performance, and the size of the batch can affect performance.
The zone's batch size was last doubled over ten years ago by commit
ba56e91c9401 ("mm: page_alloc: increase size of per-cpu-pages"). Since
then, CPUs have evolved a lot and their cache sizes have also increased.
Dave Hansen is concerned that the current batch size doesn't fit well with
modern hardware and suggested I do two things: first, use a page-allocator
intensive benchmark, e.g. will-it-scale/page_fault1, to find out how
performance changes with different batch sizes on various machines and
then choose a new default batch size; second, see how this new batch size
works with other workloads.
In the first test, we saw performance gains on high-core-count systems
and little to no effect on older systems with more modest core counts.
In this phase's test data, two candidates: 63 and 127 are chosen.
In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
and more will-it-scale sub-tests are tested to see how these two
candidates work with these workloads, and a new default is decided
according to the results.
Most test results are flat. will-it-scale/page_fault2 process mode has
10%-18% performance increase on 4-sockets Skylake and Broadwell.
vm-scalability/lru-file-mmap-read has 17%-47% performance increase for
4-sockets servers while for 2-sockets servers, it caused 3%-8% performance
drop. Further analysis showed that, with a larger pcp->batch and thus a
larger pcp->high (the relationship pcp->high = 6 * pcp->batch is
maintained in this patch), zone lock contention shifted to LRU add side
lock contention, and that caused the performance drop. This performance
drop might be mitigated by others' work on optimizing the LRU lock.
Another downside of increasing pcp->batch is that, when the PCP is used up
and a batch of pages needs to be fetched from Buddy, since the batch is
larger, that can take longer than before. My understanding is that this
doesn't affect the slowpath, where direct reclaim and compaction dominate.
For the fastpath, throughput is a win (according to
will-it-scale/page_fault1) but worst-case latency can be larger now.
Overall, I think doubling the batch size from 31 to 63 is relatively safe
and provides a good performance boost for high-core-count systems.
The two phases' test results are listed below (all tests are done with THP
disabled).
Phase one (will-it-scale/page_fault1) test results:
Skylake-EX: the increased batch size has a good effect on zone->lock
contention, though LRU contention rises at the same time and limits the
final performance increase.
batch score change zone_contention lru_contention total_contention
31 15345900 +0.00% 64% 8% 72%
53 17903847 +16.67% 32% 38% 70%
63 17992886 +17.25% 24% 45% 69%
73 18022825 +17.44% 10% 61% 71%
119 18023401 +17.45% 4% 66% 70%
127 18029012 +17.48% 3% 66% 69%
137 18036075 +17.53% 4% 66% 70%
165 18035964 +17.53% 2% 67% 69%
188 18101105 +17.95% 2% 67% 69%
223 18130951 +18.15% 2% 67% 69%
255 18118898 +18.07% 2% 67% 69%
267 18101559 +17.96% 2% 67% 69%
299 18160468 +18.34% 2% 68% 70%
320 18139845 +18.21% 2% 67% 69%
393 18160869 +18.34% 2% 68% 70%
424 18170999 +18.41% 2% 68% 70%
458 18144868 +18.24% 2% 68% 70%
467 18142366 +18.22% 2% 68% 70%
498 18154549 +18.30% 1% 68% 69%
511 18134525 +18.17% 1% 69% 70%
Broadwell-EX: similar pattern as Skylake-EX.
batch score change zone_contention lru_contention total_contention
31 16703983 +0.00% 67% 7% 74%
53 18195393 +8.93% 43% 28% 71%
63 18288885 +9.49% 38% 33% 71%
73 18344329 +9.82% 35% 37% 72%
119 18535529 +10.96% 24% 46% 70%
127 18513596 +10.83% 23% 48% 71%
137 18514327 +10.84% 23% 48% 71%
165 18511840 +10.82% 22% 49% 71%
188 18593478 +11.31% 17% 53% 70%
223 18601667 +11.36% 17% 52% 69%
255 18774825 +12.40% 12% 58% 70%
267 18754781 +12.28% 9% 60% 69%
299 18892265 +13.10% 7% 63% 70%
320 18873812 +12.99% 8% 62% 70%
393 18891174 +13.09% 6% 64% 70%
424 18975108 +13.60% 6% 64% 70%
458 18932364 +13.34% 8% 62% 70%
467 18960891 +13.51% 5% 65% 70%
498 18944526 +13.41% 5% 64% 69%
511 18960839 +13.51% 5% 64% 69%
Skylake-EP: although the increased batch reduced zone->lock contention,
the effect is not as good as on EX: zone->lock contention is still as
high as 20% with a very high batch value, instead of 1% on Skylake-EX or
5% on Broadwell-EX. Also, total_contention actually decreased with a
higher batch, but that doesn't translate into a performance increase.
batch score change zone_contention lru_contention total_contention
31 9554867 +0.00% 66% 3% 69%
53 9855486 +3.15% 63% 3% 66%
63 9980145 +4.45% 62% 4% 66%
73 10092774 +5.63% 62% 5% 67%
119 10310061 +7.90% 45% 19% 64%
127 10342019 +8.24% 42% 19% 61%
137 10358182 +8.41% 42% 21% 63%
165 10397060 +8.81% 37% 24% 61%
188 10341808 +8.24% 34% 26% 60%
223 10349135 +8.31% 31% 27% 58%
255 10327189 +8.08% 28% 29% 57%
267 10344204 +8.26% 27% 29% 56%
299 10325043 +8.06% 25% 30% 55%
320 10310325 +7.91% 25% 31% 56%
393 10293274 +7.73% 21% 31% 52%
424 10311099 +7.91% 21% 32% 53%
458 10321375 +8.02% 21% 32% 53%
467 10303881 +7.84% 21% 32% 53%
498 10332462 +8.14% 20% 33% 53%
511 10325016 +8.06% 20% 32% 52%
Broadwell-EP: zone->lock and lru lock had an agreement to make sure
performance doesn't increase and they successfully managed to keep total
contention at 70%.
batch score change zone_contention lru_contention total_contention
31 10121178 +0.00% 19% 50% 69%
53 10142366 +0.21% 6% 63% 69%
63 10117984 -0.03% 11% 58% 69%
73 10123330 +0.02% 7% 63% 70%
119 10108791 -0.12% 2% 67% 69%
127 10166074 +0.44% 3% 66% 69%
137 10141574 +0.20% 3% 66% 69%
165 10154499 +0.33% 2% 68% 70%
188 10124921 +0.04% 2% 67% 69%
223 10137399 +0.16% 2% 67% 69%
255 10143289 +0.22% 0% 68% 68%
267 10123535 +0.02% 1% 68% 69%
299 10140952 +0.20% 0% 68% 68%
320 10163170 +0.41% 0% 68% 68%
393 10000633 -1.19% 0% 69% 69%
424 10087998 -0.33% 0% 69% 69%
458 10187116 +0.65% 0% 69% 69%
467 10146790 +0.25% 0% 69% 69%
498 10197958 +0.76% 0% 69% 69%
511 10152326 +0.31% 0% 69% 69%
Haswell-EP: similar to Broadwell-EP.
batch score change zone_contention lru_contention total_contention
31 10442205 +0.00% 14% 48% 62%
53 10442255 +0.00% 5% 57% 62%
63 10452059 +0.09% 6% 57% 63%
73 10482349 +0.38% 5% 59% 64%
119 10454644 +0.12% 3% 60% 63%
127 10431514 -0.10% 3% 59% 62%
137 10423785 -0.18% 3% 60% 63%
165 10481216 +0.37% 2% 61% 63%
188 10448755 +0.06% 2% 61% 63%
223 10467144 +0.24% 2% 61% 63%
255 10480215 +0.36% 2% 61% 63%
267 10484279 +0.40% 2% 61% 63%
299 10466450 +0.23% 2% 61% 63%
320 10452578 +0.10% 2% 61% 63%
393 10499678 +0.55% 1% 62% 63%
424 10481454 +0.38% 1% 62% 63%
458 10473562 +0.30% 1% 62% 63%
467 10484269 +0.40% 0% 62% 62%
498 10505599 +0.61% 0% 62% 62%
511 10483395 +0.39% 0% 62% 62%
Westmere-EP: contention is pretty small, so not interesting. Note that
too high a batch value could hurt performance.
batch score change zone_contention lru_contention total_contention
31 4831523 +0.00% 2% 3% 5%
53 4834086 +0.05% 2% 4% 6%
63 4834262 +0.06% 2% 3% 5%
73 4832851 +0.03% 2% 4% 6%
119 4830534 -0.02% 1% 3% 4%
127 4827461 -0.08% 1% 4% 5%
137 4827459 -0.08% 1% 3% 4%
165 4820534 -0.23% 0% 4% 4%
188 4817947 -0.28% 0% 3% 3%
223 4809671 -0.45% 0% 3% 3%
255 4802463 -0.60% 0% 4% 4%
267 4801634 -0.62% 0% 3% 3%
299 4798047 -0.69% 0% 3% 3%
320 4793084 -0.80% 0% 3% 3%
393 4785877 -0.94% 0% 3% 3%
424 4782911 -1.01% 0% 3% 3%
458 4779346 -1.08% 0% 3% 3%
467 4780306 -1.06% 0% 3% 3%
498 4780589 -1.05% 0% 3% 3%
511 4773724 -1.20% 0% 3% 3%
Skylake-Desktop: similar to Westmere-EP, nothing interesting.
batch score change zone_contention lru_contention total_contention
31 3906608 +0.00% 2% 3% 5%
53 3940164 +0.86% 2% 3% 5%
63 3937289 +0.79% 2% 3% 5%
73 3940201 +0.86% 2% 3% 5%
119 3933240 +0.68% 2% 3% 5%
127 3930514 +0.61% 2% 4% 6%
137 3938639 +0.82% 0% 3% 3%
165 3908755 +0.05% 0% 3% 3%
188 3905621 -0.03% 0% 3% 3%
223 3903015 -0.09% 0% 4% 4%
255 3889480 -0.44% 0% 3% 3%
267 3891669 -0.38% 0% 4% 4%
299 3898728 -0.20% 0% 4% 4%
320 3894547 -0.31% 0% 4% 4%
393 3875137 -0.81% 0% 4% 4%
424 3874521 -0.82% 0% 3% 3%
458 3880432 -0.67% 0% 4% 4%
467 3888715 -0.46% 0% 3% 3%
498 3888633 -0.46% 0% 4% 4%
511 3875305 -0.80% 0% 5% 5%
Haswell-Desktop: zone->lock contention is pretty low, as on the other
desktops, though LRU contention is higher than on the other desktops.
batch score change zone_contention lru_contention total_contention
31 3511158 +0.00% 2% 5% 7%
53 3555445 +1.26% 2% 6% 8%
63 3561082 +1.42% 2% 6% 8%
73 3547218 +1.03% 2% 6% 8%
119 3571319 +1.71% 1% 7% 8%
127 3549375 +1.09% 0% 6% 6%
137 3560233 +1.40% 0% 6% 6%
165 3555176 +1.25% 2% 6% 8%
188 3551501 +1.15% 0% 8% 8%
223 3531462 +0.58% 0% 7% 7%
255 3570400 +1.69% 0% 7% 7%
267 3532235 +0.60% 1% 8% 9%
299 3562326 +1.46% 0% 6% 6%
320 3553569 +1.21% 0% 8% 8%
393 3539519 +0.81% 0% 7% 7%
424 3549271 +1.09% 0% 8% 8%
458 3528885 +0.50% 0% 8% 8%
467 3526554 +0.44% 0% 7% 7%
498 3525302 +0.40% 0% 9% 9%
511 3527556 +0.47% 0% 8% 8%
Sandybridge-Desktop: the 0% contention isn't accurate; it comes from the
dropped fractional part. Since the individual contention paths are all
under 1% here, arithmetic such as addition can make the final deviation
as large as 3% (e.g. three paths of 0.9% each would each show as 0% yet
actually sum to 2.7%).
batch score change zone_contention lru_contention total_contention
31 1744495 +0.00% 0% 0% 0%
53 1755341 +0.62% 0% 0% 0%
63 1758469 +0.80% 0% 0% 0%
73 1759626 +0.87% 0% 0% 0%
119 1770417 +1.49% 0% 0% 0%
127 1768252 +1.36% 0% 0% 0%
137 1767848 +1.34% 0% 0% 0%
165 1765088 +1.18% 0% 0% 0%
188 1766918 +1.29% 0% 0% 0%
223 1767866 +1.34% 0% 0% 0%
255 1768074 +1.35% 0% 0% 0%
267 1763187 +1.07% 0% 0% 0%
299 1765620 +1.21% 0% 0% 0%
320 1767603 +1.32% 0% 0% 0%
393 1764612 +1.15% 0% 0% 0%
424 1758476 +0.80% 0% 0% 0%
458 1758593 +0.81% 0% 0% 0%
467 1757915 +0.77% 0% 0% 0%
498 1753363 +0.51% 0% 0% 0%
511 1755548 +0.63% 0% 0% 0%
Phase two test results:
Note: all percent changes are against the base (batch=31).
ebizzy.throughput (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 2410037±7% 2600451±2% +7.9% 2602878 +8.0%
lkp-bdw-ex1 1493328 1489243 -0.3% 1492145 -0.1%
lkp-skl-2sp2 1329674 1345891 +1.2% 1351056 +1.6%
lkp-bdw-ep2 711511 711511 0.0% 710708 -0.1%
lkp-wsm-ep2 75750 75528 -0.3% 75441 -0.4%
lkp-skl-d01 264126 262791 -0.5% 264113 +0.0%
lkp-hsw-d01 176601 176328 -0.2% 176368 -0.1%
lkp-sb02 98937 98937 +0.0% 99030 +0.1%
kbuild.buildtime (less is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 107.00 107.67 +0.6% 107.11 +0.1%
lkp-bdw-ex1 97.33 97.33 +0.0% 97.42 +0.1%
lkp-skl-2sp2 180.00 179.83 -0.1% 179.83 -0.1%
lkp-bdw-ep2 178.17 179.17 +0.6% 177.50 -0.4%
lkp-wsm-ep2 737.00 738.00 +0.1% 738.00 +0.1%
lkp-skl-d01 642.00 653.00 +1.7% 653.00 +1.7%
lkp-hsw-d01 1310.00 1316.00 +0.5% 1311.00 +0.1%
netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 948790 947144 -0.2% 948333 -0.0%
lkp-bdw-ex1 904224 904366 +0.0% 904926 +0.1%
lkp-skl-2sp2 239731 239607 -0.1% 239565 -0.1%
lkp-bdw-ep2 365764 365933 +0.0% 365951 +0.1%
lkp-wsm-ep2 93736 93803 +0.1% 93808 +0.1%
lkp-skl-d01 77314 77303 -0.0% 77375 +0.1%
lkp-hsw-d01 58617 60387 +3.0% 60208 +2.7%
lkp-sb02 29990 30137 +0.5% 30103 +0.4%
oltp.transactions (higher is better)
machine batch=31 batch=63 batch=127
lkp-bdw-ex1 9073276 9100377 +0.3% 9036344 -0.4%
lkp-skl-2sp2 8898717 8852054 -0.5% 8894459 -0.0%
lkp-bdw-ep2 13426155 13384654 -0.3% 13333637 -0.7%
lkp-hsw-ep2 13146314 13232784 +0.7% 13193163 +0.4%
lkp-wsm-ep2 5035355 5019348 -0.3% 5033418 -0.0%
lkp-skl-d01 418485 4413339 -0.1% 4419039 +0.0%
lkp-hsw-d01 3517817±5% 3396120±3% -3.5% 3455138±3% -1.8%
pigz.throughput (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.513e+08 1.507e+08 -0.4% 1.511e+08 -0.2%
lkp-bdw-ex1 2.060e+08 2.052e+08 -0.4% 2.044e+08 -0.8%
lkp-skl-2sp2 8.836e+08 8.845e+08 +0.1% 8.836e+08 -0.0%
lkp-bdw-ep2 8.275e+08 8.464e+08 +2.3% 8.330e+08 +0.7%
lkp-wsm-ep2 2.224e+08 2.221e+08 -0.2% 2.218e+08 -0.3%
lkp-skl-d01 1.177e+08 1.177e+08 -0.0% 1.176e+08 -0.1%
lkp-hsw-d01 1.154e+08 1.154e+08 +0.1% 1.154e+08 -0.0%
lkp-sb02 0.633e+08 0.633e+08 +0.1% 0.633e+08 +0.0%
will-it-scale.malloc1.processes (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 620181 620484 +0.0% 620240 +0.0%
lkp-bdw-ex1 1403610 1401201 -0.2% 1417900 +1.0%
lkp-skl-2sp2 1288097 1284145 -0.3% 1283907 -0.3%
lkp-bdw-ep2 1427879 1427675 -0.0% 1428266 +0.0%
lkp-hsw-ep2 1362546 1353965 -0.6% 1354759 -0.6%
lkp-wsm-ep2 2099657 2107576 +0.4% 2100226 +0.0%
lkp-skl-d01 1476835 1476358 -0.0% 1474487 -0.2%
lkp-hsw-d01 1308810 1303429 -0.4% 1301299 -0.6%
lkp-sb02 589286 589284 -0.0% 588101 -0.2%
will-it-scale.malloc1.threads (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 21289 21125 -0.8% 21241 -0.2%
lkp-bdw-ex1 28114 28089 -0.1% 28007 -0.4%
lkp-skl-2sp2 91866 91946 +0.1% 92723 +0.9%
lkp-bdw-ep2 37637 37501 -0.4% 37317 -0.9%
lkp-hsw-ep2 43673 43590 -0.2% 43754 +0.2%
lkp-wsm-ep2 28577 28298 -1.0% 28545 -0.1%
lkp-skl-d01 175277 173343 -1.1% 173082 -1.3%
lkp-hsw-d01 130303 129566 -0.6% 129250 -0.8%
lkp-sb02 113742±3% 116911 +2.8% 116417±3% +2.4%
will-it-scale.malloc2.processes (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.206e+09 1.206e+09 -0.0% 1.206e+09 +0.0%
lkp-bdw-ex1 1.319e+09 1.319e+09 -0.0% 1.319e+09 +0.0%
lkp-skl-2sp2 8.000e+08 8.021e+08 +0.3% 7.995e+08 -0.1%
lkp-bdw-ep2 6.582e+08 6.634e+08 +0.8% 6.513e+08 -1.1%
lkp-hsw-ep2 6.671e+08 6.669e+08 -0.0% 6.665e+08 -0.1%
lkp-wsm-ep2 1.805e+08 1.806e+08 +0.0% 1.804e+08 -0.1%
lkp-skl-d01 1.611e+08 1.611e+08 -0.0% 1.610e+08 -0.0%
lkp-hsw-d01 1.333e+08 1.332e+08 -0.0% 1.332e+08 -0.0%
lkp-sb02 82485104 82478206 -0.0% 82473546 -0.0%
will-it-scale.malloc2.threads (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.574e+09 1.574e+09 -0.0% 1.574e+09 -0.0%
lkp-bdw-ex1 1.737e+09 1.737e+09 +0.0% 1.737e+09 -0.0%
lkp-skl-2sp2 9.161e+08 9.162e+08 +0.0% 9.181e+08 +0.2%
lkp-bdw-ep2 7.856e+08 8.015e+08 +2.0% 8.113e+08 +3.3%
lkp-hsw-ep2 6.908e+08 6.904e+08 -0.1% 6.907e+08 -0.0%
lkp-wsm-ep2 2.409e+08 2.409e+08 +0.0% 2.409e+08 -0.0%
lkp-skl-d01 1.199e+08 1.199e+08 -0.0% 1.199e+08 -0.0%
lkp-hsw-d01 1.029e+08 1.029e+08 -0.0% 1.029e+08 +0.0%
lkp-sb02 68081213 68061423 -0.0% 68076037 -0.0%
will-it-scale.page_fault2.processes (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 14509125±4% 16472364 +13.5% 17123117 +18.0%
lkp-bdw-ex1 14736381 16196588 +9.9% 16364011 +11.0%
lkp-skl-2sp2 6354925 6435444 +1.3% 6436644 +1.3%
lkp-bdw-ep2 8749584 8834422 +1.0% 8827179 +0.9%
lkp-hsw-ep2 8762591 8845920 +1.0% 8825697 +0.7%
lkp-wsm-ep2 3036083 3030428 -0.2% 3021741 -0.5%
lkp-skl-d01 2307834 2304731 -0.1% 2286142 -0.9%
lkp-hsw-d01 1806237 1800786 -0.3% 1795943 -0.6%
lkp-sb02 842616 837844 -0.6% 833921 -1.0%
will-it-scale.page_fault2.threads (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1623294 1615132±2% -0.5% 1656777 +2.1%
lkp-bdw-ex1 1995714 2025948 +1.5% 2113753±3% +5.9%
lkp-skl-2sp2 2346708 2415591 +2.9% 2416919 +3.0%
lkp-bdw-ep2 2342564 2344882 +0.1% 2300206 -1.8%
lkp-hsw-ep2 1820658 1831681 +0.6% 1844057 +1.3%
lkp-wsm-ep2 1725482 1733774 +0.5% 1740517 +0.9%
lkp-skl-d01 1832833 1823628 -0.5% 1806489 -1.4%
lkp-hsw-d01 1427913 1427287 -0.0% 1420226 -0.5%
lkp-sb02 750626 748615 -0.3% 746621 -0.5%
will-it-scale.page_fault3.processes (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 24382726 24400317 +0.1% 24668774 +1.2%
lkp-bdw-ex1 35399750 35683124 +0.8% 35829492 +1.2%
lkp-skl-2sp2 28136820 28068248 -0.2% 28147989 +0.0%
lkp-bdw-ep2 37269077 37459490 +0.5% 37373073 +0.3%
lkp-hsw-ep2 36224967 36114085 -0.3% 36104908 -0.3%
lkp-wsm-ep2 16820457 16911005 +0.5% 16968596 +0.9%
lkp-skl-d01 7721138 7725904 +0.1% 7756740 +0.5%
lkp-hsw-d01 7611979 7650928 +0.5% 7651323 +0.5%
lkp-sb02 3781546 3796502 +0.4% 3796827 +0.4%
will-it-scale.page_fault3.threads (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1865820±3% 1900917±2% +1.9% 1826245±4% -2.1%
lkp-bdw-ex1 3094060 3148326 +1.8% 3150036 +1.8%
lkp-skl-2sp2 3952940 3953898 +0.0% 3989360 +0.9%
lkp-bdw-ep2 3420373±3% 3643964 +6.5% 3644910±5% +6.6%
lkp-hsw-ep2 2609635±2% 2582310±3% -1.0% 2780459 +6.5%
lkp-wsm-ep2 4395001 4417196 +0.5% 4432499 +0.9%
lkp-skl-d01 5363977 5400003 +0.7% 5411370 +0.9%
lkp-hsw-d01 5274131 5311294 +0.7% 5319359 +0.9%
lkp-sb02 2917314 2913004 -0.1% 2935286 +0.6%
will-it-scale.read1.processes (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 73762279±14% 69322519±10% -6.0% 69349855±13% -6.0% (result unstable)
lkp-bdw-ex1 1.701e+08 1.704e+08 +0.1% 1.705e+08 +0.2%
lkp-skl-2sp2 63111570 63113953 +0.0% 63836573 +1.1%
lkp-bdw-ep2 79247409 79424610 +0.2% 78012656 -1.6%
lkp-hsw-ep2 67677026 68308800 +0.9% 67539106 -0.2%
lkp-wsm-ep2 13339630 13939817 +4.5% 13766865 +3.2%
lkp-skl-d01 10969487 10972650 +0.0% no data
lkp-hsw-d01 9857342±2% 10080592±2% +2.3% 10131560 +2.8%
lkp-sb02 5189076 5197473 +0.2% 5163253 -0.5%
will-it-scale.read1.threads (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 62468045±12% 73666726±7% +17.9% 79553123±12% +27.4% (result unstable)
lkp-bdw-ex1 1.62e+08 1.624e+08 +0.3% 1.614e+08 -0.3%
lkp-skl-2sp2 58319780 59181032 +1.5% 59821353 +2.6%
lkp-bdw-ep2 74057992 75698171 +2.2% 74990869 +1.3%
lkp-hsw-ep2 63672959 63639652 -0.1% 64387051 +1.1%
lkp-wsm-ep2 13489943 13526058 +0.3% 13259032 -1.7%
lkp-skl-d01 10297906 10338796 +0.4% 10407328 +1.1%
lkp-hsw-d01 9636721 9667376 +0.3% 9341147 -3.1%
lkp-sb02 4801938 4804496 +0.1% 4802290 +0.0%
will-it-scale.write1.processes (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.111e+08 1.104e+08±2% -0.7% 1.122e+08±2% +1.0%
lkp-bdw-ex1 1.392e+08 1.399e+08 +0.5% 1.397e+08 +0.4%
lkp-skl-2sp2 59369233 58994841 -0.6% 58715168 -1.1%
lkp-bdw-ep2 61820979 CPU throttle 63593123 +2.9%
lkp-hsw-ep2 57897587 57435605 -0.8% 56347450 -2.7%
lkp-wsm-ep2 7814203 7918017±2% +1.3% 7669068 -1.9%
lkp-skl-d01 8886557 8971422 +1.0% 8818366 -0.8%
lkp-hsw-d01 9171001±5% 9189915 +0.2% 9483909 +3.4%
lkp-sb02 4475406 4475294 -0.0% 4501756 +0.6%
will-it-scale.write1.threads (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.058e+08 1.055e+08±2% -0.2% 1.065e+08 +0.7%
lkp-bdw-ex1 1.316e+08 1.300e+08 -1.2% 1.308e+08 -0.6%
lkp-skl-2sp2 54492421 56086678 +2.9% 55975657 +2.7%
lkp-bdw-ep2 59360449 59003957 -0.6% 58101262 -2.1%
lkp-hsw-ep2 53346346±2% 52530876 -1.5% 52902487 -0.8%
lkp-wsm-ep2 7774006 7800092±2% +0.3% 7558833 -2.8%
lkp-skl-d01 8346174 8235695 -1.3% no data
lkp-hsw-d01 8636244 8655731 +0.2% 8658868 +0.3%
lkp-sb02 4181820 4204107 +0.5% 4182992 +0.0%
vm-scalability.anon-r-rand.throughput (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 11933873±3% 12356544±2% +3.5% 12188624 +2.1%
lkp-bdw-ex1 7114424±2% 7330949±2% +3.0% 7392419 +3.9%
lkp-skl-2sp2 6773277±5% 6492332±8% -4.1% 6543962 -3.4%
lkp-bdw-ep2 7133846±4% 7233508 +1.4% 7013518±3% -1.7%
lkp-hsw-ep2 4576626 4527098 -1.1% 4551679 -0.5%
lkp-wsm-ep2 2583599 2592492 +0.3% 2588039 +0.2%
lkp-hsw-d01 998199±2% 1028311 +3.0% 1006460±2% +0.8%
lkp-sb02 570572 567854 -0.5% 568449 -0.4%
vm-scalability.anon-r-rand-mt.throughput (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1789419 1787830 -0.1% 1788208 -0.1%
lkp-bdw-ex1 3492595±2% 3554966±2% +1.8% 3558835±3% +1.9%
lkp-skl-2sp2 3856238±2% 3975403±4% +3.1% 3994600 +3.6%
lkp-bdw-ep2 3726963±11% 3809292±6% +2.2% 3871924±4% +3.9%
lkp-hsw-ep2 2131760±3% 2033578±4% -4.6% 2130727±6% -0.0%
lkp-wsm-ep2 2369731 2368384 -0.1% 2370252 +0.0%
lkp-skl-d01 1207128 1206220 -0.1% 1205801 -0.1%
lkp-hsw-d01 964317 992329±2% +2.9% 992099±2% +2.9%
lkp-sb02 567137 567346 +0.0% 566144 -0.2%
vm-scalability.lru-file-mmap-read.throughput (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 19560469±6% 23018999 +17.7% 23418800 +19.7%
lkp-bdw-ex1 17769135±14% 26141676±3% +47.1% 26284723±5% +47.9%
lkp-skl-2sp2 14056512 13578884 -3.4% 13146214 -6.5%
lkp-bdw-ep2 15336542 14737654 -3.9% 14088159 -8.1%
lkp-hsw-ep2 16275498 15756296 -3.2% 15018090 -7.7%
lkp-wsm-ep2 11272160 11237231 -0.3% 11310047 +0.3%
lkp-skl-d01 7322119 7324569 +0.0% 7184148 -1.9%
lkp-hsw-d01 6449234 6404542 -0.7% 6356141 -1.4%
lkp-sb02 3517943 3520668 +0.1% 3527309 +0.3%
vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1689052 1697553 +0.5% 1698726 +0.6%
lkp-bdw-ex1 1675246 1699764 +1.5% 1712226 +2.2%
lkp-skl-2sp2 1800533 1799749 -0.0% 1800581 +0.0%
lkp-bdw-ep2 1807422 1807758 +0.0% 1804932 -0.1%
lkp-hsw-ep2 1809807 1808781 -0.1% 1807811 -0.1%
lkp-wsm-ep2 1800198 1802434 +0.1% 1801236 +0.1%
lkp-skl-d01 696689 695537 -0.2% 694106 -0.4%
lkp-hsw-d01 698364 698666 +0.0% 696686 -0.2%
lkp-sb02 258939 258787 -0.1% 258199 -0.3%
Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add comments describing oom_lock's scope.
Requested-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/20180711120121.25635-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts ee8f248d266e ("hugetlb: add phys addr to struct
huge_bootmem_page").
At one time powerpc used this field and its supporting code. However,
that was removed by commit 79cc38ded1e1 ("powerpc/mm/hugetlb: Add support
for reserving gigantic huge pages via kernel command line").
There are no remaining users of this field or its supporting code, so
remove them.
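For reference, a sketch of the structure as I understand it after the
revert (the reverted commit had added a phys_addr_t member guarded by
CONFIG_HIGHMEM); the exact layout here is recalled from
include/linux/hugetlb.h and should be treated as an assumption:

struct huge_bootmem_page {
    struct list_head list;
    struct hstate *hstate;
};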
Link: http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Cannon Matthews <cannonmatthews@google.com>
Cc: Becky Bruce <beckyb@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Tetsuo has pointed out that since 27ae357fa82b ("mm, oom: fix concurrent
munlock and oom reaper unmap, v3") we have a strong synchronization
between the oom_killer and victim's exiting because both have to take
the oom_lock. Therefore the original heuristic to sleep for a short
time in out_of_memory doesn't serve the original purpose.
Moreover, Tetsuo has noticed that the short sleep can be more harmful
than actually useful. Hammering the system with many processes can lead
to starvation when the task holding the oom_lock blocks for a long time
(minutes) and stalls any further progress, because the oom_reaper depends
on the oom_lock as well.
Drop the short sleep from out_of_memory when we hold the lock. Keep the
sleep when the trylock fails, to throttle the concurrent OOM paths a bit.
This should be solved in a more reasonable way (e.g. a sleep proportional
to the time spent in active reclaim etc.), but that is a much more
complex thing to achieve. This is a quick fixup to remove stale code.
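A minimal sketch of the resulting pattern; the wrapper function below is
purely illustrative (not the literal diff), it only shows the oom_lock
handling this patch is about:

/* illustrative only: throttle on trylock failure, no sleep under lock */
static bool hypothetical_oom_attempt(struct oom_control *oc)
{
    bool ret;

    if (!mutex_trylock(&oom_lock)) {
        /* another OOM path holds the lock: back off briefly */
        schedule_timeout_killable(1);
        return true;
    }

    ret = out_of_memory(oc);    /* no short sleep at the end anymore */
    mutex_unlock(&oom_lock);
    return ret;
}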
Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
cma_alloc() doesn't really support gfp flags other than __GFP_NOWARN, so
convert the gfp_mask parameter to a boolean no_warn parameter.
This will help to avoid giving the false impression that this function
supports standard gfp flags and that callers can pass __GFP_ZERO to get a
zeroed buffer, which has already been an issue: see commit dd65a941f6ba
("arm64: dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS
flag").
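The resulting prototype, as I understand the change, together with a
hypothetical caller for illustration:

struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
                       bool no_warn);

/* hypothetical caller: failures are expected here, so stay quiet */
static struct page *try_alloc_contig(struct cma *cma, size_t nr_pages)
{
    return cma_alloc(cma, nr_pages, 0, true);
}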
Link: http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.com
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michał Nazarewicz <mina86@mina86.com>
Acked-by: Laura Abbott <labbott@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There was a bug in Linux that could cause madvise (and mprotect?) system
calls to return to userspace without the TLB having been flushed for all
the pages involved.
This could happen when multiple threads of a process made simultaneous
madvise and/or mprotect calls.
This was noticed in the summer of 2017, at which time two solutions
were created:
56236a59556c ("mm: refactor TLB gathering API")
99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem")
and
4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range")
We need only one of these solutions, and the former appears to be a
little more efficient than the latter, so revert the latter.
This reverts 4647706ebeee6e50 ("mm: always flush VMA ranges affected by
zap_page_range").
Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In sparse_init(), two temporary pointer arrays, usemap_map and map_map,
are allocated with the size of NR_MEM_SECTIONS. They are used to store
each memory section's usemap and mem map if it is marked as present.
With the help of these two arrays, a continuous memory chunk is allocated
for the usemaps and memmaps of the memory sections on one node. This
avoids excessive memory fragmentation. In the diagram below, '1'
indicates a present memory section and '0' an absent one. The number 'n'
can be much smaller than NR_MEM_SECTIONS on most systems.
|1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0|
-------------------------------------------------------
0 1 2 3 4 5 i i+1 n-1 n
If we fail to populate the page tables to map one section's memmap, its
->section_mem_map will finally be cleared to indicate that it's not
present. After use, these two arrays are released at the end of
sparse_init().
In 4-level paging mode each array costs 4M, which is negligible. But in
5-level paging mode they cost 256M each, 512M altogether. A kdump kernel
usually reserves only very little memory, e.g. 256M. So, even though they
are only temporarily allocated, that is still not acceptable.
In fact, there's no need to allocate them with the size of
NR_MEM_SECTIONS. Since the ->section_mem_map clearing has been deferred
to the end, the number of present memory sections stays the same during
sparse_init() until we finally clear out the ->section_mem_map of any
memory section whose usemap or memmap was not correctly handled. Thus,
whenever the for_each_present_section_nr() loop is taken in between, the
i-th present memory section is always the same one.
So allocate usemap_map and map_map with the size of 'nr_present_sections'
only. For the i-th present memory section, install its usemap and memmap
into usemap_map[i] and map_map[i] during allocation. Then, in the last
for_each_present_section_nr() loop which clears the failed memory
sections' ->section_mem_map, fetch the usemap and memmap from the
usemap_map[] and map_map[] arrays and set them into mem_section[]
accordingly.
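A rough sketch of that last loop (names loosely follow mm/sparse.c; the
body is illustrative, not the literal patch):

/* the i-th *present* section uses slot i of the temporary arrays */
int idx = 0;
unsigned long pnum;

for_each_present_section_nr(0, pnum) {
    struct mem_section *ms = __nr_to_section(pnum);

    if (!usemap_map[idx] || !map_map[idx])
        ms->section_mem_map = 0;   /* populate failed: clear it now */
    else
        sparse_init_one_section(ms, pnum, map_map[idx], usemap_map[idx]);
    idx++;
}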
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20180628062857.29658-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oscar Salvador <osalvador@techadventures.net>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The new parameter is used to pass the size of a map data unit into
alloc_usemap_and_memmap(), and is preparation for the next patch.
Link: http://lkml.kernel.org/r/20180228032657.32385-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, the
system allocates one continuous memory chunk for the mem maps on one node
and populates the relevant page tables to map the memory sections one by
one. If populating a certain mem section fails, a warning is printed and
its ->section_mem_map is cleared to cancel its marking as present. In
this way, the number of mem sections marked as present can decrease
during sparse_init() execution.
Here, just defer the ms->section_mem_map clearing, when populating its
page tables fails, until the last for_each_present_section_nr() loop.
This is in preparation for later optimizing the mem map allocation.
[akpm@linux-foundation.org: remove now-unused local `ms', per Oscar]
Link: http://lkml.kernel.org/r/20180228032657.32385-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|