summaryrefslogtreecommitdiff
path: root/mm/percpu.c
AgeCommit message (Collapse)AuthorFilesLines
2015-06-25mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()Larry Finger1-1/+1
Beginning at commit d52d3997f843 ("ipv6: Create percpu rt6_info"), the following INFO splat is logged: =============================== [ INFO: suspicious RCU usage. ] 4.1.0-rc7-next-20150612 #1 Not tainted ------------------------------- kernel/sched/core.c:7318 Illegal context switch in RCU-bh read-side critical section! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 3 locks held by systemd/1: #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff815f0c8f>] rtnetlink_rcv+0x1f/0x40 #1: (rcu_read_lock_bh){......}, at: [<ffffffff816a34e2>] ipv6_add_addr+0x62/0x540 #2: (addrconf_hash_lock){+...+.}, at: [<ffffffff816a3604>] ipv6_add_addr+0x184/0x540 stack backtrace: CPU: 0 PID: 1 Comm: systemd Not tainted 4.1.0-rc7-next-20150612 #1 Hardware name: TOSHIBA TECRA A50-A/TECRA A50-A, BIOS Version 4.20 04/17/2014 Call Trace: dump_stack+0x4c/0x6e lockdep_rcu_suspicious+0xe7/0x120 ___might_sleep+0x1d5/0x1f0 __might_sleep+0x4d/0x90 kmem_cache_alloc+0x47/0x250 create_object+0x39/0x2e0 kmemleak_alloc_percpu+0x61/0xe0 pcpu_alloc+0x370/0x630 Additional backtrace lines are truncated. In addition, the above splat is followed by several "BUG: sleeping function called from invalid context at mm/slub.c:1268" outputs. As suggested by Martin KaFai Lau, these are the clue to the fix. Routine kmemleak_alloc_percpu() always uses GFP_KERNEL for its allocations, whereas it should follow the gfp from its callers. Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@vger.kernel.org> [3.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-24percpu: Fix trivial typos in commentsYannick Guerrini1-2/+2
Change 'tranlated' to 'translated' Change 'mutliples' to 'multiples' Signed-off-by: Yannick Guerrini <yguerrini@tomshardware.fr> Signed-off-by: Tejun Heo <tj@kernel.org>
2015-02-14percpu: use %*pb[l] to print bitmaps including cpumasks and nodemasksTejun Heo1-4/+2
printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-29percpu: off by one in BUG_ON()Dan Carpenter1-1/+1
The unit_map[] array has "nr_cpu_ids" number of elements. It's allocated a few lines earlier in the function. So this test should be >= instead of >. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-10-08percpu: fix how @gfp is interpreted by the percpu allocatorTejun Heo1-1/+1
When @gfp is specified, the percpu allocator is interested in whether it contains all of GFP_KERNEL or not. If it does, the normal allocation path is taken; otherwise, the atomic allocation path. Unfortunately, pcpu_alloc() was incorrectly testing for whether @gfp contains any part of GFP_KERNEL. Fix it by testing "(gfp & GFP_KERNEL) != GFP_KERNEL" instead of "!(gfp & GFP_KERNEL)" to decide whether the allocation should be atomic or not. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-22Revert "percpu: free percpu allocation info for uniprocessor system"Guenter Roeck1-2/+0
This reverts commit 3189eddbcafc ("percpu: free percpu allocation info for uniprocessor system"). The commit causes a hang with a crisv32 image. This may be an architecture problem, but at least for now the revert is necessary to be able to boot a crisv32 image. Cc: Tejun Heo <tj@kernel.org> Cc: Honggang Li <enjoymindful@gmail.com> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 3189eddbcafc ("percpu: free percpu allocation info for uniprocessor system") Cc: stable@vger.kernel.org # Please don't apply 3189eddbcafc
2014-09-09percpu: fix locking regression in the failure path of pcpu_alloc()Tejun Heo1-0/+1
While updating locking, b38d08f3181c ("percpu: restructure locking") broke pcpu_create_chunk() creation path in pcpu_alloc(). It returns without releasing pcpu_alloc_mutex. Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Julia Lawall <julia.lawall@lip6.fr>
2014-09-02percpu: implement asynchronous chunk populationTejun Heo1-4/+113
The percpu allocator now supports atomic allocations by only allocating from already populated areas but the mechanism to ensure that there's adequate amount of populated areas was missing. This patch expands pcpu_balance_work so that in addition to freeing excess free chunks it also populates chunks to maintain an adequate level of populated areas. pcpu_alloc() schedules pcpu_balance_work if the amount of free populated areas is too low or after an atomic allocation failure. * PERPCU_DYNAMIC_RESERVE is increased by two pages to account for PCPU_EMPTY_POP_PAGES_LOW. * pcpu_async_enabled is added to gate both async jobs - chunk->map_extend_work and pcpu_balance_work - so that we don't end up scheduling them while the needed subsystems aren't up yet. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: rename pcpu_reclaim_work to pcpu_balance_workTejun Heo1-15/+12
pcpu_reclaim_work will also be used to populate chunks asynchronously. Rename it to pcpu_balance_work in preparation. pcpu_reclaim() is renamed to pcpu_balance_workfn() and some of its local variables are renamed too. This is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populatedTejun Heo1-9/+113
pcpu_nr_empty_pop_pages counts the number of empty populated pages across all chunks and chunk->nr_populated counts the number of populated pages in a chunk. Both will be used to implement pre/async population for atomic allocations. pcpu_chunk_[de]populated() are added to update chunk->populated, chunk->nr_populated and pcpu_nr_empty_pop_pages together. All successful chunk [de]populations should be followed by the corresponding pcpu_chunk_[de]populated() calls. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: make sure chunk->map array has available spaceTejun Heo1-8/+45
An allocation attempt may require extending chunk->map array which requires GFP_KERNEL context which isn't available for atomic allocations. This patch ensures that chunk->map array usually keeps some amount of available space by directly allocating buffer space during GFP_KERNEL allocations and scheduling async extension during atomic ones. This should make atomic allocation failures from map space exhaustion rare. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: implement [__]alloc_percpu_gfp()Tejun Heo1-25/+39
Now that pcpu_alloc_area() can allocate only from populated areas, it's easy to add atomic allocation support to [__]alloc_percpu(). Update pcpu_alloc() so that it accepts @gfp and skips all the blocking operations and allocates only from the populated areas if @gfp doesn't contain GFP_KERNEL. New interface functions [__]alloc_percpu_gfp() are added. While this means that atomic allocations are possible, this isn't complete yet as there's no mechanism to ensure that certain amount of populated areas is kept available and atomic allocations may keep failing under certain conditions. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: indent the population block in pcpu_alloc()Tejun Heo1-17/+21
The next patch will conditionalize the population block in pcpu_alloc() which will end up making a rather large indentation change obfuscating the actual logic change. This patch puts the block under "if (true)" so that the next patch can avoid indentation changes. The defintions of the local variables which are used only in the block are moved into the block. This patch is purely cosmetic. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: make pcpu_alloc_area() capable of allocating only from populated areasTejun Heo1-7/+58
Update pcpu_alloc_area() so that it can skip unpopulated areas if the new parameter @pop_only is true. This is implemented by a new function, pcpu_fit_in_area(), which determines the amount of head padding considering the alignment and populated state. @pop_only is currently always false but this will be used to implement atomic allocation. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: restructure lockingTejun Heo1-40/+35
At first, the percpu allocator required a sleepable context for both alloc and free paths and used pcpu_alloc_mutex to protect everything. Later, pcpu_lock was introduced to protect the index data structure so that the free path can be invoked from atomic contexts. The conversion only updated what's necessary and left most of the allocation path under pcpu_alloc_mutex. The percpu allocator is planned to add support for atomic allocation and this patch restructures locking so that the coverage of pcpu_alloc_mutex is further reduced. * pcpu_alloc() now grab pcpu_alloc_mutex only while creating a new chunk and populating the allocated area. Everything else is now protected soley by pcpu_lock. After this change, multiple instances of pcpu_extend_area_map() may race but the function already implements sufficient synchronization using pcpu_lock. This also allows multiple allocators to arrive at new chunk creation. To avoid creating multiple empty chunks back-to-back, a new chunk is created iff there is no other empty chunk after grabbing pcpu_alloc_mutex. * pcpu_lock is now held while modifying chunk->populated bitmap. After this, all data structures are protected by pcpu_lock. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: move region iterations out of pcpu_[de]populate_chunk()Tejun Heo1-11/+8
Previously, pcpu_[de]populate_chunk() were called with the range which may contain multiple target regions in it and pcpu_[de]populate_chunk() iterated over the regions. This has the benefit of batching up cache flushes for all the regions; however, we're planning to add more bookkeeping logic around [de]population to support atomic allocations and this delegation of iterations gets in the way. This patch moves the region iterations out of pcpu_[de]populate_chunk() into its callers - pcpu_alloc() and pcpu_reclaim() - so that we can later add logic to track more states around them. This change may make cache and tlb flushes more frequent but multi-region [de]populations are rare anyway and if this actually becomes a problem, it's not difficult to factor out cache flushes as separate callbacks which are directly invoked from percpu.c. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-02percpu: move common parts out of pcpu_[de]populate_chunk()Tejun Heo1-9/+30
percpu-vm and percpu-km implement separate versions of pcpu_[de]populate_chunk() and some part which is or should be common are currently in the specific implementations. Make the following changes. * Allocate area clearing is moved from the pcpu_populate_chunk() implementations to pcpu_alloc(). This makes percpu-km's version noop. * Quick exit tests in pcpu_[de]populate_chunk() of percpu-vm are moved to their respective callers so that they are applied to percpu-km too. This doesn't make any meaningful difference as both functions are noop for percpu-km; however, this is more consistent and will help implementing atomic allocation support. Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-16percpu: free percpu allocation info for uniprocessor systemHonggang Li1-0/+2
Currently, only SMP system free the percpu allocation info. Uniprocessor system should free it too. For example, one x86 UML virtual machine with 256MB memory, UML kernel wastes one page memory. Signed-off-by: Honggang Li <enjoymindful@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org
2014-06-19percpu: Use ALIGN macro instead of hand coding alignment calculationChristoph Lameter1-2/+1
Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-04-15percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree()Jianyu Zhan1-1/+1
pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) + BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long) It hardly could be ever bigger than PAGE_SIZE even for large-scale machine, but for consistency with its couterpart pcpu_mem_zalloc(), use pcpu_mem_free() instead. Commit b4916cb17c26 ("percpu: make pcpu_free_chunk() use pcpu_mem_free() instead of kfree()") addressed this problem, but missed this one. tj: commit message updated Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 099a19d91ca4 ("percpu: allow limited allocation before slab is online) Cc: stable@vger.kernel.org
2014-03-29percpu: renew the max_contig if we merge the head and previous blockJianyu Zhan1-1/+3
During pcpu_alloc_area(), we might merge the current head with the previous block. Since we have calculated the max_contig using the size of previous block before we skip it, and now we update the size of previous block, so we should renew the max_contig. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-18percpu: allocation size should be evenViro1-1/+4
723ad1d90b56 ("percpu: store offsets instead of lengths in ->map[]") updated percpu area allocator to use the lowest bit, instead of sign, to signify whether the area is occupied and forced min align to 2; unfortunately, it forgot to force the allocation size to be even causing malfunctions for the very rare odd-sized allocations. Always force the allocations to be even sized. tj: Wrote patch description. Original-patch-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07percpu: speed alloc_pcpu_area() upAl Viro1-1/+17
If we know that first N areas are all in use, we can obviously skip them when searching for a free one. And that kind of hint is very easy to maintain. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07percpu: store offsets instead of lengths in ->map[]Al Viro1-55/+81
Current code keeps +-length for each area in chunk->map[]. It has several unpleasant consequences: * even if we know that first 50 areas are all in use, allocation still needs to go through all those areas just to sum their sizes, just to get the offset of free one. * freeing needs to find the array entry refering to the area in question; again, the need to sum the sizes until we reach the offset we are interested in. Note that offsets are monotonous, so simple binary search would do here. New data representation: array of <offset,in-use flag> pairs. Each pair is represented by one int - we use offset|1 for <offset, in use> and offset for <offset, free> (we make sure that all offsets are even). In the end we put a sentry entry - <total size, in use>. The first entry is <0, flag>; it would be possible to store together the flag for Nth area and offset for N+1st, but that leads to much hairier code. In other words, where the old variant would have 4, -8, -4, 4, -12, 100 (4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store <0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1> i.e. 0, 5, 13, 16, 21, 32, 133 This commit switches to new data representation and takes care of a couple of low-hanging fruits in free_pcpu_area() - one is the switch to binary search, another is not doing two memmove() when one would do. Speeding the alloc side up (by keeping track of how many areas in the beginning are known to be all in use) also becomes possible - that'll be done in the next commit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-03-07perpcu: fold pcpu_split_block() into the only callerAl Viro1-47/+16
... and simplify the results a bit. Makes the next step easier to deal with - we will be changing the data representation for chunk->map[] and it's easier to do if the code in question is not split between pcpu_alloc_area() and pcpu_split_block(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-01-22Merge branch 'akpm' (incoming from Andrew)Linus Torvalds1-16/+22
Merge first patch-bomb from Andrew Morton: - a couple of misc things - inotify/fsnotify work from Jan - ocfs2 updates (partial) - about half of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits) mm/migrate: remove unused function, fail_migrate_page() mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages mm/migrate: correct failure handling if !hugepage_migration_support() mm/migrate: add comment about permanent failure path mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure mm: compaction: reset scanner positions immediately when they meet mm: compaction: do not mark unmovable pageblocks as skipped in async compaction mm: compaction: detect when scanners meet in isolate_freepages mm: compaction: reset cached scanner pfn's before reading them mm: compaction: encapsulate defer reset logic mm: compaction: trace compaction begin and end memcg, oom: lock mem_cgroup_print_oom_info sched: add tracepoints related to NUMA task migration mm: numa: do not automatically migrate KSM pages mm: numa: trace tasks that fail migration due to rate limiting mm: numa: limit scope of lock for NUMA migrate rate limiting mm: numa: make NUMA-migrate related functions static lib/show_mem.c: show num_poisoned_pages when oom mm/hwpoison: add '#' to hwpoison_inject mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter ...
2014-01-22mm/percpu.c: use memblock apis for early memory allocationsSantosh Shilimkar1-16/+22
Switch to memblock interfaces for early memory allocator instead of bootmem allocator. No functional change in beahvior than what it is in current code from bootmem users points of view. Archs already converted to NO_BOOTMEM now directly use memblock interfaces instead of bootmem wrappers build on top of memblock. And the archs which still uses bootmem, these new apis just fallback to exiting bootmem APIs. Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Paul Walmsley <paul@pwsan.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Russell King <linux@arm.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Tony Lindgren <tony@atomide.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21percpu: use VMALLOC_TOTAL instead of VMALLOC_END - VMALLOC_STARTLaura Abbott1-2/+2
vmalloc already gives a useful macro to calculate the total vmalloc size. Use it. Signed-off-by: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-09-23percpu: fix bootmem error handling in pcpu_page_first_chunk()Michael Holzheu1-2/+3
If memory allocation of in pcpu_embed_first_chunk() fails, the allocated memory is not released correctly. In the release loop also the non-allocated elements are released which leads to the following kernel BUG on systems with very little memory: [ 0.000000] kernel BUG at mm/bootmem.c:307! [ 0.000000] illegal operation: 0001 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 0.000000] Modules linked in: [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0 #22 [ 0.000000] task: 0000000000a20ae0 ti: 0000000000a08000 task.ti: 0000000000a08000 [ 0.000000] Krnl PSW : 0400000180000000 0000000000abda7a (__free+0x116/0x154) [ 0.000000] R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:0 CC:0 PM:0 EA:3 ... [ 0.000000] [<0000000000abdce2>] mark_bootmem_node+0xde/0xf0 [ 0.000000] [<0000000000abdd9c>] mark_bootmem+0xa8/0x118 [ 0.000000] [<0000000000abcbba>] pcpu_embed_first_chunk+0xe7a/0xf0c [ 0.000000] [<0000000000abcc96>] setup_per_cpu_areas+0x4a/0x28c To fix the problem now only allocated elements are released. This then leads to the correct kernel panic: [ 0.000000] Kernel panic - not syncing: Failed to initialize percpu areas. ... [ 0.000000] Call Trace: [ 0.000000] ([<000000000011307e>] show_trace+0x132/0x150) [ 0.000000] [<0000000000113160>] show_stack+0xc4/0xd4 [ 0.000000] [<00000000007127dc>] dump_stack+0x74/0xd8 [ 0.000000] [<00000000007123fe>] panic+0xea/0x264 [ 0.000000] [<0000000000b14814>] setup_per_cpu_areas+0x5c/0x28c tj: Flipped if conditional so that it doesn't need "continue". Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-12-02mm, percpu: Make sure percpu_alloc early parameter has an argumentCyrill Gorcunov1-0/+3
Otherwise we are getting a nil dereference if percpu_alloc kernel boot argument is specified without value. | [ 0.000000] BUG: unable to handle kernel NULL pointer dereference at (null) | [ 0.000000] IP: [<ffffffff81391360>] strcmp+0x10/0x30 Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-10-29percpu: make pcpu_free_chunk() use pcpu_mem_free() instead of kfree()Joonsoo Kim1-1/+1
commit 099a19d9('allow limited allocation before slab is online') made pcpu_alloc_chunk() use pcpu_mem_zalloc() but forgot to update pcpu_free_chunk() accordingly. This doesn't cause any immediate problema, but fix it for consistency. tj: commit message updated Signed-off-by: Joonsoo Kim <js1304@gmail.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-10-05sections: fix section conflicts in mm/percpu.cAndi Kleen1-1/+1
Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-09kmemleak: Fix the kmemleak tracking of the percpu areas with !SMPCatalin Marinas1-0/+2
Kmemleak tracks the percpu allocations via a specific API and the originally allocated areas must be removed from kmemleak (via kmemleak_free). The code was already doing this for SMP systems. Reported-by: Sami Liedes <sami.liedes@iki.fi> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-05-09percpu: pcpu_embed_first_chunk() should free unused parts after all allocs ↵Tejun Heo1-0/+10
are complete pcpu_embed_first_chunk() allocates memory for each node, copies percpu data and frees unused portions of it before proceeding to the next group. This assumes that allocations for different nodes doesn't overlap; however, depending on memory topology, the bootmem allocator may end up allocating memory from a different node than the requested one which may overlap with the portion freed from one of the previous percpu areas. This leads to percpu groups for different nodes overlapping which is a serious bug. This patch separates out copy & partial free from the allocation loop such that all allocations are complete before partial frees happen. This also fixes overlapping frees which could happen on allocation failure path - out_free_areas path frees whole groups but the groups could have portions freed at that point. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Reported-by: "Pavel V. Panteleev" <pp_84@mail.ru> Tested-by: "Pavel V. Panteleev" <pp_84@mail.ru> LKML-Reference: <E1SNhwY-0007ui-V7.pp_84-mail-ru@f220.mail.ru>
2012-03-29percpu: use KERN_CONT in pcpu_dump_alloc_info()Tejun Heo1-5/+5
pcpu_dump_alloc_info() was printing continued lines without KERN_CONT. Use it. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Kay Sievers <kay.sievers@vrfy.org>
2012-01-15Merge tag 'kmemleak' of ↵Linus Torvalds1-1/+11
git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux Kmemleak patches Main features: - Handle percpu memory allocations (only scanning them, not actually reporting). - Memory hotplug support. Usability improvements: - Show the origin of early allocations. - Report previously found leaks even if kmemleak has been disabled by some error. * tag 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux: kmemleak: Add support for memory hotplug kmemleak: Handle percpu memory allocation kmemleak: Report previously found leaks even after an error kmemleak: When the early log buffer is exceeded, report the actual number kmemleak: Show where early_log issues come from
2011-12-15percpu: fix per_cpu_ptr_to_phys() handling of non-page-aligned addressesEugene Surovegin1-2/+4
per_cpu_ptr_to_phys() incorrectly rounds up its result for non-kmalloc case to the page boundary, which is bogus for any non-page-aligned address. This affects the only in-tree user of this function - sysfs handler for per-cpu 'crash_notes' physical address. The trouble is that the crash_notes per-cpu variable is not page-aligned: crash_notes = 0xc08e8ed4 PER-CPU OFFSET VALUES: CPU 0: 3711f000 CPU 1: 37129000 CPU 2: 37133000 CPU 3: 3713d000 So, the per-cpu addresses are: crash_notes on CPU 0: f7a07ed4 => phys 36b57ed4 crash_notes on CPU 1: f7a11ed4 => phys 36b4ded4 crash_notes on CPU 2: f7a1bed4 => phys 36b43ed4 crash_notes on CPU 3: f7a25ed4 => phys 36b39ed4 However, /sys/devices/system/cpu/cpu*/crash_notes says: /sys/devices/system/cpu/cpu0/crash_notes: 36b57000 /sys/devices/system/cpu/cpu1/crash_notes: 36b4d000 /sys/devices/system/cpu/cpu2/crash_notes: 36b43000 /sys/devices/system/cpu/cpu3/crash_notes: 36b39000 As you can see, all values are rounded down to a page boundary. Consequently, this is where kexec sets up the NOTE segments, and thus where the secondary kernel is looking for them. However, when the first kernel crashes, it saves the notes to the unaligned addresses, where they are not found. Fix it by adding offset_in_page() to the translated page address. -tj: Combined Eugene's and Petr's commit messages. Signed-off-by: Eugene Surovegin <ebs@ebshome.net> Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Petr Tesarik <ptesarik@suse.cz> Cc: stable@kernel.org
2011-12-02kmemleak: Handle percpu memory allocationCatalin Marinas1-1/+11
This patch adds kmemleak callbacks from the percpu allocator, reducing a number of false positives caused by kmemleak not scanning such memory blocks. The percpu chunks are never reported as leaks because of current kmemleak limitations with the __percpu pointer not pointing directly to the actual chunks. Reported-by: Huajun Li <huajun.li.lee@gmail.com> Acked-by: Christoph Lameter <cl@gentwo.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2011-11-23percpu: explain why per_cpu_ptr_to_phys() is more complicated than necessaryDave Young1-0/+11
Add comments about current per_cpu_ptr_to_phys implementation to explain why the logic is more complicated than necessary. -tj: relocated comment into kerneldoc comment Signed-off-by: Dave Young <dyoung@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-11-22percpu: fix chunk range calculationTejun Heo1-14/+20
Percpu allocator recorded the cpus which map to the first and last units in pcpu_first/last_unit_cpu respectively and used them to determine the address range of a chunk - e.g. it assumed that the first unit has the lowest address in a chunk while the last unit has the highest address. This simply isn't true. Groups in a chunk can have arbitrary positive or negative offsets from the previous one and there is no guarantee that the first unit occupies the lowest offset while the last one the highest. Fix it by actually comparing unit offsets to determine cpus occupying the lowest and highest offsets. Also, rename pcu_first/last_unit_cpu to pcpu_low/high_unit_cpu to avoid confusion. The chunk address range is used to flush cache on vmalloc area map/unmap and decide whether a given address is in the first chunk by per_cpu_ptr_to_phys() and the bug was discovered by invalid per_cpu_ptr_to_phys() translation for crash_note. Kudos to Dave Young for tracking down the problem. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: WANG Cong <xiyou.wangcong@gmail.com> Reported-by: Dave Young <dyoung@redhat.com> Tested-by: Dave Young <dyoung@redhat.com> LKML-Reference: <4EC21F67.10905@redhat.com> Cc: stable @kernel.org
2011-11-22percpu: rename pcpu_mem_alloc to pcpu_mem_zallocBob Liu1-8/+9
Currently pcpu_mem_alloc() is implemented always return zeroed memory. So rename it to make user like pcpu_get_pages_and_bitmap() know don't reinit it. Signed-off-by: Bob Liu <lliubbo@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-05-24Merge branch 'for-2.6.40' of ↵Linus Torvalds1-2/+4
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: Unify input section names percpu: Avoid extra NOP in percpu_cmpxchg16b_double percpu: Cast away printk format warning percpu: Always align percpu output section to PAGE_SIZE Fix up fairly trivial conflict in arch/x86/include/asm/percpu.h as per Tejun
2011-05-24Merge branch 'fixes-2.6.39' into for-2.6.40Tejun Heo1-2/+2
2011-03-31Fix common misspellingsLucas De Marchi1-5/+5
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-28percpu: Cast away printk format warningMike Frysinger1-2/+2
On 32-bit systems which don't happen to implicitly define or cast VMALLOC_START and/or VMALLOC_END to long in their arch headers, the printk in the percpu code will cause a warning to be emitted: mm/percpu.c: In function 'pcpu_embed_first_chunk': mm/percpu.c:1648: warning: format '%lx' expects type 'long unsigned int', but argument 3 has type 'unsigned int' So add an explicit cast to unsigned long here. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2011-03-28NOMMU: percpu should use is_vmalloc_addr().David Howells1-2/+1
per_cpu_ptr_to_phys() uses VMALLOC_START and VMALLOC_END to determine if an address is in the vmalloc() region or not. This is incorrect on NOMMU as there is no real vmalloc() capability (vmalloc() is emulated by kmalloc()). The correct way to do this is to use is_vmalloc_addr(). This encapsulates the vmalloc() region test in MMU mode and just returns 0 in NOMMU mode. On FRV in NOMMU mode, the percpu compilation fails without this patch: mm/percpu.c: In function 'per_cpu_ptr_to_phys': mm/percpu.c:1011: error: 'VMALLOC_START' undeclared (first use in this function) mm/percpu.c:1011: error: (Each undeclared identifier is reported only once mm/percpu.c:1011: error: for each function it appears in.) mm/percpu.c:1012: error: 'VMALLOC_END' undeclared (first use in this function) mm/percpu.c:1018: warning: control reaches end of non-void function Signed-off-by: David Howells <dhowells@redhat.com>
2011-03-24percpu: Always align percpu output section to PAGE_SIZETejun Heo1-0/+2
Percpu allocator honors alignment request upto PAGE_SIZE and both the percpu addresses in the percpu address space and the translated kernel addresses should be aligned accordingly. The calculation of the former depends on the alignment of percpu output section in the kernel image. The linker script macros PERCPU_VADDR() and PERCPU() are used to define this output section and the latter takes @align parameter. Several architectures are using @align smaller than PAGE_SIZE breaking percpu memory alignment. This patch removes @align parameter from PERCPU(), renames it to PERCPU_SECTION() and makes it always align to PAGE_SIZE. While at it, add PCPU_SETUP_BUG_ON() checks such that alignment problems are reliably detected and remove percpu alignment comment recently added in workqueue.c as the condition would trigger BUG way before reaching there. For um, this patch raises the alignment of percpu area. As the area is in .init, there shouldn't be any noticeable difference. This problem was discovered by David Howells while debugging boot failure on mn10300. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Mike Frysinger <vapier@gentoo.org> Cc: uclinux-dist-devel@blackfin.uclinux.org Cc: David Howells <dhowells@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: user-mode-linux-devel@lists.sourceforge.net
2011-01-13Merge branch 'for-next' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) Documentation/trace/events.txt: Remove obsolete sched_signal_send. writeback: fix global_dirty_limits comment runtime -> real-time ppc: fix comment typo singal -> signal drivers: fix comment typo diable -> disable. m68k: fix comment typo diable -> disable. wireless: comment typo fix diable -> disable. media: comment typo fix diable -> disable. remove doc for obsolete dynamic-printk kernel-parameter remove extraneous 'is' from Documentation/iostats.txt Fix spelling milisec -> ms in snd_ps3 module parameter description Fix spelling mistakes in comments Revert conflicting V4L changes i7core_edac: fix typos in comments mm/rmap.c: fix comment sound, ca0106: Fix assignment to 'channel'. hrtimer: fix a typo in comment init/Kconfig: fix typo anon_inodes: fix wrong function name in comment fix comment typos concerning "consistent" poll: fix a typo in comment ... Fix up trivial conflicts in: - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c) - fs/ext4/ext4.h Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-08Merge branch 'for-2.6.38' of ↵Linus Torvalds1-6/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits) gameport: use this_cpu_read instead of lookup x86: udelay: Use this_cpu_read to avoid address calculation x86: Use this_cpu_inc_return for nmi counter x86: Replace uses of current_cpu_data with this_cpu ops x86: Use this_cpu_ops to optimize code vmstat: User per cpu atomics to avoid interrupt disable / enable irq_work: Use per cpu atomics instead of regular atomics cpuops: Use cmpxchg for xchg to avoid lock semantics x86: this_cpu_cmpxchg and this_cpu_xchg operations percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support percpu,x86: relocate this_cpu_add_return() and friends connector: Use this_cpu operations xen: Use this_cpu_inc_return taskstats: Use this_cpu_ops random: Use this_cpu_inc_return fs: Use this_cpu_inc_return in buffer.c highmem: Use this_cpu_xx_return() operations vmstat: Use this_cpu_inc_return for vm statistics x86: Support for this_cpu_add, sub, dec, inc_return percpu: Generic support for this_cpu_add, sub, dec, inc_return ... Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c} as per Tejun.
2010-12-22percpu: print out alloc information with KERN_DEBUG instead of KERN_INFOTejun Heo1-1/+1
Now that percpu allocator is mostly stable, there is no reason to print alloc information with KERN_INFO and clutter the boot messages. Switch it to KERN_DEBUG. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Mike Travis <travis@sgi.com>