summaryrefslogtreecommitdiff
path: root/mm/slab.c
AgeCommit message (Collapse)AuthorFilesLines
2012-08-01mm: micro-optimise slab to avoid a function callMel Gorman1-2/+26
Getting and putting objects in SLAB currently requires a function call but the bulk of the work is related to PFMEMALLOC reserves which are only consumed when network-backed storage is critical. Use an inline function to determine if the function call is required. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm: introduce __GFP_MEMALLOC to allow access to emergency reservesMel Gorman1-1/+1
__GFP_MEMALLOC will allow the allocation to disregard the watermarks, much like PF_MEMALLOC. It allows one to pass along the memalloc state in object related allocation flags as opposed to task related flags, such as sk->sk_allocation. This removes the need for ALLOC_PFMEMALLOC as callers using __GFP_MEMALLOC can get the ALLOC_NO_WATERMARK flag which is now enough to identify allocations related to page reclaim. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01mm: sl[au]b: add knowledge of PFMEMALLOC reserve pagesMel Gorman1-18/+174
When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-09mm, sl[aou]b: Move kmem_cache_create mutex handling to common codeChristoph Lameter1-51/+1
Move the mutex handling into the common kmem_cache_create() function. Then we can also move more checks out of SLAB's kmem_cache_create() into the common code. Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-09mm, sl[aou]b: Use a common mutex definitionChristoph Lameter1-57/+51
Use the mutex definition from SLAB and make it the common way to take a sleeping lock. This has the effect of using a mutex instead of a rw semaphore for SLUB. SLOB gains the use of a mutex for kmem_cache_create serialization. Not needed now but SLOB may acquire some more features later (like slabinfo / sysfs support) through the expansion of the common code that will need this. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-09mm, sl[aou]b: Common definition for boot state of the slab allocatorsChristoph Lameter1-31/+14
All allocators have some sort of support for the bootstrap status. Setup a common definition for the boot states and make all slab allocators use that definition. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-09mm, sl[aou]b: Extract common code for kmem_cache_create()Christoph Lameter1-15/+9
Kmem_cache_create() does a variety of sanity checks but those vary depending on the allocator. Use the strictest tests and put them into a slab_common file. Make the tests conditional on CONFIG_DEBUG_VM. This patch has the effect of adding sanity checks for SLUB and SLOB under CONFIG_DEBUG_VM and removes the checks in SLAB for !CONFIG_DEBUG_VM. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-02slab: move FULL state transition to an initcallGlauber Costa1-4/+4
During kmem_cache_init_late(), we transition to the LATE state, and after some more work, to the FULL state, its last state This is quite different from slub, that will only transition to its last state (previously SYSFS), in a (late)initcall, after a lot more of the kernel is ready. This means that in slab, we have no way to taking actions dependent on the initialization of other pieces of the kernel that are supposed to start way after kmem_init_late(), such as cgroups initialization. To achieve more consistency in this behavior, that patch only transitions to the UP state in kmem_init_late. In my analysis, setup_cpu_cache() should be happy to test for >= UP, instead of == FULL. It also has passed some tests I've made. We then only mark FULL state after the reap timers are in place, meaning that no further setup is expected. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-02slab: Fix a typo in commit 8c138b "slab: Get rid of obj_size macro"Feng Tang1-1/+1
Commit 8c138b only sits in Pekka's and linux-next tree now, which tries to replace obj_size(cachep) with cachep->object_size, but has a typo in kmem_cache_free() by using "size" instead of "object_size", which casues some regressions. Reported-and-tested-by: Fengguang Wu <wfg@linux.intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-02mm, slab: Build fix for recent kmem_cache changesThierry Reding1-1/+1
Commit 3b0efdf ("mm, sl[aou]b: Extract common fields from struct kmem_cache") renamed the kmem_cache structure's "next" field to "list" but forgot to update one instance in leaks_show(). Signed-off-by: Thierry Reding <thierry.reding@avionic-design.de> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-07-02slab: rename gfpflags to allocflagsGlauber Costa1-5/+5
A consistent name with slub saves us an acessor function. In both caches, this field represents the same thing. We would like to use it from the mem_cgroup code. Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> CC: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-20slab/mempolicy: always use local policy from interrupt contextAndi Kleen1-2/+2
slab_node() could access current->mempolicy from interrupt context. However there's a race condition during exit where the mempolicy is first freed and then the pointer zeroed. Using this from interrupts seems bogus anyways. The interrupt will interrupt a random process and therefore get a random mempolicy. Many times, this will be idle's, which noone can change. Just disable this here and always use local for slab from interrupts. I also cleaned up the callers of slab_node a bit which always passed the same argument. I believe the original mempolicy code did that in fact, so it's likely a regression. v2: send version with correct logic v3: simplify. fix typo. Reported-by: Arun Sharma <asharma@fb.com> Cc: penberg@kernel.org Cc: cl@linux.com Signed-off-by: Andi Kleen <ak@linux.intel.com> [tdmackey@twitter.com: Rework control flow based on feedback from cl@linux.com, fix logic, and cleanup current task_struct reference] Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: David Mackey <tdmackey@twitter.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-14slab: Get rid of obj_size macroChristoph Lameter1-26/+21
The size of the slab object is frequently needed. Since we now have a size field directly in the kmem_cache structure there is no need anymore of the obj_size macro/function. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-14mm, sl[aou]b: Extract common fields from struct kmem_cacheChristoph Lameter1-59/+58
Define a struct that describes common fields used in all slab allocators. A slab allocator either uses the common definition (like SLOB) or is required to provide members of kmem_cache with the definition given. After that it will be possible to share code that only operates on those fields of kmem_cache. The patch basically takes the slob definition of kmem cache and uses the field namees for the other allocators. It also standardizes the names used for basic object lengths in allocators: object_size Struct size specified at kmem_cache_create. Basically the payload expected to be used by the subsystem. size The size of memory allocator for each object. This size is larger than object_size and includes padding, alignment and extra metadata for each object (f.e. for debugging and rcu). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-14slab: Remove some accessorsChristoph Lameter1-27/+8
Those are rather trivial now and its better to see inline what is really going on. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-06-14slab: Use page struct fields instead of castingChristoph Lameter1-4/+4
Add fields to the page struct so that it is properly documented that slab overlays the lru fields. This cleans up some casts in slab. Reviewed-by: Glauber Costa <glommer@parallels.com> Reviewed-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-03-29Merge branch 'slab/for-linus' of ↵Linus Torvalds1-4/+52
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB changes from Pekka Enberg: "There's the new kmalloc_array() API, minor fixes and performance improvements, but quite honestly, nothing terribly exciting." * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: mm: SLAB Out-of-memory diagnostics slab: introduce kmalloc_array() slub: per cpu partial statistics change slub: include include for prefetch slub: Do not hold slub_lock when calling sysfs_slab_add() slub: prefetch next freelist pointer in slab_alloc() slab, cleanup: remove unneeded return
2012-03-22cpuset: mm: reduce large amounts of memory barrier related damage v3Mel Gorman1-5/+8
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-10mm: SLAB Out-of-memory diagnosticsRafael Aquini1-1/+50
Following the example at mm/slub.c, add out-of-memory diagnostics to the SLAB allocator to help on debugging certain OOM conditions. An example print out looks like this: <snip page allocator out-of-memory message> SLAB: Unable to allocate memory on node 0 (gfp=0x11200) cache: bio-0, object size: 192, order: 0 node 0: slabs: 3/3, objs: 60/60, free: 0 Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-01-23slab, cleanup: remove unneeded returnZhao Jin1-3/+2
The procedure ends right after the if-statement, so remove ``return''. Also move the last common statement outside. Signed-off-by: Zhao Jin <cronozhj@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-01-12Merge branch 'slab/for-linus' of ↵Linus Torvalds1-12/+27
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: slub: disallow changing cpu_partial from userspace for debug caches slub: add missed accounting slub: Extract get_freelist from __slab_alloc slub: Switch per cpu partial page support off for debugging slub: fix a possible memleak in __slab_alloc() slub: fix slub_max_order Documentation slub: add missed accounting slab: add taint flag outputting to debug paths. slub: add taint flag outputting to debug paths slab: introduce slab_max_order kernel parameter slab: rename slab_break_gfp_order to slab_max_order
2012-01-11Merge branch 'slab/urgent' into slab/for-linusPekka Enberg1-1/+4
2012-01-10tracing/mm: Move include of trace/events/kmem.h out of header into slab.cSteven Rostedt1-0/+2
Including trace/events/*.h TRACE_EVENT() macro headers in other headers can cause strange side effects if another trace/event/*.h header includes that header. Having trace/events/kmem.h inside slab_def.h caused a compile error in sparc64 when changes were done to some header files. Moving the kmem.h trace header out of slab.h and into slab.c fixes the problem. Note, both slub.c and slob.c already include the trace/events/kmem.h file. Only slab.c had it missing. Link: http://lkml.kernel.org/r/20120105190405.1e3191fb5a43b2a0f1655e1f@canb.auug.org.au Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-05slab, lockdep: Fix silly bugPeter Zijlstra1-1/+4
Commit 30765b92 ("slab, lockdep: Annotate the locks before using them") moves the init_lock_keys() call from after g_cpucache_up = FULL, to before it. And overlooks the fact that init_node_lock_keys() tests for it and ignores everything !FULL. Introduce a LATE stage and change the lockdep test to be <LATE. Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: stable@kernel.org Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-16slab: add taint flag outputting to debug paths.Dave Jones1-4/+5
When we get corruption reports, it's useful to see if the kernel was tainted, to rule out problems we can't do anything about. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-10slab: introduce slab_max_order kernel parameterDavid Rientjes1-3/+17
Introduce new slab_max_order kernel parameter which is the equivalent of slub_max_order. For immediate purposes, allows users to override the heuristic that sets the max order to 1 by default if they have more than 32MB of RAM. This may result in page allocation failures if there is substantial fragmentation. Another usecase would be to increase the max order for better performance. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-10slab: rename slab_break_gfp_order to slab_max_orderDavid Rientjes1-5/+5
slab_break_gfp_order is more appropriately named slab_max_order since it enforces the maximum order size of slabs as long as a single object will still fit. Also rename BREAK_GFP_ORDER_{LO,HI} accordingly. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-27mm: restrict access to slab files under procfs and sysfsVasiliy Kulikov1-1/+1
Historically /proc/slabinfo and files under /sys/kernel/slab/* have world read permissions and are accessible to the world. slabinfo contains rather private information related both to the kernel and userspace tasks. Depending on the situation, it might reveal either private information per se or information useful to make another targeted attack. Some examples of what can be learned by reading/watching for /proc/slabinfo entries: 1) dentry (and different *inode*) number might reveal other processes fs activity. The number of dentry "active objects" doesn't strictly show file count opened/touched by a process, however, there is a good correlation between them. The patch "proc: force dcache drop on unauthorized access" relies on the privacy of dentry count. 2) different inode entries might reveal the same information as (1), but these are more fine granted counters. If a filesystem is mounted in a private mount point (or even a private namespace) and fs type differs from other mounted fs types, fs activity in this mount point/namespace is revealed. If there is a single ecryptfs mount point, the whole fs activity of a single user is revealed. Number of files in ecryptfs mount point is a private information per se. 3) fuse_* reveals number of files / fs activity of a user in a user private mount point. It is approx. the same severity as ecryptfs infoleak in (2). 4) sysfs_dir_cache similar to (2) reveals devices' addition/removal, which can be otherwise hidden by "chmod 0700 /sys/". With 0444 slabinfo the precise number of sysfs files is known to the world. 5) buffer_head might reveal some kernel activity. With other information leaks an attacker might identify what specific kernel routines generate buffer_head activity. 6) *kmalloc* infoleaks are very situational. Attacker should watch for the specific kmalloc size entry and filter the noise related to the unrelated kernel activity. If an attacker has relatively silent victim system, he might get rather precise counters. Additional information sources might significantly increase the slabinfo infoleak benefits. E.g. if an attacker knows that the processes activity on the system is very low (only core daemons like syslog and cron), he may run setxid binaries / trigger local daemon activity / trigger network services activity / await sporadic cron jobs activity / etc. and get rather precise counters for fs and network activity of these privileged tasks, which is unknown otherwise. Also hiding slabinfo and /sys/kernel/slab/* is a one step to complicate exploitation of kernel heap overflows (and possibly, other bugs). The related discussion: http://thread.gmane.org/gmane.linux.kernel/1108378 To keep compatibility with old permission model where non-root monitoring daemon could watch for kernel memleaks though slabinfo one should do: groupadd slabinfo usermod -a -G slabinfo $MONITOR_USER And add the following commands to init scripts (to mountall.conf in Ubuntu's upstart case): chmod g+r /proc/slabinfo /sys/kernel/slab/*/* chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/* Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Reviewed-by: Kees Cook <kees@ubuntu.com> Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@gentwo.org> Acked-by: David Rientjes <rientjes@google.com> CC: Valdis.Kletnieks@vt.edu CC: Linus Torvalds <torvalds@linux-foundation.org> CC: Alan Cox <alan@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-19Merge branch 'slab/urgent' into slab/nextPekka Enberg1-24/+75
2011-08-04slab, lockdep: Annotate the locks before using themPeter Zijlstra1-3/+3
Fernando found we hit the regular OFF_SLAB 'recursion' before we annotate the locks, cure this. The relevant portion of the stack-trace: > [ 0.000000] [<c085e24f>] rt_spin_lock+0x50/0x56 > [ 0.000000] [<c04fb406>] __cache_free+0x43/0xc3 > [ 0.000000] [<c04fb23f>] kmem_cache_free+0x6c/0xdc > [ 0.000000] [<c04fb2fe>] slab_destroy+0x4f/0x53 > [ 0.000000] [<c04fb396>] free_block+0x94/0xc1 > [ 0.000000] [<c04fc551>] do_tune_cpucache+0x10b/0x2bb > [ 0.000000] [<c04fc8dc>] enable_cpucache+0x7b/0xa7 > [ 0.000000] [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61 > [ 0.000000] [<c0bba687>] start_kernel+0x24c/0x363 > [ 0.000000] [<c0bba0ba>] i386_start_kernel+0xa9/0xaf Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-04slab, lockdep: Annotate slab -> rcu -> debug_object -> slabPeter Zijlstra1-18/+68
Lockdep thinks there's lock recursion through: kmem_cache_free() cache_flusharray() spin_lock(&l3->list_lock) <----------------. free_block() | slab_destroy() | call_rcu() | debug_object_activate() | debug_object_init() | __debug_object_init() | kmem_cache_alloc() | cache_alloc_refill() | spin_lock(&l3->list_lock) --' Now debug objects doesn't use SLAB_DESTROY_BY_RCU and hence there is no actual possibility of recursing. Luckily debug objects marks it slab with SLAB_DEBUG_OBJECTS so we can identify the thing. Mark all SLAB_DEBUG_OBJECTS (all one!) slab caches with a special lockdep key so that lockdep sees its a different cachep. Also add a WARN on trying to create a SLAB_DESTROY_BY_RCU | SLAB_DEBUG_OBJECTS cache, to avoid possible future trouble. Reported-and-tested-by: Sebastian Siewior <sebastian@breakpoint.cc> [ fixes to the initial patch ] Reported-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311341165.27400.58.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-31slab: use print_hex_dumpSebastian Andrzej Siewior1-11/+6
Less code and the advantage of ascii dump. before: | Slab corruption: names_cache start=c5788000, len=4096 | 000: 6b 6b 01 00 00 00 56 00 00 00 24 00 00 00 2a 00 | 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff | 030: ff ff ff ff e2 b4 17 18 c7 e4 08 06 00 01 08 00 | 040: 06 04 00 01 e2 b4 17 18 c7 e4 0a 00 00 01 00 00 | 050: 00 00 00 00 0a 00 00 02 6b 6b 6b 6b 6b 6b 6b 6b after: | Slab corruption: size-4096 start=c38a9000, len=4096 | 000: 6b 6b 01 00 00 00 56 00 00 00 24 00 00 00 2a 00 kk....V...$...*. | 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ | 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff ................ | 030: ff ff ff ff d2 56 5f aa db 9c 08 06 00 01 08 00 .....V_......... | 040: 06 04 00 01 d2 56 5f aa db 9c 0a 00 00 01 00 00 .....V_......... | 050: 00 00 00 00 0a 00 00 02 6b 6b 6b 6b 6b 6b 6b 6b ........kkkkkkkk Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-31slab: use NUMA_NO_NODEAndrew Morton1-1/+1
Use the nice enumerated constant. Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-28slab: remove one NR_CPUS dependencyEric Dumazet1-2/+3
Reduce high order allocations in do_tune_cpucache() for some setups. (NR_CPUS=4096 -> we need 64KB) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-22Merge branch 'slab-for-linus' of ↵Linus Torvalds1-8/+9
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'slab-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slab: fix DEBUG_SLAB warning slab: shrink sizeof(struct kmem_cache) slab: fix DEBUG_SLAB build SLUB: Fix missing <linux/stacktrace.h> include slub: reduce overhead of slub_debug slub: Add method to verify memory is not freed slub: Enable backtrace for create/delete points slab allocators: Provide generic description of alignment defines slab, slub, slob: Unify alignment definition slob/lockdep: Fix gfp flags passed to lockdep
2011-07-22slab: fix DEBUG_SLAB warningTetsuo Handa1-1/+2
In commit c225150b "slab: fix DEBUG_SLAB build", "if ((unsigned long)objp & (ARCH_SLAB_MINALIGN-1))" is always true if ARCH_SLAB_MINALIGN == 0. Do not print warning if ARCH_SLAB_MINALIGN == 0. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-20slab: shrink sizeof(struct kmem_cache)Eric Dumazet1-4/+6
Reduce high order allocations for some setups. (NR_CPUS=4096 -> we need 64KB per kmem_cache struct) We now allocate exact needed size (using nr_cpu_ids and nr_node_ids) This also makes code a bit smaller on x86_64, since some field offsets are less than the 127 limit : Before patch : # size mm/slab.o text data bss dec hex filename 22605 361665 32 384302 5dd2e mm/slab.o After patch : # size mm/slab.o text data bss dec hex filename 22349 353473 8224 384046 5dc2e mm/slab.o CC: Andrew Morton <akpm@linux-foundation.org> Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-18slab: fix DEBUG_SLAB buildHugh Dickins1-4/+2
Fix CONFIG_SLAB=y CONFIG_DEBUG_SLAB=y build error and warnings. Now that ARCH_SLAB_MINALIGN defaults to __alignof__(unsigned long long), it is always defined (when slab.h included), but cannot be used in #if: mm/slab.c: In function `cache_alloc_debugcheck_after': mm/slab.c:3156:5: warning: "__alignof__" is not defined mm/slab.c:3156:5: error: missing binary operator before token "(" make[1]: *** [mm/slab.o] Error 1 So just remove the #if and #endif lines, but then 64-bit build warns: mm/slab.c: In function `cache_alloc_debugcheck_after': mm/slab.c:3156:6: warning: cast from pointer to integer of different size mm/slab.c:3158:10: warning: format `%d' expects type `int', but argument 3 has type `long unsigned int' Fix those with casts, whatever the actual type of ARCH_SLAB_MINALIGN. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-06-03SLAB: Record actual last user of freed objects.Suleiman Souhlal1-4/+5
Currently, when using CONFIG_DEBUG_SLAB, we put in kfree() or kmem_cache_free() as the last user of free objects, which is not very useful, so change it to the caller of those functions instead. Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-05-20sanitize <linux/prefetch.h> usageLinus Torvalds1-0/+1
Commit e66eed651fd1 ("list: remove prefetching from regular list iterators") removed the include of prefetch.h from list.h, which uncovered several cases that had apparently relied on that rather obscure header file dependency. So this fixes things up a bit, using grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]') grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]') to guide us in finding files that either need <linux/prefetch.h> inclusion, or have it despite not needing it. There are more of them around (mostly network drivers), but this gets many core ones. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-31Fix common misspellingsLucas De Marchi1-2/+2
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-23mm: notifier_from_errno() cleanupPrarit Bhargava1-1/+1
While looking at some other notifier callbacks I noticed this code could use a simple cleanup. notifier_from_errno() no longer needs the if (ret)/else conditional. That same conditional is now done in notifier_from_errno(). Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-11Merge branch 'slab/urgent' into slab/nextPekka Enberg1-4/+4
2011-03-11Merge branch 'slab/rcu' into slab/nextPekka Enberg1-18/+21
Conflicts: mm/slub.c
2011-03-11slab,rcu: don't assume the size of struct rcu_headLai Jiangshan1-18/+21
The size of struct rcu_head may be changed. When it becomes larger, it may pollute the data after struct slab. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-02-14Revert "slab: Fix missing DEBUG_SLAB last user"Pekka Enberg1-4/+4
This reverts commit 5c5e3b33b7cb959a401f823707bee006caadd76e. The commit breaks ARM thusly: | Mount-cache hash table entries: 512 | slab error in verify_redzone_free(): cache `idr_layer_cache': memory outside object was overwritten | Backtrace: | [<c0227088>] (dump_backtrace+0x0/0x110) from [<c0431afc>] (dump_stack+0x18/0x1c) | [<c0431ae4>] (dump_stack+0x0/0x1c) from [<c0293304>] (__slab_error+0x28/0x30) | [<c02932dc>] (__slab_error+0x0/0x30) from [<c0293a74>] (cache_free_debugcheck+0x1c0/0x2b8) | [<c02938b4>] (cache_free_debugcheck+0x0/0x2b8) from [<c0293f78>] (kmem_cache_free+0x3c/0xc0) | [<c0293f3c>] (kmem_cache_free+0x0/0xc0) from [<c032b1c8>] (ida_get_new_above+0x19c/0x1c0) | [<c032b02c>] (ida_get_new_above+0x0/0x1c0) from [<c02af7ec>] (alloc_vfsmnt+0x54/0x144) | [<c02af798>] (alloc_vfsmnt+0x0/0x144) from [<c0299830>] (vfs_kern_mount+0x30/0xec) | [<c0299800>] (vfs_kern_mount+0x0/0xec) from [<c0299908>] (kern_mount_data+0x1c/0x20) | [<c02998ec>] (kern_mount_data+0x0/0x20) from [<c02146c4>] (sysfs_init+0x68/0xc8) | [<c021465c>] (sysfs_init+0x0/0xc8) from [<c02137d4>] (mnt_init+0x90/0x1b0) | [<c0213744>] (mnt_init+0x0/0x1b0) from [<c0213388>] (vfs_caches_init+0x100/0x140) | [<c0213288>] (vfs_caches_init+0x0/0x140) from [<c0208c0c>] (start_kernel+0x2e8/0x368) | [<c0208924>] (start_kernel+0x0/0x368) from [<c0208034>] (__enable_mmu+0x0/0x2c) | c0113268: redzone 1:0xd84156c5c032b3ac, redzone 2:0xd84156c5635688c0. | slab error in cache_alloc_debugcheck_after(): cache `idr_layer_cache': double free, or memory outside object was overwritten | ... | c011307c: redzone 1:0x9f91102ffffffff, redzone 2:0x9f911029d74e35b | slab: Internal list corruption detected in cache 'idr_layer_cache'(24), slabp c0113000(16). Hexdump: | | 000: 20 4f 10 c0 20 4f 10 c0 7c 00 00 00 7c 30 11 c0 | 010: 10 00 00 00 10 00 00 00 00 00 c9 17 fe ff ff ff | 020: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff | 030: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff | 040: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff | 050: fe ff ff ff fe ff ff ff fe ff ff ff 11 00 00 00 | 060: 12 00 00 00 13 00 00 00 14 00 00 00 15 00 00 00 | 070: 16 00 00 00 17 00 00 00 c0 88 56 63 | kernel BUG at /home/rmk/git/linux-2.6-rmk/mm/slab.c:2928! Reference: https://lkml.org/lkml/2011/2/7/238 Cc: <stable@kernel.org> # 2.6.35.y and later Reported-and-analyzed-by: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-01-23mm: Remove support for kmem_cache_name()Christoph Lameter1-8/+0
The last user was ext4 and Eric Sandeen removed the call in a recent patch. See the following URL for the discussion: http://marc.info/?l=linux-ext4&m=129546975702198&w=2 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-01-15mm/slab.c: make local symbols staticH Hartley Sweeten1-3/+3
Local symbols should be static. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-01-10Merge branch 'for-linus' of ↵Linus Torvalds1-15/+23
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slub: Fix a crash during slabinfo -v tracing/slab: Move kmalloc tracepoint out of inline code slub: Fix slub_lock down/up imbalance slub: Fix build breakage in Documentation/vm slub tracing: move trace calls out of always inlined functions to reduce kernel code size slub: move slabinfo.c to tools/slub/slabinfo.c
2011-01-08Merge branch 'for-2.6.38' of ↵Linus Torvalds1-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits) gameport: use this_cpu_read instead of lookup x86: udelay: Use this_cpu_read to avoid address calculation x86: Use this_cpu_inc_return for nmi counter x86: Replace uses of current_cpu_data with this_cpu ops x86: Use this_cpu_ops to optimize code vmstat: User per cpu atomics to avoid interrupt disable / enable irq_work: Use per cpu atomics instead of regular atomics cpuops: Use cmpxchg for xchg to avoid lock semantics x86: this_cpu_cmpxchg and this_cpu_xchg operations percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support percpu,x86: relocate this_cpu_add_return() and friends connector: Use this_cpu operations xen: Use this_cpu_inc_return taskstats: Use this_cpu_ops random: Use this_cpu_inc_return fs: Use this_cpu_inc_return in buffer.c highmem: Use this_cpu_xx_return() operations vmstat: Use this_cpu_inc_return for vm statistics x86: Support for this_cpu_add, sub, dec, inc_return percpu: Generic support for this_cpu_add, sub, dec, inc_return ... Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c} as per Tejun.