summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2010-10-07Merge commit 'v2.6.36-rc7' into core/rcuIngo Molnar14-67/+161
Merge reason: Update from -rc3 to -rc7. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-07Merge branch 'rcu/urgent' of ↵Ingo Molnar25-148/+327
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu
2010-10-07HWPOISON: Copy si_addr_lsb to userAndi Kleen1-0/+8
The original hwpoison code added a new siginfo field si_addr_lsb to pass the granuality of the fault address to user space. Unfortunately this field was never copied to user space. Fix this here. I added explicit checks for the MCEERR codes to avoid having to patch all potential callers to initialize the field. Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-10-06rcu: move check from rcu_dereference_bh to rcu_read_lock_bh_heldPaul E. McKenney1-1/+1
As suggested by Linus, push the irqs_disabled() down to the rcu_read_lock_bh_held() level so that all callers get the benefit of the correct check. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-10-06Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds1-3/+14
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: rcu_read_lock_bh_held(): disabling irqs also disables bh generic-ipi: Fix deadlock in __smp_call_function_single
2010-10-05modules: Fix module_bug_list list corruption raceLinus Torvalds1-0/+4
With all the recent module loading cleanups, we've minimized the code that sits under module_mutex, fixing various deadlocks and making it possible to do most of the module loading in parallel. However, that whole conversion totally missed the rather obscure code that adds a new module to the list for BUG() handling. That code was doubly obscure because (a) the code itself lives in lib/bugs.c (for dubious reasons) and (b) it gets called from the architecture-specific "module_finalize()" rather than from generic code. Calling it from arch-specific code makes no sense what-so-ever to begin with, and is now actively wrong since that code isn't protected by the module loading lock any more. So this commit moves the "module_bug_{finalize,cleanup}()" calls away from the arch-specific code, and into the generic code - and in the process protects it with the module_mutex so that the list operations are now safe. Future fixups: - move the module list handling code into kernel/module.c where it belongs. - get rid of 'module_bug_list' and just use the regular list of modules (called 'modules' - imagine that) that we already create and maintain for other reasons. Reported-and-tested-by: Thomas Gleixner <tglx@linutronix.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Adrian Bunk <bunk@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-05workqueue: add queue_work and activate_work trace pointsTejun Heo1-0/+3
These two tracepoints allow tracking when and how a work is queued and activated. This patch is based on Frederic's patch to add queue_work trace point. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com>
2010-10-05workqueue: prepare for more tracepointsTejun Heo1-3/+3
Define workqueue_work event class and use it for workqueue_execute_end trace point. Also, move trace/events/workqueue.h include downwards such that all struct definitions are visible to it. This is to prepare for more tracepoints and doesn't cause any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Frederic Weisbecker <fweisbec@gmail.com>
2010-10-04BKL: Remove BKL from cgroupJan Blunck1-8/+0
The BKL is only used in remount_fs and get_sb that are both protected by the superblocks s_umount rw_semaphore. Therefore it is safe to remove the BKL entirely. Signed-off-by: Jan Blunck <jblunck@infradead.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04BKL: Explicitly add BKL around get_sb/fill_superJan Blunck1-0/+4
This patch is a preparation necessary to remove the BKL from do_new_mount(). It explicitly adds calls to lock_kernel()/unlock_kernel() around get_sb/fill_super operations for filesystems that still uses the BKL. I've read through all the code formerly covered by the BKL inside do_kern_mount() and have satisfied myself that it doesn't need the BKL any more. do_kern_mount() is already called without the BKL when mounting the rootfs and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called from various places without BKL: simple_pin_fs(), nfs_do_clone_mount() through nfs_follow_mountpoint(), afs_mntpt_do_automount() through afs_mntpt_follow_link(). Both later functions are actually the filesystems follow_link inode operation. vfs_kern_mount() is calling the specified get_sb function and lets the filesystem do its job by calling the given fill_super function. Therefore I think it is safe to push down the BKL from the VFS to the low-level filesystems get_sb/fill_super operation. [arnd: do not add the BKL to those file systems that already don't use it elsewhere] Signed-off-by: Jan Blunck <jblunck@infradead.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Christoph Hellwig <hch@infradead.org>
2010-10-04genirq: Provide config option to disable deprecated codeThomas Gleixner5-5/+32
This option covers now the old chip functions and the irq_desc data fields which are moving to struct irq_data. More stuff will follow. Pretty handy for testing a conversion, whether something broke or not. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04perf_events: Fix invalid pointer when pid is invalidStephane Eranian1-1/+6
This patch fixes an error in perf_event_open() when the pid provided by the user is invalid. find_lively_task_by_vpid() does not return NULL on error but an error code. Without the fix the error code was silently passed to find_get_context() which would eventually cause a invalid pointer dereference. Signed-off-by: Stephane Eranian <eranian@google.com> Cc: peterz@infradead.org Cc: paulus@samba.org Cc: davem@davemloft.net Cc: fweisbec@gmail.com Cc: perfmon2-devel@lists.sf.net Cc: eranian@gmail.com Cc: robert.richter@amd.com LKML-Reference: <4ca9a5d1.e8e9d80a.3dbb.ffff8f2e@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for chip->retrigger()Thomas Gleixner2-2/+9
Wrap the old chip function retrigger() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121843.025801092@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for chip->set_wake()Thomas Gleixner2-2/+9
Wrap the old chip function set_wake() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.927527393@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for chip->set_type()Thomas Gleixner3-7/+15
Wrap the old chip function set_type() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.832261548@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for chip->set_affinity()Thomas Gleixner4-8/+19
Wrap the old chip function set_affinity() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.732894108@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for chip->startup()Thomas Gleixner4-16/+16
Wrap the old chip function startup() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.635152961@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for chip->disable()/shutdown()Thomas Gleixner5-22/+35
Wrap the old chip functions disable() and shutdown() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.532070631@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for chip->enable()Thomas Gleixner3-8/+21
Wrap the old chip function enable() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.437159182@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for chip->eoi()Thomas Gleixner1-3/+10
Wrap the old chip function eoi() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.339657617@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for chip->mask_ack()Thomas Gleixner1-6/+13
Wrap the old chip function mask_ack() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.240806983@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for chip->ack()Thomas Gleixner2-18/+12
Wrap the old chip function ack() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.142624725@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for chip->unmask()Thomas Gleixner4-9/+16
Wrap the old chip function unmask() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121842.043608928@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for chip->mask()Thomas Gleixner3-9/+16
Wrap the old chip function mask() until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121841.940355859@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide compat handling for bus_lock/bus_sync_unlockThomas Gleixner3-19/+35
Wrap the old chip functions for bus_lock/bus_sync_unlock until the migration is complete and the old chip functions are removed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121841.842536121@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Add new functions to dummy chipsThomas Gleixner2-17/+44
The compat functions go away when the core code is converted. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Convert core code to irq_dataThomas Gleixner10-105/+106
Convert all references in the core code to orq, chip, handler_data, chip_data, msi_desc, affinity to irq_data.* Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Create irq_dataThomas Gleixner1-20/+19
Low level chip functions need access to irq_desc->handler_data, irq_desc->chip_data and irq_desc->msi_desc. We hand down the irq number to the low level functions, so they need to lookup irq_desc. With sparse irq this means a radix tree lookup. We could hand down irq_desc itself, but low level chip functions have no need to fiddle with it directly and we want to restrict access to irq_desc further. Preparatory patch for new chip functions. Note, that the ugly anon union/struct is there to avoid a full tree wide clean up for now. This is not going to last 3 years like __do_IRQ() Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121841.645542300@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-04genirq: Provide KconfigThomas Gleixner1-0/+54
The generic irq Kconfig options are copied around all archs. Provide a generic Kconfig file which can be included. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> LKML-Reference: <20100927121843.217333624@linutronix.de> Reviewed-by: H. Peter Anvin <hpa@zytor.com> Reviewed-by: Ingo Molnar <mingo@elte.hu>
2010-10-01kfifo: fix scatterlist usageIra W. Snyder1-2/+0
The kfifo_dma family of functions use sg_mark_end() on the last element in their scatterlist. This forces use of a fresh scatterlist for each DMA operation, which makes recycling a single scatterlist impossible. Change the behavior of the kfifo_dma functions to match the usage of the dma_map_sg function. This means that users must respect the returned nents value. The sample code is updated to reflect the change. This bug is trivial to cause: call kfifo_dma_in_prepare() such that it prepares a scatterlist with a single entry comprising the whole fifo. This is the case when you map the entirety of a newly created empty fifo. This causes the setup_sgl() function to mark the first scatterlist entry as the end of the chain, no matter what comes after it. Afterwards, add and remove some data from the fifo such that another call to kfifo_dma_in_prepare() will create two scatterlist entries. It returns nents=2. However, due to the previous sg_mark_end() call, sg_is_last() will now return true for the first scatterlist element. This causes the sample code to print a single scatterlist element when it should print two. By removing the call to sg_mark_end(), we make the API as similar as possible to the DMA mapping API. All users are required to respect the returned nents. Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu> Cc: Stefani Seibold <stefani@seibold.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-24Merge branch 'tip/perf/core' of ↵Ingo Molnar6-9/+456
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/core
2010-09-23genirq: Cleanup irq_chip->typename leftoversThomas Gleixner1-2/+0
3 years transition phase is enough. Cleanup the last users and remove the cruft. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Leo Chen <leochen@broadcom.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Chris Zankel <chris@zankel.net>
2010-09-23rcu: Add tracing data to support queueing modelsPaul E. McKenney3-3/+13
The current tracing data is not sufficient to deduce the average time that a callback spends waiting for a grace period to end. Add three per-CPU counters recording the number of callbacks invoked (ci), the number of callbacks orphaned (co), and the number of callbacks adopted (ca). Given the existing callback queue length (ql), the average wait time in absence of CPU hotplug operations is ql/ci. The units of wait time will be in terms of the duration over which ci was measured. In the presence of CPU hotplug operations, there is room for argument, but ql/(ci-co+ca) won't steer you too far wrong. Also fixes a typo called out by Lucas De Marchi <lucas.de.marchi@gmail.com>. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-09-23rcu: fix sparse errors in rcutorture.cPaul E. McKenney1-4/+7
Add the sparse __rcu address-space identifier and make a couple of variables static. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-09-23kernel: Remove undead ifdef CONFIG_DEBUG_LOCK_ALLOCChristian Dietrich1-2/+0
The CONFIG_DEBUG_LOCK_ALLOC ifdef isn't necessary at this point, because it is checked in an outer ifdef level already and has no effect here. Signed-off-by: Christian Dietrich <qy03fugy@stud.informatik.uni-erlangen.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-09-23kernel/pm_qos_params.c: Remove unnecessary casts of private_dataJoe Perches1-1/+1
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-09-23powerpc: define a compat_sys_recv cond_syscallStephen Rothwell1-0/+1
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-09-23rmap: fix walk during forkAndrea Arcangeli1-1/+1
The below bug in fork led to the rmap walk finding the parent huge-pmd twice instead of just once, because the anon_vma_chain objects of the child vma still point to the vma->vm_mm of the parent. The patch fixes it by making the rmap walk accurate during fork. It's not a big deal normally but it worth being accurate considering the cost is the same. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-23jump label: Tracepoint support for jump labelsJason Baron1-2/+12
Make use of the jump label infrastructure for tracepoints. Signed-off-by: Jason Baron <jbaron@redhat.com> LKML-Reference: <a9ba2056e2c9cf332c3c300b577463ce66ff23a8.1284733808.git.jbaron@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-09-23jump label: Add jump_label_text_reserved() to reserve jump pointsJason Baron2-1/+85
Add a jump_label_text_reserved(void *start, void *end), so that other pieces of code that want to modify kernel text, can first verify that jump label has not reserved the instruction. Acked-by: Masami Hiramatsu <mhiramat@redhat.com> Signed-off-by: Jason Baron <jbaron@redhat.com> LKML-Reference: <06236663a3a7b1c1f13576bb9eccb6d9c17b7bfe.1284733808.git.jbaron@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-09-23jump label: Initialize workqueue tracepoints *before* they are registeredJason Baron1-5/+5
Initialize the workqueue data structures *before* they are registered so that they are ready for callbacks. Signed-off-by: Jason Baron <jbaron@redhat.com> LKML-Reference: <e3a3383fc370ac7086625bebe89d9480d7caf372.1284733808.git.jbaron@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-09-23jump label: Base patch for jump labelJason Baron4-1/+354
base patch to implement 'jump labeling'. Based on a new 'asm goto' inline assembly gcc mechanism, we can now branch to labels from an 'asm goto' statment. This allows us to create a 'no-op' fastpath, which can subsequently be patched with a jump to the slowpath code. This is useful for code which might be rarely used, but which we'd like to be able to call, if needed. Tracepoints are the current usecase that these are being implemented for. Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jason Baron <jbaron@redhat.com> LKML-Reference: <ee8b3595967989fdaf84e698dc7447d315ce972a.1284733808.git.jbaron@redhat.com> [ cleaned up some formating ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-09-22Merge branch 'linus' into perf/coreIngo Molnar2-5/+5
Conflicts: kernel/hw_breakpoint.c Merge reason: resolve the conflict. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-22softirqs: Make wakeup_softirqd staticThomas Gleixner1-1/+1
No users outside of kernel/softirq.c Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-09-22Merge branch 'sched-fixes-for-linus' of ↵Linus Torvalds2-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix nohz balance kick sched: Fix user time incorrectly accounted as system time on 32-bit
2010-09-22Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hw breakpoints: Fix pid namespace bug x86: Fix instruction breakpoint encoding oprofile: Add Support for Intel CPU Family 6 / Model 22 (Intel Celeron 540) kprobes: Fix Kconfig dependency
2010-09-21tracing/sched: Add sched_pi_setprio tracepointSteven Rostedt1-0/+1
Add a tracepoint that shows the priority of a task being boosted via priority inheritance. Cc: Gregory Haskins <ghaskins@novell.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2010-09-21sched: Give CPU bound RT tasks preferenceSteven Rostedt1-3/+5
If a high priority task is waking up on a CPU that is running a lower priority task that is bound to a CPU, see if we can move the high RT task to another CPU first. Note, if all other CPUs are running higher priority tasks than the CPU bounded current task, then it will be preempted regardless. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> LKML-Reference: <20100921024138.888922071@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-21sched: Try not to migrate higher priority RT tasksSteven Rostedt1-10/+12
When first working on the RT scheduler design, we concentrated on keeping all CPUs running RT tasks instead of having multiple RT tasks on a single CPU waiting for the migration thread to move them. Instead we take a more proactive stance and push or pull RT tasks from one CPU to another on wakeup or scheduling. When an RT task wakes up on a CPU that is running another RT task, instead of preempting it and killing the cache of the running RT task, we look to see if we can migrate the RT task that is waking up, even if the RT task waking up is of higher priority. This may sound a bit odd, but RT tasks should be limited in migration by the user anyway. But in practice, people do not do this, which causes high prio RT tasks to bounce around the CPUs. This becomes even worse when we have priority inheritance, because a high prio task can block on a lower prio task and boost its priority. When the lower prio task wakes up the high prio task, if it happens to be on the same CPU it will migrate off of it. But in reality, the above does not happen much either, because the wake up of the lower prio task, which has already been boosted, if it was on the same CPU as the higher prio task, it would then migrate off of it. But anyway, we do not want to migrate them either. To examine the scheduling, I created a test program and examined it under kernelshark. The test program created CPU * 2 threads, where each thread had a different priority. The program takes different options. The options used in this change log was to have priority inheritance mutexes or not. All threads did the following loop: static void grab_lock(long id, int iter, int l) { ftrace_write("thread %ld iter %d, taking lock %d\n", id, iter, l); pthread_mutex_lock(&locks[l]); ftrace_write("thread %ld iter %d, took lock %d\n", id, iter, l); busy_loop(nr_tasks - id); ftrace_write("thread %ld iter %d, unlock lock %d\n", id, iter, l); pthread_mutex_unlock(&locks[l]); } void *start_task(void *id) { [...] while (!done) { for (l = 0; l < nr_locks; l++) { grab_lock(id, i, l); ftrace_write("thread %ld iter %d sleeping\n", id, i); ms_sleep(id); } i++; } [...] } The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes to the ftrace buffer to help analyze via ftrace. The higher the id, the higher the prio, the shorter it does the busy loop, but the longer it spins. This is usually the case with RT tasks, the lower priority tasks usually run longer than higher priority tasks. At the end of the test, it records the number of loops each thread took, as well as the number of voluntary preemptions, non-voluntary preemptions, and number of migrations each thread took, taking the information from /proc/$$/sched and /proc/$$/status. Running this on a 4 CPU processor, the results without changes to the kernel looked like this: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 53 3220 1470 98 1: 562 773 724 98 2: 752 933 1375 98 3: 749 39 697 98 4: 758 5 515 98 5: 764 2 679 99 6: 761 2 535 99 7: 757 3 346 99 total: 5156 4977 6341 787 Each thread regardless of priority migrated a few hundred times. The higher priority tasks, were a little better but still took quite an impact. By letting higher priority tasks bump the lower prio task from the CPU, things changed a bit: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 37 2835 1937 98 1: 666 1821 1865 98 2: 654 1003 1385 98 3: 664 635 973 99 4: 698 197 352 99 5: 703 101 159 99 6: 708 1 75 99 7: 713 1 2 99 total: 4843 6594 6748 789 The total # of migrations did not change (several runs showed the difference all within the noise). But we now see a dramatic improvement to the higher priority tasks. (kernelshark showed that the watchdog timer bumped the highest priority task to give it the 2 count. This was actually consistent with every run). Notice that the # of iterations did not change either. The above was with priority inheritance mutexes. That is, when the higher prority task blocked on a lower priority task, the lower priority task would inherit the higher priority task (which shows why task 6 was bumped so many times). When not using priority inheritance mutexes, the current kernel shows this: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 56 3101 1892 95 1: 594 713 937 95 2: 625 188 618 95 3: 628 4 491 96 4: 640 7 468 96 5: 631 2 501 96 6: 641 1 466 96 7: 643 2 497 96 total: 4458 4018 5870 765 Not much changed with or without priority inheritance mutexes. But if we let the high priority task bump lower priority tasks on wakeup we see: Task vol nonvol migrated iterations ---- --- ------ -------- ---------- 0: 115 3439 2782 98 1: 633 1354 1583 99 2: 652 919 1218 99 3: 645 713 934 99 4: 690 3 3 99 5: 694 1 4 99 6: 720 3 4 99 7: 747 0 1 100 Which shows a even bigger change. The big difference between task 3 and task 4 is because we have only 4 CPUs on the machine, causing the 4 highest prio tasks to always have preference. Although I did not measure cache misses, and I'm sure there would be little to measure since the test was not data intensive, I could imagine large improvements for higher priority tasks when dealing with lower priority tasks. Thus, I'm satisfied with making the change and agreeing with what Gregory Haskins argued a few years ago when we first had this discussion. One final note. All tasks in the above tests were RT tasks. Any RT task will always preempt a non RT task that is running on the CPU the RT task wants to run on. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Gregory Haskins <ghaskins@novell.com> LKML-Reference: <20100921024138.605460343@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-09-21sched: Increment cache_nice_tries only on periodic lbVenkatesh Pallipadi1-1/+8
scheduler uses cache_nice_tries as an indicator to do cache_hot and active load balance, when normal load balance fails. Currently, this value is changed on any failed load balance attempt. That ends up being not so nice to workloads that enter/exit idle often, as they do more frequent new_idle balance and that pretty soon results in cache hot tasks being pulled in. Making the cache_nice_tries ignore failed new_idle balance seems to make better sense. With that only the failed load balance in periodic load balance gets accounted and the rate of accumulation of cache_nice_tries will not depend on idle entry/exit (short running sleep-wakeup kind of tasks). This reduces movement of cache_hot tasks. schedstat diff (after-before) excerpt from a workload that has frequent and short wakeup-idle pattern (:2 in cpu col below refers to NEWIDLE idx) This snapshot was across ~400 seconds. Without this change: domainstats: domain0 cpu cnt bln fld imb gain hgain nobusyq nobusyg 0:2 306487 219575 73167 110069413 44583 19070 1172 218403 1:2 292139 194853 81421 120893383 50745 21902 1259 193594 2:2 283166 174607 91359 129699642 54931 23688 1287 173320 3:2 273998 161788 93991 132757146 57122 24351 1366 160422 4:2 289851 215692 62190 83398383 36377 13680 851 214841 5:2 316312 222146 77605 117582154 49948 20281 988 221158 6:2 297172 195596 83623 122133390 52801 21301 929 194667 7:2 283391 178078 86378 126622761 55122 22239 928 177150 8:2 297655 210359 72995 110246694 45798 19777 1125 209234 9:2 297357 202011 79363 119753474 50953 22088 1089 200922 10:2 278797 178703 83180 122514385 52969 22726 1128 177575 11:2 272661 167669 86978 127342327 55857 24342 1195 166474 12:2 293039 204031 73211 110282059 47285 19651 948 203083 13:2 289502 196762 76803 114712942 49339 20547 1016 195746 14:2 264446 169609 78292 115715605 50459 21017 982 168627 15:2 260968 163660 80142 116811793 51483 21281 1064 162596 With this change: domainstats: domain0 cpu cnt bln fld imb gain hgain nobusyq nobusyg 0:2 272347 187380 77455 105420270 24975 1 953 186427 1:2 267276 172360 86234 116242264 28087 6 1028 171332 2:2 259769 156777 93281 123243134 30555 1 1043 155734 3:2 250870 143129 97627 127370868 32026 6 1188 141941 4:2 248422 177116 64096 78261112 22202 2 757 176359 5:2 275595 180683 84950 116075022 29400 6 778 179905 6:2 262418 162609 88944 119256898 31056 4 817 161792 7:2 252204 147946 92646 122388300 32879 4 824 147122 8:2 262335 172239 81631 110477214 26599 4 864 171375 9:2 261563 164775 88016 117203621 28331 3 849 163926 10:2 243389 140949 93379 121353071 29585 2 909 140040 11:2 242795 134651 98310 124768957 30895 2 1016 133635 12:2 255234 166622 79843 104696912 26483 4 746 165876 13:2 244944 151595 83855 109808099 27787 3 801 150794 14:2 241301 140982 89935 116954383 30403 6 845 140137 15:2 232271 128564 92821 119185207 31207 4 1416 127148 Signed-off-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1284167957-3675-1-git-send-email-venki@google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>