path: root/kernel
Age    Commit message    Author    Files    Lines
2011-10-04Merge branch 'linus' into sched/coreIngo Molnar10-53/+43
Merge reason: pick up the latest fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-01Merge branches 'irq-urgent-for-linus', 'x86-urgent-for-linus' and ↵Linus Torvalds4-30/+11
'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip

* 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  irq: Fix check for already initialized irq_domain in irq_domain_add
  irq: Add declaration of irq_domain_simple_ops to irqdomain.h

* 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  x86/rtc: Don't recursively acquire rtc_lock

* 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  posix-cpu-timers: Cure SMP wobbles
  sched: Fix up wchan borkage
  sched/rt: Migrate equal priority tasks to available CPUs
2011-09-30posix-cpu-timers: Cure SMP wobblesPeter Zijlstra2-26/+3
David reported:

Attached below is a watered-down version of rt/tst-cpuclock2.c from GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or similar.

Run it several times, and you will see cases where the main thread will measure a process clock difference before and after the nanosleep which is smaller than the cpu-burner thread's individual thread clock difference. This doesn't make any sense since the cpu-burner thread is part of the top-level process's thread group. I've reproduced this on both x86-64 and sparc64 (using both 32-bit and 64-bit binaries).

For example:

 [davem@boricha build-x86_64-linux]$ ./test
 process: before(0.001221967) after(0.498624371) diff(497402404)
 thread: before(0.000081692) after(0.498316431) diff(498234739)
 self: before(0.001223521) after(0.001240219) diff(16698)
 [davem@boricha build-x86_64-linux]$

The diff of 'process' should always be >= the diff of 'thread'. I make sure to wrap the 'thread' clock measurements the most tightly around the nanosleep() call, and that the 'process' clock measurements are the outer-most ones.

---
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>

static pthread_barrier_t barrier;

static void *chew_cpu(void *arg)
{
    pthread_barrier_wait(&barrier);
    while (1)
        __asm__ __volatile__("" : : : "memory");
    return NULL;
}

int main(void)
{
    clockid_t process_clock, my_thread_clock, th_clock;
    struct timespec process_before, process_after;
    struct timespec me_before, me_after;
    struct timespec th_before, th_after;
    struct timespec sleeptime;
    unsigned long diff;
    pthread_t th;
    int err;

    err = clock_getcpuclockid(0, &process_clock);
    if (err)
        return 1;
    err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
    if (err)
        return 1;
    pthread_barrier_init(&barrier, NULL, 2);
    err = pthread_create(&th, NULL, chew_cpu, NULL);
    if (err)
        return 1;
    err = pthread_getcpuclockid(th, &th_clock);
    if (err)
        return 1;
    pthread_barrier_wait(&barrier);
    err = clock_gettime(process_clock, &process_before);
    if (err)
        return 1;
    err = clock_gettime(my_thread_clock, &me_before);
    if (err)
        return 1;
    err = clock_gettime(th_clock, &th_before);
    if (err)
        return 1;
    sleeptime.tv_sec = 0;
    sleeptime.tv_nsec = 500000000;
    nanosleep(&sleeptime, NULL);
    err = clock_gettime(th_clock, &th_after);
    if (err)
        return 1;
    err = clock_gettime(my_thread_clock, &me_after);
    if (err)
        return 1;
    err = clock_gettime(process_clock, &process_after);
    if (err)
        return 1;
    diff = process_after.tv_nsec - process_before.tv_nsec;
    printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
           process_before.tv_sec, process_before.tv_nsec,
           process_after.tv_sec, process_after.tv_nsec, diff);
    diff = th_after.tv_nsec - th_before.tv_nsec;
    printf("thread: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
           th_before.tv_sec, th_before.tv_nsec,
           th_after.tv_sec, th_after.tv_nsec, diff);
    diff = me_after.tv_nsec - me_before.tv_nsec;
    printf("self: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
           me_before.tv_sec, me_before.tv_nsec,
           me_after.tv_sec, me_after.tv_nsec, diff);
    return 0;
}

This is due to us using p->se.sum_exec_runtime in thread_group_cputime() where we iterate the thread group and sum all data. This does not take time since the last schedule operation (tick or otherwise) into account. We can cure this by using task_sched_runtime() at the cost of having to take locks.
This also means we can (and must) do away with thread_group_sched_runtime() since the modified thread_group_cputime() is now more accurate and would deadlock when called from thread_group_sched_runtime(). Aside from that, it makes the function safe on 32-bit systems. The old code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a 64-bit value and could be changed on another cpu at the same time. Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins Tested-by: David Miller <davem@davemloft.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-30Resource: fix wrong resource window calculationRam Pai1-1/+6
__find_resource() incorrectly returns a resource window which overlaps an existing allocated window. This happens when the parent's resource-window spans 0x00000000 to 0xffffffff and is entirely allocated to all its children resource-windows. __find_resource() looks for gaps in resource allocation among the children resource windows. When it encounters the last child window it blindly tries the range next to the one allocated to the last child. Since the last child's window ends at 0xffffffff the calculation overflows, leading the algorithm to believe that any window in the range 0x00000000 to 0xffffffff is available for allocation. This leads to a conflicting window allocation. Michal Ludvig reported seeing this issue on his platform. The following patch fixes the problem and has been verified by Michal. I believe this bug has been there for ages. It got exposed by git commit 2bbc6942273b ("PCI : ability to relocate assigned pci-resources") Signed-off-by: Ram Pai <linuxram@us.ibm.com> Tested-by: Michal Ludvig <mludvig@logix.net.nz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
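The wrap-around can be reproduced in isolation. The following standalone model (illustrative only, not the kernel's __find_resource() code) shows how "end of last child + 1" overflows to zero when the last child ends at the top of a 32-bit resource space:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t last_child_end = 0xffffffffu;          /* last allocated child window */
    uint32_t next_candidate = last_child_end + 1;   /* wraps around to 0x00000000 */

    if (next_candidate <= last_child_end)
        printf("overflow: candidate start 0x%08x lies inside already-allocated space\n",
               (unsigned int)next_candidate);
    return 0;
}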
2011-09-26sched: Remove redundant test in check_preempt_tick()Wang Xingchao1-8/+8
The caller already checks for nr_running > 1, therefore we don't have to do so again. Signed-off-by: Wang Xingchao <xingchao.wang@intel.com> Reviewed-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1316194552-12019-1-git-send-email-xingchao.wang@intel.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-26sched: Fix up wchan borkageSimon Kirby1-1/+1
Commit c259e01a1ec ("sched: Separate the scheduler entry for preemption") contained a boo-boo wrecking wchan output. It forgot to put the new schedule() function in the __sched section, so it doesn't get properly ignored for things like wchan. Tested-by: Simon Kirby <sim@hostway.ca> Cc: stable@kernel.org # 2.6.39+ Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110923000346.GA25425@hostway.ca Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-25ptrace: PTRACE_LISTEN forgets to unlock ->siglockOleg Nesterov1-13/+10
If PTRACE_LISTEN fails after lock_task_sighand() it doesn't drop ->siglock. Reported-by: Matt Fleming <matt.fleming@intel.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-20irq: Fix check for already initialized irq_domain in irq_domain_addRob Herring1-1/+5
The sanity check in irq_domain_add() tests desc->irq_data != NULL or irq_data->domain != NULL. This prevents adding an irq_domain to an irq descriptor when irq_data exists, which is always true when the irq descriptor exists. This went unnoticed so far as the simple domain code did not enter this code path because domain->nr_irqs is always 0 for the simple domains. Split the check for irq_data == NULL out and have a separate warning for it. [ tglx: Made the check for irq_data == NULL separate ] Signed-off-by: Rob Herring <rob.herring@calxeda.com> Cc: Grant Likely <grant.likely@secretlab.ca> Cc: marc.zyngier@arm.com Cc: thomas.abraham@linaro.org Cc: jamie@jamieiles.com Cc: b-cousson@ti.com Cc: shawn.guo@linaro.org Cc: linux-arm-kernel@lists.infradead.org Cc: devicetree-discuss@lists.ozlabs.org Link: http://lkml.kernel.org/r/1316017900-19918-3-git-send-email-robherring2@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-09-20Merge branch 'irq-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tipLinus Torvalds1-1/+1
* 'irq-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  x86, iommu: Mark DMAR IRQ as non-threaded
  genirq: Make irq_shutdown() symmetric vs. irq_startup again
2011-09-20Make taskstats round statistics down to nearest 1k bytes/eventsLinus Torvalds1-7/+8
Even with just the interface limited to admin, there really is little reason to give byte-per-byte counts for taskstats. So round it down to something less intrusive. Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
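A minimal sketch of the rounding idea in plain C (the helper name and the exact 1024-unit granularity are illustrative assumptions, not lifted from the taskstats code):

#include <stdio.h>

/* report counters only at 1k granularity by clearing the low 10 bits */
static unsigned long long round_down_1k(unsigned long long v)
{
    return v & ~0x3ffULL;
}

int main(void)
{
    printf("%llu -> %llu\n", 123456789ULL, round_down_1k(123456789ULL));
    return 0;
}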
2011-09-20Make TASKSTATS require root accessLinus Torvalds1-0/+1
Ok, this isn't optimal, since it means that 'iotop' needs admin capabilities, and we may have to work on this some more. But at the same time it is very much not acceptable to let anybody just read anybody else's IO statistics quite at this level. Use of GENL_ADMIN_PERM was suggested by Johannes Berg as an alternative to checking the capabilities by hand. Reported-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Johannes Berg <johannes.berg@intel.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-18Merge branch 'linus' into sched/coreIngo Molnar14-61/+203
Merge reason: We are queueing up a dependent patch. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-18sched/rt: Migrate equal priority tasks to available CPUsShawn Bohrer1-2/+2
Commit 43fa5460fe60dea5c610490a1d263415419c60f6 ("sched: Try not to migrate higher priority RT tasks") also introduced a change in behavior which keeps RT tasks on the same CPU if there is an equal priority RT task currently running even if there are empty CPUs available. This can cause unnecessary wakeup latencies, and can prevent the scheduler from balancing all RT tasks across available CPUs. With this change, an RT task will search for a new CPU on wakeup if an equal priority RT task is already running on its current CPU. Lower priority tasks will still have to wait on higher priority tasks, but the system should still balance out because there is always the possibility that if there are both high and low priority RT tasks on a given CPU the high priority task could wake up while the low priority task is running and force it to search for a better runqueue. Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org # 37+ Link: http://lkml.kernel.org/r/1315837684-18733-1-git-send-email-sbohrer@rgmadvisors.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-09-15workqueue: lock cwq access in drain_workqueueThomas Tuttle1-1/+6
Take cwq->gcwq->lock to avoid racing between drain_workqueue checking to make sure the workqueues are empty and cwq_dec_nr_in_flight decrementing and then incrementing nr_active when it activates a delayed work. We discovered this when a corner case in one of our drivers resulted in us trying to destroy a workqueue in which the remaining work would always requeue itself again in the same workqueue. We would hit this race condition and trip the BUG_ON on workqueue.c:3080. Signed-off-by: Thomas Tuttle <ttuttle@chromium.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-12genirq: Make irq_shutdown() symmetric vs. irq_startup againGeert Uytterhoeven1-1/+1
If an irq_chip provides .irq_shutdown(), but neither .irq_disable() nor .irq_mask(), free_irq() crashes when jumping to NULL. Fix this by only trying .irq_disable() and .irq_mask() if there's no .irq_shutdown() provided. This revives the symmetry with irq_startup(), which tries .irq_startup(), .irq_enable(), and .irq_unmask(), and makes it consistent with the comment for irq_chip.irq_shutdown() in <linux/irq.h>, which says: * @irq_shutdown: shut down the interrupt (defaults to ->disable if NULL) This is also how __free_irq() behaved before the big overhaul, cfr. e.g. 3b56f0585fd4c02d047dc406668cb40159b2d340 ("genirq: Remove bogus conditional"), where the core interrupt code always overrode .irq_shutdown() to .irq_disable() if .irq_shutdown() was NULL. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: linux-m68k@lists.linux-m68k.org Link: http://lkml.kernel.org/r/1315742394-16036-2-git-send-email-geert@linux-m68k.org Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
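The intended fallback order can be modelled with a small standalone program; this is an illustrative sketch of the described behaviour, not the genirq code itself:

#include <stdio.h>

struct chip_model {
    void (*irq_shutdown)(void);
    void (*irq_disable)(void);
    void (*irq_mask)(void);
};

/* use .irq_shutdown when present, otherwise fall back to disable, then mask */
static void shutdown_irq_model(struct chip_model *c)
{
    if (c->irq_shutdown)
        c->irq_shutdown();
    else if (c->irq_disable)
        c->irq_disable();
    else if (c->irq_mask)
        c->irq_mask();
}

static void mask_only(void) { printf("masked\n"); }

int main(void)
{
    struct chip_model c = { .irq_mask = mask_only };  /* no shutdown/disable callback */

    shutdown_irq_model(&c);  /* must not jump through a NULL pointer */
    return 0;
}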
2011-09-08Merge branch 'timers-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tipLinus Torvalds1-5/+13
* 'timers-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  rtc: twl: Fix registration vs. init order
  rtc: Initialized rtc_time->tm_isdst
  rtc: Fix RTC PIE frequency limit
  rtc: rtc-twl: Remove lockdep related local_irq_enable()
  rtc: rtc-twl: Switch to using threaded irq
  rtc: ep93xx: Fix 'rtc' may be used uninitialized warning
  alarmtimers: Avoid possible denial of service with high freq periodic timers
  alarmtimers: Memset itimerspec passed into alarm_timer_get
  alarmtimers: Avoid possible null pointer traversal
2011-09-08Merge branch 'sched-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tipLinus Torvalds1-15/+26
* 'sched-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
  sched: Fix a memory leak in __sdt_free()
  sched: Move blk_schedule_flush_plug() out of __schedule()
  sched: Separate the scheduler entry for preemption
2011-08-31perf_event: Fix broken calc_timer_values()Eric B Munson1-2/+2
We detected a serious issue with PERF_SAMPLE_READ and timing information when events were being multiplexed. Samples would have time_running > time_enabled. That was easy to reproduce with a libpfm4 example (ran 3 times to cause multiplexing on Core 2):

 $ syst_smpl -e uops_retired:freq=1 &
 $ syst_smpl -e uops_retired:freq=1 &
 $ syst_smpl -e uops_retired:freq=1 &
 IIP:0x0000000040062d ... PERIOD:2355332948 ENA=40144625315 RUN=60014875184
 syst_smpl: WARNING: time_running > time_enabled
 63277537998 uops_retired:freq=1 , scaled

The bug was not present in kernels up to (and including) 3.0. It turns out the bug was introduced by the following commit:

 commit c4794295917ebeda8013b6cb9c8d71ab4f74a1fa
 events: Move lockless timer calculation into helper function

The parameters of the function got reversed yet the call sites were not updated to reflect the change. That led to time_running and time_enabled being swapped. That had no effect when there was no multiplexing because in that case time_running = time_enabled, but it would show up in any other scenario. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110829124112.GA4828@quad Signed-off-by: Ingo Molnar <mingo@elte.hu>
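This class of bug is easy to demonstrate outside the kernel. The following self-contained model (names are illustrative, not the perf code) swaps two output parameters at a call site and produces exactly the time_running > time_enabled symptom:

#include <stdio.h>

static void calc_timer_values_model(unsigned long long now,
                                    unsigned long long *enabled,
                                    unsigned long long *running)
{
    *enabled = now;       /* stand-ins for the real accounting */
    *running = now / 2;
}

int main(void)
{
    unsigned long long enabled, running;

    /* buggy call site: the two output arguments are reversed */
    calc_timer_values_model(1000, &running, &enabled);
    if (running > enabled)
        printf("time_running (%llu) > time_enabled (%llu)\n", running, enabled);
    return 0;
}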
2011-08-29perf events: Fix slow and broken cgroup context switch codeStephane Eranian2-11/+54
The current cgroup context switch code was incorrect, leading to bogus counts. Furthermore, as soon as there was an active cgroup event on a CPU, the context switch cost on that CPU would increase by a significant amount as demonstrated by a simple ping/pong example:

 $ ./pong
 Both processes pinned to CPU1, running for 10s
 10684.51 ctxsw/s

Now start a cgroup perf stat:

 $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100

 $ ./pong
 Both processes pinned to CPU1, running for 10s
 6674.61 ctxsw/s

That's a 37% penalty. Note that pong is not even in the monitored cgroup. The results shown by perf stat are bogus:

 $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100
 Performance counter stats for 'sleep 100':
 CPU1 <not counted> cycles test
 CPU1 16,984,189,138 cycles # 0.000 GHz

The second 'cycles' event should report a count @ CPU clock (here 2.4GHz) as it is counting across all cgroups. The patch below fixes the bogus accounting and bypasses any cgroup switches in case the outgoing and incoming tasks are in the same cgroup. With this patch the same test now yields:

 $ ./pong
 Both processes pinned to CPU1, running for 10s
 10775.30 ctxsw/s

Start perf stat with cgroup:

 $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10

Run pong outside the cgroup:

 $ ./pong
 Both processes pinned to CPU1, running for 10s
 10687.80 ctxsw/s

The penalty is now less than 2%. And the results for perf stat are correct:

 $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
 Performance counter stats for 'sleep 10':
 CPU1 <not counted> cycles test # 0.000 GHz
 CPU1 23,933,981,448 cycles # 0.000 GHz

Now perf stat reports the correct counts for the non-cgroup event. If we run pong inside the cgroup, then we also get the correct counts:

 $ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
 Performance counter stats for 'sleep 10':
 CPU1 22,297,726,205 cycles test # 0.000 GHz
 CPU1 23,933,981,448 cycles # 0.000 GHz
 10.001457237 seconds time elapsed

Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad Signed-off-by: Ingo Molnar <mingo@elte.hu>
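The bypass itself is conceptually tiny; here is a toy, self-contained model of the idea (not the perf_event code, with field names invented for illustration): skip all cgroup-event switching work when the outgoing and incoming tasks share a cgroup.

#include <stdio.h>

struct task_model { int cgroup_id; };

static void perf_cgroup_switch_model(const struct task_model *prev,
                                     const struct task_model *next)
{
    if (prev->cgroup_id == next->cgroup_id)
        return;                       /* same cgroup: nothing to do */
    printf("switch cgroup events %d -> %d\n", prev->cgroup_id, next->cgroup_id);
}

int main(void)
{
    struct task_model a = { 1 }, b = { 1 }, c = { 2 };

    perf_cgroup_switch_model(&a, &b); /* skipped, avoids the overhead */
    perf_cgroup_switch_model(&a, &c); /* performs the switch */
    return 0;
}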
2011-08-29sched: Fix a memory leak in __sdt_free()WANG Cong1-0/+1
This patch fixes the following memory leak:

 unreferenced object 0xffff880107266800 (size 512):
   comm "sched-powersave", pid 3718, jiffies 4323097853 (age 27495.450s)
   hex dump (first 32 bytes):
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
   backtrace:
     [<ffffffff81133940>] create_object+0x187/0x28b
     [<ffffffff814ac103>] kmemleak_alloc+0x73/0x98
     [<ffffffff811232ba>] __kmalloc_node+0x104/0x159
     [<ffffffff81044b98>] kzalloc_node.clone.97+0x15/0x17
     [<ffffffff8104cb90>] build_sched_domains+0xb7/0x7f3
     [<ffffffff8104d4df>] partition_sched_domains+0x1db/0x24a
     [<ffffffff8109ee4a>] do_rebuild_sched_domains+0x3b/0x47
     [<ffffffff810a00c7>] rebuild_sched_domains+0x10/0x12
     [<ffffffff8104d5ba>] sched_power_savings_store+0x6c/0x7b
     [<ffffffff8104d5df>] sched_mc_power_savings_store+0x16/0x18
     [<ffffffff8131322c>] sysdev_class_store+0x20/0x22
     [<ffffffff81193876>] sysfs_write_file+0x108/0x144
     [<ffffffff81135b10>] vfs_write+0xaf/0x102
     [<ffffffff81135d23>] sys_write+0x4d/0x74
     [<ffffffff814c8a42>] system_call_fastpath+0x16/0x1b
     [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: WANG Cong <amwang@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org # 3.0 Link: http://lkml.kernel.org/r/1313671017-4112-1-git-send-email-amwang@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-29sched: Move blk_schedule_flush_plug() out of __schedule()Thomas Gleixner1-10/+15
There is no real reason to run blk_schedule_flush_plug() with interrupts and preemption disabled. Move it into schedule() and call it when the task is going voluntarily to sleep. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from being woken immediately after switching away. This fixes a deadlock in the scheduler where the blk_schedule_flush_plug() callchain enables interrupts and thereby allows a wakeup of the task that is about to go to sleep. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@kernel.org # 2.6.39+ Link: http://lkml.kernel.org/n/tip-dwfxtra7yg1b5r65m32ywtct@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-29sched: Separate the scheduler entry for preemptionThomas Gleixner1-5/+10
Block-IO and workqueues call into notifier functions from the scheduler core code with interrupts and preemption disabled. These calls should be made before entering the scheduler core. To simplify this, separate the scheduler core code into __schedule(). __schedule() is directly called from the places which set PREEMPT_ACTIVE and from schedule(). This allows us to add the work checks into schedule(), so they are only called when a task voluntarily goes to sleep. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@kernel.org # 2.6.39+ Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de Signed-off-by: Ingo Molnar <mingo@elte.hu>
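The shape of the split, together with the blk_schedule_flush_plug() move from the entry above, can be sketched as a standalone model (function names here are simplified stand-ins, not the kernel implementation):

#include <stdio.h>

static void blk_flush_plug_model(void)  { printf("flush plugged block requests\n"); }
static void __schedule_model(void)      { printf("core scheduler entry\n"); }

/* voluntary sleep path: run the submit-work hooks, then enter the core */
static void schedule_model(void)
{
    blk_flush_plug_model();
    __schedule_model();
}

/* preemption path (PREEMPT_ACTIVE): enter the core directly, no hooks */
static void preempt_schedule_model(void)
{
    __schedule_model();
}

int main(void)
{
    schedule_model();
    preempt_schedule_model();
    return 0;
}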
2011-08-27All Arch: remove linkage for sys_nfsservctl system callNeilBrown1-1/+0
The nfsservctl system call is now gone, so we should remove all linkage for it. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-26kernel/printk: do not turn off bootconsole in printk_late_init() if keep_bootconNishanth Aravamudan1-1/+1
It seems that 7bf693951a8e ("console: allow to retain boot console via boot option keep_bootcon") doesn't always achieve what it aims for: when printk_late_init() runs it unconditionally turns off all boot consoles. With this patch, I am able to see more messages on the boot console in KVM guests than I can without it, when keep_bootcon is specified. I think it is appropriate for the relevant -stable trees. However, it's more of an annoyance than a serious bug (ideally you don't need to keep the boot console around, as console handover should be working -- I was encountering a situation where the console handover wasn't working and not having the boot console available meant I couldn't see why). Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Greg KH <gregkh@suse.de> Acked-by: Fabio M. Di Nitto <fdinitto@redhat.com> Cc: <stable@kernel.org> [2.6.39.x, 3.0.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-25Add a personality to report 2.6.x version numbersAndi Kleen1-0/+38
I ran into a couple of programs which broke with the new Linux 3.0 version. Some of those were binary only. I tried to use LD_PRELOAD to work around it, but it was quite difficult and in one case impossible because of a mix of 32-bit and 64-bit executables. For example, all kinds of management software from HP don't work unless we pretend to run a 2.6 kernel.

 $ uname -a
 Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux
 $ hpacucli ctrl all show
 Error: No controllers detected.
 $ rpm -qf /usr/sbin/hpacucli
 hpacucli-8.75-12.0

Another notable case is that Python now reports "linux3" from sys.platform(); which in turn can break things that were checking sys.platform() == "linux2": https://bugzilla.mozilla.org/show_bug.cgi?id=664564 It seems pretty clear to me, though, that it's a bug in the apps that are using '==' instead of .startswith(), but this allows us to unbreak broken programs. This patch adds a UNAME26 personality that makes the kernel report a 2.6.40+x version number instead. The x is the x in 3.x. I know this is somewhat ugly, but I didn't find a better workaround, and compatibility to existing programs is important. Some programs also read /proc/sys/kernel/osrelease. This can be worked around in user space with mount --bind (and a mount namespace). To use:

 wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
 gcc -o uname26 uname26.c
 ./uname26 program

Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
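As a quick illustration of what the affected programs actually look at, here is a minimal user-space check of the reported release string (plain uname(2) usage, not the patch itself; under the UNAME26 personality a 3.x kernel would be reported as 2.6.(40 + x)):

#include <stdio.h>
#include <string.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname u;

    if (uname(&u) != 0)
        return 1;
    /* broken apps do an exact match against the "2.6" prefix or "linux2" */
    printf("release: %s (starts with 2.6: %s)\n", u.release,
           strncmp(u.release, "2.6", 3) == 0 ? "yes" : "no");
    return 0;
}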
2011-08-23Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds2-2/+2
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
  xfs: fix tracing builds inside the source tree
  xfs: remove subdirectories
  xfs: don't expect xfs headers to be in subdirectories
2011-08-23Revert "irq: Always set IRQF_ONESHOT if no primary handler is specified"Linus Torvalds1-1/+0
This reverts commit f3637a5f2e2eb391ff5757bc83fb5de8f9726464. It turns out that this breaks several drivers, one example being OMAP boards which use the on-board OMAP UARTs and the omap-serial driver that will not boot to userspace after the commit. Paul Walmsley reports that enabling CONFIG_DEBUG_SHIRQ reveals 'IRQ handler type mismatch' errors: IRQ handler type mismatch for IRQ 74 current handler: serial idle ... and the reason is that setting IRQF_ONESHOT will now result in those interrupt handlers having different IRQF flags, and thus being unsharable. So the commit log in the reverted commit: "Since it is required for those users and there is no difference for others it makes sense to add this flag unconditionally." is simply not true: there may not be any difference from an "actions at irq time" point of view, but there is a *big* difference wrt how this flag is tested in irq management (see __setup_irq() in kernel/irq/manage.c). One solution may be to stop verifying IRQF_ONESHOT in __setup_irq(), but right now the safe course of action is to revert the change. Let's revisit this in a later merge window. Reported-by: Paul Walmsley <paul@pwsan.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Requested-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-19Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds1-5/+16
* 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
  Revert "cfq: Remove special treatment for metadata rqs."
  block: fix flush machinery for stacking drivers with differring flush flags
  block: improve rq_affinity placement
  blktrace: add FLUSH/FUA support
  Move some REQ flags to the common bio/request area
  allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
  xen/blkback: Make description more obvious.
  cfq-iosched: Add documentation about idling
  block: Make rq_affinity = 1 work as expected
  block: swim3: fix unterminated of_device_id table
  block/genhd.c: remove useless cast in diskstats_show()
  drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
  drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
  bsg-lib: add module.h include
  cfq-iosched: Reduce linked group count upon group destruction
  blk-throttle: correctly determine sync bio
  loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
  loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
  loop: add management interface for on-demand device allocation
  loop: replace linked list of allocated devices with an idr index
  ...
2011-08-19irqdesc: fix new kernel-doc warningRandy Dunlap1-0/+1
Fix kernel-doc warning in irqdesc.c: Warning(kernel/irq/irqdesc.c:353): No description found for parameter 'owner' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-18Merge branch 'pm-fixes' of ↵Linus Torvalds1-0/+4
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / Domains: Fix build for CONFIG_PM_RUNTIME unset
2011-08-17Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds1-1/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  lockdep: Fix wrong assumption in match_held_lock
2011-08-17Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds3-18/+40
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  irq: Track the owner of irq descriptor
  irq: Always set IRQF_ONESHOT if no primary handler is specified
  genirq: Fix wrong bit operation
2011-08-14PM / Domains: Fix build for CONFIG_PM_RUNTIME unsetRafael J. Wysocki1-0/+4
Function genpd_queue_power_off_work() is not defined for CONFIG_PM_RUNTIME, so pm_genpd_poweroff_unused() causes a build error to happen in that case. Fix the problem by making pm_genpd_poweroff_unused() depend on CONFIG_PM_RUNTIME too. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2011-08-14sched: Return unused runtime on group dequeuePaul Turner2-1/+122
When a local cfs_rq blocks we return the majority of its remaining quota to the global bandwidth pool for use by other runqueues. We do this only when the quota is current and there is more than min_cfs_rq_quota [1ms by default] of runtime remaining on the rq. In the case where there are throttled runqueues and we have sufficient bandwidth to meter out a slice, a second timer is kicked off to handle this delivery, unthrottling where appropriate. Using a 'worst case' antagonist which executes on each cpu for 1ms before moving onto the next on a fairly large machine: no quota generations: 197.47 ms /cgroup/a/cpuacct.usage 199.46 ms /cgroup/a/cpuacct.usage 205.46 ms /cgroup/a/cpuacct.usage 198.46 ms /cgroup/a/cpuacct.usage 208.39 ms /cgroup/a/cpuacct.usage Since we are allowed to use "stale" quota our usage is effectively bounded by the rate of input into the global pool and performance is relatively stable. with quota generations [1s increments]: 119.58 ms /cgroup/a/cpuacct.usage 119.65 ms /cgroup/a/cpuacct.usage 119.64 ms /cgroup/a/cpuacct.usage 119.63 ms /cgroup/a/cpuacct.usage 119.60 ms /cgroup/a/cpuacct.usage The large deficit here is due to quota generations (/intentionally/) preventing us from now using previously stranded slack quota. The cost is that this quota becomes unavailable. with quota generations and quota return: 200.09 ms /cgroup/a/cpuacct.usage 200.09 ms /cgroup/a/cpuacct.usage 198.09 ms /cgroup/a/cpuacct.usage 200.09 ms /cgroup/a/cpuacct.usage 200.06 ms /cgroup/a/cpuacct.usage By returning unused quota we're able to both stably consume our desired quota and prevent unintentional overages due to the abuse of slack quota from previous quota periods (especially on a large machine). Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184758.306848658@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
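A self-contained sketch of the slack-return idea (the 1 ms reserve mirrors the min_cfs_rq_quota mentioned above; variable and function names are illustrative, not the kernel code):

#include <stdio.h>

#define MIN_CFS_RQ_RUNTIME 1000000LL  /* 1 ms reserve, in ns */

/* when a local queue blocks, hand everything above the reserve back to the pool */
static void return_cfs_rq_runtime_model(long long *local, long long *global)
{
    long long slack = *local - MIN_CFS_RQ_RUNTIME;

    if (slack > 0) {
        *local -= slack;
        *global += slack;
    }
}

int main(void)
{
    long long local_runtime = 7000000LL;   /* 7 ms left on this cfs_rq */
    long long global_pool = 0;

    return_cfs_rq_runtime_model(&local_runtime, &global_pool);
    printf("local=%lld global=%lld\n", local_runtime, global_pool);  /* 1 ms / 6 ms */
    return 0;
}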
2011-08-14sched: Add exports tracking cfs bandwidth control statisticsNikhil Rao2-0/+28
This change introduces statistics exports for the cpu sub-system; these are added through the use of a stat file similar to that exported by other subsystems. The following exports are included:

 nr_periods: number of periods in which execution occurred
 nr_throttled: the number of periods above in which execution was throttled
 throttled_time: cumulative wall-time that any cpus have been throttled for this group

Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184758.198901931@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14sched: Throttle entities exceeding their allowed bandwidthPaul Turner1-2/+50
With the machinery in place to throttle and unthrottle entities, as well as to handle their participation (or lack thereof), we can now enable throttling. There are 2 points at which we must check whether it's time to set throttled state: put_prev_entity() and enqueue_entity().

- put_prev_entity() is the typical throttle path; we reach it by exceeding our allocated run-time within update_curr()->account_cfs_rq_runtime() and going through a reschedule.
- enqueue_entity() covers the case of a wake-up into an already throttled group. In this case we know the group cannot be on_rq and can throttle immediately.

Checks are added at the time of put_prev_entity() and enqueue_entity(). Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184758.091415417@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14sched: Migrate throttled tasks on HOTPLUGPaul Turner1-0/+27
Throttled tasks are invisible to cpu-offline since they are not eligible for selection by pick_next_task(). The regular 'escape' path for a thread that is blocked at offline is via ttwu->select_task_rq, however this will not handle a throttled group since there are no individual thread wakeups on an unthrottle. Resolve this by unthrottling offline cpus so that threads can be migrated. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.989000590@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14sched: Prevent buddy interactions with throttled entitiesPaul Turner1-1/+17
Buddies allow us to select "on-rq" entities without actually selecting them from a cfs_rq's rb_tree. As a result we must ensure that throttled entities are not falsely nominated as buddies. The fact that entities are dequeued within throttle_entity is not sufficient for clearing buddy status as the nomination may occur after throttling. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.886850167@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14sched: Prevent interactions with throttled entitiesPaul Turner2-7/+94
From the perspective of load-balance and shares distribution, throttled entities should be invisible. However, both of these operations work on 'active' lists and are not inherently aware of what group hierarchies may be present. In some cases this may be side-stepped (e.g. we could sideload via tg_load_down in load balance) while in others (e.g. update_shares()) it is more difficult to compute without incurring some O(n^2) costs. Instead, track hierarchical throttled state at time of transition. This allows us to easily identify whether an entity belongs to a throttled hierarchy and avoid incorrect interactions with it. Also, when an entity leaves a throttled hierarchy we need to advance its time averaging for shares averaging so that the elapsed throttled time is not considered as part of the cfs_rq's operation. We also use this information to prevent buddy interactions in the wakeup and yield_to() paths. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14sched: Allow for positional tg_tree walksPaul Turner1-13/+37
Extend walk_tg_tree to accept a positional argument static int walk_tg_tree_from(struct task_group *from, tg_visitor down, tg_visitor up, void *data) Existing semantics are preserved, caller must hold rcu_lock() or sufficient analogue. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.677889157@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14sched: Add support for unthrottling group entitiesPaul Turner2-4/+126
At the start of each period we refresh the global bandwidth pool. At this time we must also unthrottle any cfs_rq entities who are now within bandwidth once more (as quota permits). Unthrottled entities have their corresponding cfs_rq->throttled flag cleared and their entities re-enqueued. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14sched: Add support for throttling group entitiesPaul Turner2-4/+92
Now that consumption is tracked (via update_curr()) we add support to throttle group entities (and their corresponding cfs_rqs) in the case where there is no run-time remaining. Throttled entities are dequeued to prevent scheduling; additionally we mark them as throttled (using cfs_rq->throttled) to prevent them from becoming re-enqueued until they are unthrottled. A list of a task_group's throttled entities is maintained on the cfs_bandwidth structure. Note: While the machinery for throttling is added in this patch, the act of throttling an entity exceeding its bandwidth is deferred until later within the series. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.480608533@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
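A toy model of the bookkeeping described above (dequeueing of the entities is omitted; struct and field names are illustrative, not the kernel's):

#include <stdio.h>

struct cfs_rq_model {
    int throttled;
    long long runtime_remaining;
    struct cfs_rq_model *next_throttled;  /* linked onto the task_group's list */
};

static struct cfs_rq_model *throttled_list;  /* stands in for the cfs_bandwidth list */

static void throttle_cfs_rq_model(struct cfs_rq_model *rq)
{
    rq->throttled = 1;                  /* blocks re-enqueue until unthrottled */
    rq->next_throttled = throttled_list;
    throttled_list = rq;
}

int main(void)
{
    struct cfs_rq_model rq = { .runtime_remaining = -100 };

    if (rq.runtime_remaining <= 0)
        throttle_cfs_rq_model(&rq);
    printf("throttled=%d\n", rq.throttled);
    return 0;
}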
2011-08-14sched: Expire invalid runtimePaul Turner2-10/+84
Since quota is managed using a global state but consumed on a per-cpu basis we need to ensure that our per-cpu state is appropriately synchronized. Most importantly, runtime that is stale (from a previous period) should not be locally consumable. We take advantage of existing sched_clock synchronization about the jiffy to efficiently detect whether we have (globally) crossed a quota boundary above. One catch is that the direction of spread on sched_clock is undefined, specifically, we don't know whether our local clock is behind or ahead of the one responsible for the current expiration time. Fortunately we can differentiate these by considering whether the global deadline has advanced. If it has not, then we assume our clock to be "fast" and advance our local expiration; otherwise, we know the deadline has truly passed and we expire our local runtime. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.379275352@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
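The decision rule can be illustrated with a small standalone model (a simplification of the described logic, not the CFS code; times are arbitrary units):

#include <stdio.h>

struct rq_model {
    long long runtime_remaining;
    unsigned long long runtime_expires;  /* locally cached deadline */
};

static void expire_runtime_model(struct rq_model *rq,
                                 unsigned long long global_deadline,
                                 unsigned long long local_clock)
{
    if (local_clock < rq->runtime_expires)
        return;                              /* local deadline not reached yet */

    if (rq->runtime_expires == global_deadline) {
        /* global deadline unchanged: our clock is merely ahead (skew) */
        rq->runtime_expires += 1;            /* nudge local expiration forward */
    } else {
        /* the period really ended: stale quota must not be consumed */
        rq->runtime_remaining = 0;
    }
}

int main(void)
{
    struct rq_model rq = { .runtime_remaining = 5000, .runtime_expires = 100 };

    expire_runtime_model(&rq, 100, 101);  /* skew case: runtime is kept */
    printf("after skew case: %lld\n", rq.runtime_remaining);
    expire_runtime_model(&rq, 200, 150);  /* real expiry: runtime is dropped */
    printf("after real expiry: %lld\n", rq.runtime_remaining);
    return 0;
}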
2011-08-14sched: Add a timer to handle CFS bandwidth refreshPaul Turner2-24/+123
This patch adds a per-task_group timer which handles the refresh of the global CFS bandwidth pool. Since the RT pool is using a similar timer there's some small refactoring to share this support. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.277271273@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14sched: Accumulate per-cfs_rq cpu usage and charge against bandwidthPaul Turner3-3/+90
Account bandwidth usage on the cfs_rq level versus the task_groups to which they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained under cfs_rq->runtime_enabled. cfs_rq's which belong to a bandwidth constrained task_group have their runtime accounted via the update_curr() path, which withdraws bandwidth from the global pool as desired. Updates involving the global pool are currently protected under cfs_bandwidth->lock, local runtime is protected by rq->lock. This patch only assigns and tracks quota, no action is taken in the case that cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
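A minimal standalone model of this flow (the 5 ms slice and all names are illustrative assumptions, not the kernel implementation): per-cfs_rq runtime is consumed locally and topped up in slices from the global per-task_group pool.

#include <stdio.h>

#define SLICE 5000000LL  /* 5 ms refill slice, in ns (illustrative) */

struct global_pool_model { long long quota_remaining; };
struct local_rq_model    { long long runtime_remaining; };

static void account_runtime_model(struct local_rq_model *rq,
                                  struct global_pool_model *pool,
                                  long long delta_exec)
{
    rq->runtime_remaining -= delta_exec;
    while (rq->runtime_remaining <= 0 && pool->quota_remaining > 0) {
        long long grant = pool->quota_remaining < SLICE ?
                          pool->quota_remaining : SLICE;
        pool->quota_remaining -= grant;
        rq->runtime_remaining += grant;
    }
}

int main(void)
{
    struct global_pool_model pool = { 100000000LL };  /* 100 ms of quota */
    struct local_rq_model rq = { 0 };

    account_runtime_model(&rq, &pool, 3000000LL);     /* charge 3 ms of execution */
    printf("local=%lld global=%lld\n", rq.runtime_remaining, pool.quota_remaining);
    return 0;
}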
2011-08-14sched: Validate CFS quota hierarchiesPaul Turner1-14/+98
Add constraints validation for CFS bandwidth hierarchies. Validate that:

 max(child bandwidth) <= parent_bandwidth

In a quota-limited hierarchy, an unconstrained entity (e.g. bandwidth==RUNTIME_INF) inherits the bandwidth of its parent. This constraint is chosen over sum(child_bandwidth) as the notion of over-commit is valuable within SCHED_OTHER. Some basic code from the RT case is re-factored for reuse. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.083774572@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
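The constraint can be expressed as a tiny standalone check (RUNTIME_INF here is a stand-in constant; this sketches the rule, not the kernel's tree-walk based validation):

#include <stdio.h>

#define RUNTIME_INF (~0ULL)  /* "unconstrained" marker, illustrative */

/* an unconstrained child inherits its parent's bandwidth */
static unsigned long long effective_bw(unsigned long long own,
                                       unsigned long long parent)
{
    return own == RUNTIME_INF ? parent : own;
}

static int child_is_valid(unsigned long long child_bw, unsigned long long parent_bw)
{
    return effective_bw(child_bw, parent_bw) <= parent_bw;
}

int main(void)
{
    printf("%d\n", child_is_valid(RUNTIME_INF, 50));  /* 1: inherits the parent */
    printf("%d\n", child_is_valid(80, 50));           /* 0: exceeds the parent */
    return 0;
}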
2011-08-14sched: Introduce primitives to account for CFS bandwidth trackingPaul Turner2-4/+208
In this patch we introduce the notion of CFS bandwidth, partitioned into globally unassigned bandwidth and locally claimed bandwidth.

- The global bandwidth is per task_group; it represents a pool of unclaimed bandwidth that cfs_rqs can allocate from.
- The local bandwidth is tracked per-cfs_rq; this represents allotments from the global pool of bandwidth assigned to a specific cpu.

Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:

- cpu.cfs_period_us : the bandwidth period in usecs
- cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed to consume over the period above.

Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Nikhil Rao <ncrao@google.com> Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14sched: Implement hierarchical task accounting for SCHED_OTHERPaul Turner4-5/+14
Introduce hierarchical task accounting for the group scheduling case in CFS, as well as promoting the responsibility for maintaining rq->nr_running to the scheduling classes. The primary motivation for this is that with scheduling classes supporting bandwidth throttling it is possible for entities participating in throttled sub-trees to not have root visible changes in rq->nr_running across activate and de-activate operations. This in turn leads to incorrect idle and weight-per-task load balance decisions. This also allows us to make a small fixlet to the fastpath in pick_next_task() under group scheduling. Note: this issue also exists with the existing sched_rt throttling mechanism. This patch does not address that. Signed-off-by: Paul Turner <pjt@google.com> Reviewed-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14sched/cpupri: Remove cpupri->pri_activeYong Zhang2-4/+1
Since [sched/cpupri: Remove the vec->lock], member pri_active of struct cpupri is not needed any more, just remove it. Also clean stuff related to it. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110806001004.GA2207@zhy Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-14sched/cpupri: Fix memory barriers for vec updates to always be in orderSteven Rostedt1-3/+25
[ This patch actually compiles. Thanks to Mike Galbraith for pointing that out. I compiled and booted this patch with no issues. ] Re-examining the cpupri patch, I see there's a possible race because the update of the two priorities vec->counts are not protected by a memory barrier. When a RT runqueue is overloaded and wants to push an RT task to another runqueue, it scans the RT priority vectors in a loop from lowest priority to highest. When we queue or dequeue an RT task that changes a runqueue's highest priority task, we update the vectors to show that a runqueue is rated at a different priority. To do this, we first set the new priority mask, and increment the vec->count, and then set the old priority mask by decrementing the vec->count. If we are lowering the runqueue's RT priority rating, it will trigger a RT pull, and we do not care if we miss pushing to this runqueue or not. But if we raise the priority, but the priority is still lower than an RT task that is looking to be pushed, we must make sure that this runqueue is still seen by the push algorithm (the loop). Because the loop reads from lowest to highest, and the new priority is set before the old one is cleared, we will either see the new or old priority set and the vector will be checked. But! Since there's no memory barrier between the updates of the two, the old count may be decremented first before the new count is incremented. This means the loop may see the old count of zero and skip it, and also the new count of zero before it was updated. A possible runqueue that the RT task could move to could be missed. A conditional memory barrier is placed between the vec->count updates and is only called when both updates are done. The smp_wmb() has also been changed to smp_mb__before_atomic_inc/dec(), as they are not needed by archs that already synchronize atomic_inc/dec(). The smp_rmb() has been moved to be called at every iteration of the loop so that the race between seeing the two updates is visible by each iteration of the loop, as an arch is free to optimize the reading of memory of the counters in the loop. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1312547269.18583.194.camel@gandalf.stny.rr.com Signed-off-by: Ingo Molnar <mingo@elte.hu>