path: root/arch/powerpc/kernel/time.c
Age | Commit message | Author | Files | Lines
2013-01-29 | powerpc: Max next_tb to prevent from replaying timer interrupt | Tiejun Chen | 1 | -2/+7
With lazy interrupts, we always call __check_irq_replay(), which checks decrementers_next_tb to decide whether the timer interrupt needs to be replayed. So in the hotplug case we also need to set decrementers_next_tb to its maximum value to make sure __check_irq_replay() doesn't replay the timer interrupt on return, as we expect; otherwise we'll trap here forever. Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-01-03 | powerpc/vdso: Remove redundant locking in update_vsyscall_tz() | Shan Hai | 1 | -5/+0
The locking in update_vsyscall_tz() is not only unnecessary, because the vdso code copies the data unprotected in __kernel_gettimeofday(), but also erroneous: updating tb_update_count is not atomic, which introduces a hard-to-reproduce race condition between update_vsyscall() and update_vsyscall_tz() that can make a user space process loop forever in vdso code. This patch removes the locking from update_vsyscall_tz().

The scenario below describes the race condition:

    x==0    Boot CPU                other CPU
            proc_P: x==0
            timer interrupt
              update_vsyscall
    x==1      x++;sync              settimeofday
                                      update_vsyscall_tz
    x==2                              x++;sync
    x==3      sync;x++
                                      sync;x++
            proc_P: x==3 (loops until x becomes even)

This happens because the ++ operator is implemented as three instructions and is not atomic on powerpc.

A similar change was made for x86 in commit 6c260d58634 ("x86: vdso: Remove bogus locking in update_vsyscall_tz") Signed-off-by: Shan Hai <shan.hai@windriver.com> CC: <stable@vger.kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
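The lost update in that scenario can be replayed deterministically. The following is a minimal single-threaded sketch, not the kernel code: `update_count`, `replay_race()` and `reader_must_retry()` are hypothetical names, and each `x++` is split by hand into a load and a store so the overlap of the two closing increments is visible.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the vdso's tb_update_count. */
static uint32_t update_count;

/*
 * Replay the interleaving from the commit message on one thread.
 * The closing "x++" of each writer is split into an explicit load and
 * store so that the two increments can overlap, as on real hardware.
 */
static uint32_t replay_race(void)
{
    uint32_t boot_tmp, other_tmp;

    update_count = 0;
    update_count = update_count + 1;  /* boot CPU, update_vsyscall: x == 1 */
    update_count = update_count + 1;  /* other CPU, update_vsyscall_tz: x == 2 */
    assert(update_count == 2);

    boot_tmp = update_count;          /* boot CPU loads 2 */
    other_tmp = update_count;         /* other CPU also loads 2 */
    update_count = boot_tmp + 1;      /* boot CPU stores 3 */
    update_count = other_tmp + 1;     /* other CPU stores 3: one update lost */

    return update_count;
}

/* A reader of the sequence count retries while the count is odd. */
static int reader_must_retry(uint32_t count)
{
    return (count & 1u) != 0;
}
```

Calling `replay_race()` leaves the counter odd with no writer active, so a reader that applies `reader_must_retry()` to it never makes progress, which matches the hang described above.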
2012-11-20 | vtime: Warn if irqs aren't disabled on system time accounting APIs | Frederic Weisbecker | 1 | -0/+2
System time accounting APIs such as vtime_account_system() and vtime_account_idle() need to be irqsafe. Current callers include irq entry, exit and kvm, all of which have been checked against that requirement. Now it's better to grow that with an automatic check in case we have further callers or we missed something. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
2012-11-19 | vtime: Consolidate a bit the ctx switch code | Frederic Weisbecker | 1 | -6/+0
On ia64 and powerpc, vtime context switch only consists in flushing system and user pending time, plus a few arch housekeeping. Consolidate that into a generic implementation. s390 is a special case because pending user and system time accounting there is hard to dissociate. So it's keeping its own implementation. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
2012-11-19 | vtime: Explicitly account pending user time on process tick | Frederic Weisbecker | 1 | -7/+7
All vtime implementations just flush the user time on process tick. Consolidate that in generic code by calling a user time accounting helper. This avoids an indirect call in ia64 and prepares the ground for also consolidating the vtime context switch code. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
2012-11-19 | vtime: Remove the underscore prefix invasion | Frederic Weisbecker | 1 | -2/+2
Prepending irq-unsafe vtime APIs with underscores was actually a bad idea as the result is a big mess in the API namespace that is even waiting to be further extended. Also these helpers are always called from irq safe callers except kvm. Just provide a vtime_account_system_irqsafe() for this specific case so that we can remove the underscore prefix on other vtime functions. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
2012-10-30 | vtime: Make vtime_account_system() irqsafe | Frederic Weisbecker | 1 | -2/+2
vtime_account_system() currently has only one caller with vtime_account() which is irq safe. Now we are going to call it from other places like kvm where irqs are not always disabled by the time we account the cputime. So let's make it irqsafe. The arch implementation part is now prefixed with "__". vtime_account_idle() arch implementation is prefixed accordingly to stay consistent. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
2012-10-12 | Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -2/+2
Pull timer core update from Thomas Gleixner:
- Bug fixes (one for a longstanding dead loop issue)
- Rework of time related vsyscalls
- Alarm timer updates
- Jiffies updates to remove compile time dependencies

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timekeeping: Cast raw_interval to u64 to avoid shift overflow
  timers: Fix endless looping between cascade() and internal_add_timer()
  time/jiffies: bring back unconditional LATCH definition
  time: Convert x86_64 to using new update_vsyscall
  time: Only do nanosecond rounding on GENERIC_TIME_VSYSCALL_OLD systems
  time: Introduce new GENERIC_TIME_VSYSCALL
  time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD
  time: Move update_vsyscall definitions to timekeeper_internal.h
  time: Move timekeeper structure to timekeeper_internal.h for vsyscall changes
  jiffies: Remove compile time assumptions about CLOCK_TICK_RATE
  jiffies: Kill unused TICK_USEC_TO_NSEC
  alarmtimer: Rename alarmtimer_remove to alarmtimer_dequeue
  alarmtimer: Remove unused helpers & defines
  alarmtimer: Use hrtimer per-alarm instead of per-base
  alarmtimer: Implement minimum alarm interval for allowing suspend
2012-10-05 | Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc | Linus Torvalds | 1 | -4/+4
Pull powerpc updates from Benjamin Herrenschmidt:
"Some highlights in addition to the usual batch of fixes:
- 64TB address space support for 64-bit processes by Aneesh Kumar
- Gavin Shan did a major cleanup & re-organization of our EEH support code (IBM fancy PCI error handling & recovery infrastructure) which paves the way for supporting different platform backends, along with some rework of the PCIe code for the PowerNV platform in order to remove home made resource allocations and instead use the generic code (which is possible after some small improvements to it done by Gavin).
- Uprobes support by Ananth N Mavinakayanahalli
- A pile of embedded updates from Freescale folks, including new SoC and board supports, more KVM stuff including preparing for 64-bit BookE KVM support, ePAPR 1.1 updates, etc..."

Fixup trivial conflicts in drivers/scsi/ipr.c

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (146 commits)
  powerpc/iommu: Fix multiple issues with IOMMU pools code
  powerpc: Fix VMX fix for memcpy case
  driver/mtd:IFC NAND:Initialise internal SRAM before any write
  powerpc/fsl-pci: use 'Header Type' to identify PCIE mode
  powerpc/eeh: Don't release eeh_mutex in eeh_phb_pe_get
  powerpc: Remove tlb batching hack for nighthawk
  powerpc: Set paca->data_offset = 0 for boot cpu
  powerpc/perf: Sample only if SIAR-Valid bit is set in P7+
  powerpc/fsl-pci: fix warning when CONFIG_SWIOTLB is disabled
  powerpc/mpc85xx: Update interrupt handling for IFC controller
  powerpc/85xx: Enable USB support in p1023rds_defconfig
  powerpc/smp: Do not disable IPI interrupts during suspend
  powerpc/eeh: Fix crash on converting OF node to edev
  powerpc/eeh: Lock module while handling EEH event
  powerpc/kprobe: Don't emulate store when kprobe stwu r1
  powerpc/kprobe: Complete kprobe and migrate exception frame
  powerpc/kprobe: Introduce a new thread flag
  powerpc: Remove unused __get_user64() and __put_user64()
  powerpc/eeh: Global mutex to protect PE tree
  powerpc/eeh: Remove EEH PE for normal PCI hotplug
  ...
2012-10-01 | Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds | 1 | -20/+35
Pull scheduler changes from Ingo Molnar:
"Continued quest to clean up and enhance the cputime code by Frederic Weisbecker, in preparation for future tickless kernel features. Other than that, smallish changes."

Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
  cputime: Make finegrained irqtime accounting generally available
  cputime: Gather time/stats accounting config options into a single menu
  ia64: Reuse system and user vtime accounting functions on task switch
  ia64: Consolidate user vtime accounting
  vtime: Consolidate system/idle context detection
  cputime: Use a proper subsystem naming for vtime related APIs
  sched: cpu_power: enable ARCH_POWER
  sched/nohz: Clean up select_nohz_load_balancer()
  sched: Fix load avg vs. cpu-hotplug
  sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
  sched: Fix nohz_idle_balance()
  sched: Remove useless code in yield_to()
  sched: Add time unit suffix to sched sysctl knobs
  sched/debug: Limit sd->*_idx range on sysctl
  sched: Remove AFFINE_WAKEUPS feature flag
  s390: Remove leftover account_tick_vtime() header
  cputime: Consolidate vtime handling on context switch
  sched: Move cputime code to its own file
  cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
  tile: Remove SD_PREFER_LOCAL leftover
  ...
2012-09-25 | vtime: Consolidate system/idle context detection | Frederic Weisbecker | 1 | -19/+28
Move the code that finds out to which context we account the cputime into the generic layer. Archs that consider the whole time spent in the idle task as idle time (ia64, powerpc) can rely on the generic vtime_account() and implement vtime_account_system() and vtime_account_idle(), letting the generic code decide when to call which API. Archs that have their own meaning of idle time, such as s390 that only considers the time spent in CPU low power mode as idle time, can just override vtime_account(). Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
2012-09-25 | cputime: Use a proper subsystem naming for vtime related APIs | Frederic Weisbecker | 1 | -5/+5
Use a naming based on vtime as a prefix for virtual based cputime accounting APIs:
- account_system_vtime() -> vtime_account()
- account_switch_vtime() -> vtime_task_switch()

It makes it easier to allow for further declension such as vtime_account_system(), vtime_account_idle(), ... if we want to find out the context we account to from generic code. This also makes it clearer which subsystem these APIs belong to. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
2012-09-24 | time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD | John Stultz | 1 | -1/+1
To help migrate architectures over to the new update_vsyscall method, redefine CONFIG_GENERIC_TIME_VSYSCALL as CONFIG_GENERIC_TIME_VSYSCALL_OLD. Cc: Tony Luck <tony.luck@intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Turner <pjt@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-24 | time: Move update_vsyscall definitions to timekeeper_internal.h | John Stultz | 1 | -1/+1
Since users will need to include timekeeper_internal.h, move update_vsyscall definitions to timekeeper_internal.h. Cc: Tony Luck <tony.luck@intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Turner <pjt@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <john.stultz@linaro.org>
2012-09-17 | powerpc/trace: Fix interrupt tracepoints vs. RCU | Li Zhong | 1 | -4/+4
There are a few tracepoints in the interrupt code path that run before irq_enter() or after irq_exit(), like trace_irq_entry()/trace_irq_exit() in do_IRQ() and trace_timer_interrupt_entry()/trace_timer_interrupt_exit() in timer_interrupt(). If the interrupt comes in from idle, then because a tracepoint contains an RCU read-side critical section, we can see the following suspicious RCU usage reported:

    [ 145.127743] ===============================
    [ 145.127747] [ INFO: suspicious RCU usage. ]
    [ 145.127752] 3.6.0-rc3+ #1 Not tainted
    [ 145.127755] -------------------------------
    [ 145.127759] /root/.workdir/linux/arch/powerpc/include/asm/trace.h:33 suspicious rcu_dereference_check() usage!
    [ 145.127765]
    [ 145.127765] other info that might help us debug this:
    [ 145.127765]
    [ 145.127771]
    [ 145.127771] RCU used illegally from idle CPU!
    [ 145.127771] rcu_scheduler_active = 1, debug_locks = 0
    [ 145.127777] RCU used illegally from extended quiescent state!
    [ 145.127781] no locks held by swapper/0/0.
    [ 145.127785]
    [ 145.127785] stack backtrace:
    [ 145.127789] Call Trace:
    [ 145.127796] [c00000000108b530] [c000000000013c40] .show_stack+0x70/0x1c0 (unreliable)
    [ 145.127806] [c00000000108b5e0] [c0000000000f59d8] .lockdep_rcu_suspicious+0x118/0x150
    [ 145.127813] [c00000000108b680] [c00000000000fc58] .do_IRQ+0x498/0x500
    [ 145.127820] [c00000000108b750] [c000000000003950] hardware_interrupt_common+0x150/0x180
    [ 145.127828] --- Exception: 501 at .plpar_hcall_norets+0x84/0xd4
    [ 145.127828] LR = .check_and_cede_processor+0x38/0x70
    [ 145.127836] [c00000000108bab0] [c0000000000665dc] .shared_cede_loop+0x5c/0x100
    [ 145.127844] [c00000000108bb70] [c000000000588ab0] .cpuidle_enter+0x30/0x50
    [ 145.127850] [c00000000108bbe0] [c000000000588b0c] .cpuidle_enter_state+0x3c/0xb0
    [ 145.127857] [c00000000108bc60] [c000000000589730] .cpuidle_idle_call+0x150/0x6c0
    [ 145.127863] [c00000000108bd30] [c000000000058440] .pSeries_idle+0x10/0x40
    [ 145.127870] [c00000000108bda0] [c00000000001683c] .cpu_idle+0x18c/0x2d0
    [ 145.127876] [c00000000108be60] [c00000000000b434] .rest_init+0x124/0x1b0
    [ 145.127884] [c00000000108bef0] [c0000000009d0d28] .start_kernel+0x568/0x588
    [ 145.127890] [c00000000108bf90] [c000000000009660] .start_here_common+0x20/0x40

This is because RCU in interrupt context should only be used inside the area marked by rcu_irq_enter()/rcu_irq_exit(), which are called in irq_enter()/irq_exit() respectively. Move the tracepoints into the irq_enter()/irq_exit() area to avoid the report. Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-05 | powerpc: Give hypervisor decrementer interrupts their own handler | Paul Mackerras | 1 | -0/+9
At the moment the handler for hypervisor decrementer interrupts is the same as for decrementer interrupts, i.e. timer_interrupt(). This is bogus; if we ever do get a hypervisor decrementer interrupt it won't have anything to do with the next timer event. In fact the only time we get hypervisor decrementer interrupts is when one is left pending on exit from a KVM guest. When we get a hypervisor decrementer interrupt we don't need to do anything special to clear it, since they are edge-triggered on the transition of HDEC from 0 to -1. Thus this adds an empty handler function for them. We don't need to have them masked when interrupts are soft-disabled, so we use STD_EXCEPTION_HV instead of MASKABLE_EXCEPTION_HV. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-08-20 | cputime: Consolidate vtime handling on context switch | Frederic Weisbecker | 1 | -0/+6
The archs that implement virtual cputime accounting all flush the cputime of a task when it gets descheduled and sometimes set up some ground initialization for the next task to account its cputime. These archs all put their own hooks in their context switch callbacks and handle the off-case themselves. Consolidate this by creating a new account_switch_vtime() callback called in generic code right after a context switch and that these archs must implement to flush the prev task cputime and initialize the next task cputime related state. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org>
2012-06-08 | powerpc/time: Sanity check of decrementer expiration is necessary | Paul Mackerras | 1 | -3/+11
This reverts 68568add2c ("powerpc/time: Remove unnecessary sanity check of decrementer expiration"). We do need to check whether we have reached the expiration time of the next event, because we sometimes get an early decrementer interrupt, most notably when we set the decrementer to 1 in arch_irq_work_raise(). The effect of not having the sanity check is that if timer_interrupt() gets called early, we leave the decrementer set to its maximum value, which means we then don't get any more decrementer interrupts for about 4 seconds (or longer, depending on timebase frequency). I saw these pauses as a consequence of getting a stray hypervisor decrementer interrupt left over from exiting a KVM guest. This isn't quite a straight revert because of changes to the surrounding code, but it restores the same algorithm as was previously used. Cc: stable@vger.kernel.org Acked-by: Anton Blanchard <anton@samba.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
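The sanity check described here can be sketched as a pure decision function. This is an illustration under assumptions, not the code in time.c: `struct dec_action` and `on_decrementer()` are hypothetical names, and the timebase values are passed in rather than read from hardware. `DECREMENTER_MAX` is the 0x7fffffff constant mentioned elsewhere in this log.

```c
#include <assert.h>
#include <stdint.h>

#define DECREMENTER_MAX 0x7fffffffu

/* Hypothetical result type: what the interrupt path should do next. */
struct dec_action {
    int run_event_handler;  /* 1 if the next event's expiry time was reached */
    uint32_t dec_value;     /* value to program into the decrementer */
};

/*
 * Decide what to do on a decrementer interrupt, given the current
 * timebase (now) and the timebase of the next timer event (next_tb).
 */
static struct dec_action on_decrementer(uint64_t now, uint64_t next_tb)
{
    struct dec_action act;

    if (now >= next_tb) {
        /* Expiry reached: run the handler, which re-arms the timer. */
        act.run_event_handler = 1;
        act.dec_value = DECREMENTER_MAX;
    } else {
        /* Early interrupt: re-arm for the time that is still left. */
        uint64_t remaining = next_tb - now;

        act.run_event_handler = 0;
        act.dec_value = remaining <= DECREMENTER_MAX
                            ? (uint32_t)remaining
                            : DECREMENTER_MAX;
    }
    assert(act.dec_value <= DECREMENTER_MAX);
    return act;
}
```

Without the `else` branch, an early interrupt would leave the decrementer at its maximum value, which is the multi-second pause the commit message reports.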
2012-05-06 | KVM: PPC: Use clockevent multiplier and shifter for decrementer | Bharat Bhushan | 1 | -1/+2
The time for which the hrtimer is started for decrementer emulation is calculated using tb_ticks_per_usec, while the hrtimer uses the clockevent for DEC reprogramming (if needed), which calculates timebase ticks using the multiplier and shifter mechanism implemented within the clockevent layer. It was observed that this conversion (timebase->time->timebase) is not correct because the two mechanisms are not consistent. In our setup it adds 2% jitter. With this patch, the clockevent multiplier and shifter mechanism is used when starting the hrtimer for decrementer emulation. Now the jitter is < 0.5%. Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
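To make the inconsistency concrete, here is a small sketch of the two conversions. It assumes the usual fixed-point form `ticks = (ns * mult) >> shift` with `mult = (hz << shift) / NSEC_PER_SEC`; the function names and the 33333333 Hz timebase used in the usage note are illustrative and not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull

/* Fixed-point multiplier for a clock of `hz` ticks per second. */
static uint32_t calc_mult(uint64_t hz, uint32_t shift)
{
    assert(shift < 32);
    return (uint32_t)((hz << shift) / NSEC_PER_SEC);
}

/* Nanoseconds to ticks using the mult/shift pair. */
static uint64_t ns_to_ticks(uint64_t ns, uint32_t mult, uint32_t shift)
{
    return (ns * mult) >> shift;
}

/* The coarser route: whole ticks per microsecond, truncated. */
static uint64_t ns_to_ticks_per_usec(uint64_t ns, uint64_t hz)
{
    uint64_t ticks_per_usec = hz / 1000000ull;

    return (ns / 1000ull) * ticks_per_usec;
}
```

With a 33333333 Hz timebase and a shift of 24, one millisecond converts to 33333 ticks through `ns_to_ticks()` but to 33000 ticks through `ns_to_ticks_per_usec()`, roughly a 1% difference. Mixing the two routes in one round trip is the kind of mismatch the commit message describes.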
2012-03-21 | powerpc: Remove FW_FEATURE ISERIES from arch code | Stephen Rothwell | 1 | -105/+3
This is no longer selectable, so just remove all the dependent code. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-03-09 | powerpc: Rework lazy-interrupt handling | Benjamin Herrenschmidt | 1 | -3/+5
The current implementation of lazy interrupts handling has some issues that this tries to address.

We don't do the various workarounds we need to do when re-enabling interrupts in some cases such as when returning from an interrupt and thus we may still lose or get delayed decrementer or doorbell interrupts.

The current scheme also makes it much harder to handle the external "edge" interrupts provided by some BookE processors when using the EPR facility (External Proxy) and the Freescale Hypervisor.

Additionally, we tend to keep interrupts hard disabled in a number of cases, such as decrementer interrupts, external interrupts, or when a masked decrementer interrupt is pending. This is sub-optimal.

This is an attempt at fixing it all in one go by reworking the way we do the lazy interrupt disabling from the ground up.

The base idea is to replace the "hard_enabled" field with a "irq_happened" field in which we store a bit mask of what interrupt occurred while soft-disabled.

When re-enabling, either via arch_local_irq_restore() or when returning from an interrupt, we can now decide what to do by testing bits in that field.

We then implement replaying of the missed interrupts either by re-using the existing exception frame (in exception exit case) or via the creation of a new one from an assembly trampoline (in the arch_local_irq_enable case).

This removes the need to play with the decrementer to try to create fake interrupts, among others.

In addition, this adds a few refinements:

- We no longer hard disable decrementer interrupts that occur while soft-disabled. We now simply bump the decrementer back to max (on BookS) or leave it stopped (on BookE) and continue with hard interrupts enabled, which means that we'll potentially get better sample quality from performance monitor interrupts.
- Timer, decrementer and doorbell interrupts now hard-enable shortly after removing the source of the interrupt, which means they no longer run entirely hard disabled. Again, this will improve perf sample quality.
- On Book3E 64-bit, we now make the performance monitor interrupt act as an NMI like Book3S (the necessary C code for that to work appear to already be present in the FSL perf code, notably calling nmi_enter instead of irq_enter). (This also fixes a bug where BookE perfmon interrupts could clobber r14 ... oops)
- We could make "masked" decrementer interrupts act as NMIs when doing timer-based perf sampling to improve the sample quality.

Signed-off-by-yet: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
v2:
- Add hard-enable to decrementer, timer and doorbells
- Fix CR clobber in masked irq handling on BookE
- Make embedded perf interrupt act as an NMI
- Add a PACA_HAPPENED_EE_EDGE for use by FSL if they want to retrigger an interrupt without preventing hard-enable

v3:
- Fix or vs. ori bug on Book3E
- Fix enabling of interrupts for some exceptions on Book3E

v4:
- Fix resend of doorbells on return from interrupt on Book3E

v5:
- Rebased on top of my latest series, which involves some significant rework of some aspects of the patch.

v6:
- 32-bit compile fix
- more compile fixes with various .config combos
- factor out the asm code to soft-disable interrupts
- remove the C wrapper around preempt_schedule_irq

v7:
- Fix a bug with hard irq state tracking on native power7
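The "irq_happened" idea can be modelled in a few lines. This is a toy model under stated assumptions, not the PACA layout or the real replay path: `struct cpu_model`, the `SRC_*` sources and both functions are hypothetical, and "handling" an interrupt is reduced to bumping a counter.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sources; these are not the kernel's PACA_IRQ_* values. */
enum { SRC_DEC, SRC_DBELL, SRC_EE, SRC_COUNT };

struct cpu_model {
    int soft_enabled;             /* 0 while interrupts are soft-disabled */
    uint8_t irq_happened;         /* one bit per source seen while disabled */
    unsigned handled[SRC_COUNT];  /* how often each handler actually ran */
};

/* An interrupt fires: handle it now, or just record that it happened. */
static void interrupt_arrives(struct cpu_model *cpu, int src)
{
    assert(src >= 0 && src < SRC_COUNT);
    if (!cpu->soft_enabled) {
        cpu->irq_happened |= (uint8_t)(1u << src);
        return;
    }
    cpu->handled[src]++;
}

/* Restore the soft-enable state and replay whatever was recorded. */
static void irq_restore(struct cpu_model *cpu, int enable)
{
    int src;

    cpu->soft_enabled = enable;
    if (!enable)
        return;
    for (src = 0; src < SRC_COUNT; src++) {
        uint8_t bit = (uint8_t)(1u << src);

        if (cpu->irq_happened & bit) {
            cpu->irq_happened &= (uint8_t)~bit;
            cpu->handled[src]++;  /* replay the missed interrupt once */
        }
    }
}
```

Note that two decrementer interrupts arriving while soft-disabled collapse into a single replay in this model, because the field records which sources fired rather than how many times.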
2011-12-19 | powerpc: Fix wrong divisor in usecs_to_cputime | Andreas Schwab | 1 | -5/+5
Commit d57af9b (taskstats: use real microsecond granularity for CPU times) renamed msecs_to_cputime to usecs_to_cputime, but failed to update all numbers on the way. This causes nonsensical cpu idle/iowait values to be displayed in /proc/stat (the only user of usecs_to_cputime so far). This also renames __cputime_msec_factor to __cputime_usec_factor, adapting its value and using it directly in cputime_to_usecs instead of doing two multiplications. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Acked-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-25 | powerpc/time: Optimise decrementer_check_overflow | Anton Blanchard | 1 | -20/+7
decrementer_check_overflow is called from arch_local_irq_restore so we want to make it as light weight as possible. As such, turn decrementer_check_overflow into an inline function.

To avoid a circular mess of includes, separate out the two components of struct decrementer_clock and keep the struct clock_event_device part local to time.c.

The fast path improves from:

    arch_local_irq_restore
       0: mflr r0
       4: std r0,16(r1)
       8: stdu r1,-112(r1)
       c: stb r3,578(r13)
      10: cmpdi cr7,r3,0
      14: beq- cr7,24 <.arch_local_irq_restore+0x24>
      ...
      24: addi r1,r1,112
      28: ld r0,16(r1)
      2c: mtlr r0
      30: blr

to:

    arch_local_irq_restore
       0: std r30,-16(r1)
       4: ld r30,0(r2)
       8: stb r3,578(r13)
       c: cmpdi cr7,r3,0
      10: beq- cr7,6c <.arch_local_irq_restore+0x6c>
      ...
      6c: ld r30,-16(r1)
      70: blr

Unfortunately we still setup a local TOC (due to -mminimal-toc). Yet another sign we should be moving to -mcmodel=medium. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-25 | powerpc/time: Fix some style issues | Anton Blanchard | 1 | -11/+11
Fix some formatting issues and use the DECREMENTER_MAX define instead of 0x7fffffff. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-25 | powerpc/time: Remove unnecessary sanity check of decrementer expiration | Anton Blanchard | 1 | -11/+3
The clockevents code uses max_delta_ns to avoid calling a clockevent with too large a value. Remove the redundant version of this in the timer_interrupt code. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-25 | powerpc/time: Use clocksource_register_hz | Anton Blanchard | 1 | -10/+3
Use clocksource_register_hz which calculates the shift/mult factors for us. Also remove the shift = 22 assumption in vsyscall_update - thanks to Paul Mackerras and John Stultz for catching that. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-25 | powerpc/time: Use clockevents_calc_mult_shift | Anton Blanchard | 1 | -28/+2
We can use clockevents_calc_mult_shift instead of doing all the work ourselves. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-11-25 | powerpc/time: Handle wrapping of decrementer | Anton Blanchard | 1 | -0/+9
When re-enabling interrupts we have code to handle edge sensitive decrementers by resetting the decrementer to 1 whenever it is negative. If interrupts were disabled long enough that the decrementer wrapped to positive we do nothing. This means interrupts can be delayed for a long time until it finally goes negative again. While we hope interrupts are never disabled long enough for the decrementer to go positive, we have a very good test team that can drive any kernel into the ground. The softlockup data we get back from these fails could be seconds in the future, completely missing the cause of the lockup. We already keep track of the timebase of the next event so use that to work out if we should trigger a decrementer exception. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
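A short model shows why looking only at the decrementer's sign can miss an overdue event. The helper names below are hypothetical and the decrementer is simulated as a 32-bit down-counter; the conversion to `int32_t` assumes two's complement, as on the platforms discussed here.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 32-bit decrementer `elapsed` ticks after being programmed. */
static int32_t dec_after(uint32_t programmed, uint64_t elapsed)
{
    uint32_t raw;

    assert(programmed <= 0x7fffffffu);
    raw = (uint32_t)(programmed - elapsed);  /* wraps modulo 2^32 */
    return (int32_t)raw;                     /* two's complement assumed */
}

/* The sign-only test: a negative decrementer means "expired". */
static int dec_looks_expired(int32_t dec)
{
    return dec < 0;
}

/* The timebase test: compare the 64-bit timebase with the event time. */
static int event_is_due(uint64_t now, uint64_t next_tb)
{
    return now >= next_tb;
}
```

If the decrementer was programmed with 100 and 2^32 + 95 ticks have passed, `dec_after()` returns 5, so `dec_looks_expired()` says no, while `event_is_due()` on the 64-bit timebase still reports the event as overdue.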
2011-11-01 | powerpc: various straight conversions from module.h --> export.h | Paul Gortmaker | 1 | -1/+1
All these files were including module.h just for the basic EXPORT_SYMBOL infrastructure. We can shift them off to the export.h header which is a way smaller footprint and thus realize some compile time gains. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-07-01 | irq_work, ppc: Fix up arch hooks | Peter Zijlstra | 1 | -1/+1
Commit e360adbe29 ("irq_work: Add generic hardirq context callbacks") fouled up the ppc bit, not properly naming the arch specific function that raises the 'self-IPI'. Cc: Huang Ying <ying.huang@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Anton Blanchard <anton@samba.org> Cc: Eric B Munson <emunson@mgebm.net> Cc: stable@kernel.org # 37+ Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-eg0aqien8p1aqvzu9dft6dtv@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-04-18 | powerpc: Fix oops if scan_dispatch_log is called too early | Anton Blanchard | 1 | -0/+3
We currently enable interrupts before the dispatch log for the boot cpu is setup. If a timer interrupt comes in early enough we oops in scan_dispatch_log:

    Unable to handle kernel paging request for data at address 0x00000010
    ...
    .scan_dispatch_log+0xb0/0x170
    .account_system_vtime+0xa0/0x220
    .irq_enter+0x88/0xc0
    .do_IRQ+0x48/0x230

The patch below adds a check to scan_dispatch_log to ensure the dispatch log has been allocated. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: <stable@kernel.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-04-01 | powerpc: Make decrementer interrupt robust against offlined CPUs | Benjamin Herrenschmidt | 1 | -4/+11
With some implementations, it is possible that a timer interrupt occurs every few seconds on an offline CPU. In this case, just re-arm the decrementer and return immediately. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2011-03-30 | powerpc: Fix accounting of softirq time when idle | Anton Blanchard | 1 | -1/+1
commit cf9efce0ce31 (powerpc: Account time using timebase rather than PURR) used in_irq() to detect if the time was spent in interrupt processing. This only catches hardirq context so if we are in softirq context and in the idle loop we end up accounting it as idle time. If we instead use in_interrupt() we catch both softirq and hardirq time.

The issue was found when running a network intensive workload. top showed the following:

    0.0%us, 1.1%sy, 0.0%ni, 85.7%id, 0.0%wa, 9.9%hi, 3.3%si, 0.0%st

85.7% idle. But this was wildly different to the perf events data. To confirm the suspicion I ran something to keep the core busy:

    # yes > /dev/null &

    8.2%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 10.3%hi, 81.4%si, 0.0%st

We only got 8.2% of the CPU for the userspace task and softirq has shot up to 81.4%.

With the patch below top shows the correct stats:

    0.0%us, 0.0%sy, 0.0%ni, 5.3%id, 0.0%wa, 13.3%hi, 81.3%si, 0.0%st

Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
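The difference between the two context tests can be illustrated with a toy preempt-count word. The mask values and every name below are assumptions made for this sketch; they are not copied from a kernel header, and `account()` only mimics the idle-versus-system decision described in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit fields of a preempt-count word (assumed layout). */
#define SOFTIRQ_BITS_MASK 0x0000ff00u
#define HARDIRQ_BITS_MASK 0x00ff0000u

/* Like in_irq(): true only in hardirq context. */
static int model_in_irq(uint32_t count)
{
    return (count & HARDIRQ_BITS_MASK) != 0;
}

/* Like in_interrupt(): true in hardirq or softirq context. */
static int model_in_interrupt(uint32_t count)
{
    return (count & (HARDIRQ_BITS_MASK | SOFTIRQ_BITS_MASK)) != 0;
}

enum bucket { BUCKET_IDLE, BUCKET_SYSTEM };

/* Decide which bucket a slice of time goes to, using the given test. */
static enum bucket account(uint32_t count, int is_idle_task,
                           int (*in_ctx)(uint32_t))
{
    assert(in_ctx != 0);
    if (is_idle_task && !in_ctx(count))
        return BUCKET_IDLE;
    return BUCKET_SYSTEM;
}
```

For a softirq running on top of the idle task (say `count == 0x100`), `account()` with `model_in_irq` books the time as idle, while the same call with `model_in_interrupt` books it as system time, which is the misaccounting that the top output above shows.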
2011-01-21powerpc/cell: Use system_wq in cpufreq_spudemandTejun Heo1-5/+20
With cmwq, there's no reason to use a separate workqueue in cpufreq_spudemand. Use system_wq instead. The work items are already sync canceled on stop, so it's already guaranteed that no work is running when spu_gov_exit() is entered. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linuxppc-dev@lists.ozlabs.org Cc: Dave Jones <davej@redhat.com> Cc: cpufreq@vger.kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-12-09powerpc/time: printk time stamp init not correctHeiko Schocher1-1/+1
Problem: on my mpc5200-based board I sometimes see printk timing information like this:

[ 0.000000] NR_IRQS:512 nr_irqs:512 16
[ 0.000000] MPC52xx PIC is up and running!
[ 0.000000] clocksource: timebase mult[79364d9] shift[22] registered
[ 0.000000] console [ttyPSC0] enabled
[ 130.300633] pid_max: default: 32768 minimum: 301
[ 130.305647] Mount-cache hash table entries: 512
[ 130.315818] NET: Registered protocol family 16

Reason: if the TBU does not start from 0 when Linux boots, boot_tb may not be able to store the real 64-bit timebase value, because boot_tb is only a 32-bit unsigned long.

Solution: change boot_tb to u64.

[BenH: Made it u64 instead of unsigned long long]

Signed-off-by: Heiko Schocher <hs@denx.de>
cc: Wolfgang Denk <wd@denx.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
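A minimal user-space sketch of the truncation, assuming only what the message states (a 64-bit timebase and a 32-bit boot_tb). The function names are hypothetical and this is not the kernel's sched_clock() code. If the clocksource line in the log (mult 0x79364d9, shift 22) is read as roughly 30.3 ns per timebase tick, one count of TBU is 2^32 ticks, or about 130 s, which is in line with the jump seen in the timestamps.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed since boot when the boot-time timebase is kept in 64 bits. */
uint64_t elapsed_u64(uint64_t now_tb, uint64_t boot_tb_at_init)
{
    assert(now_tb >= boot_tb_at_init);
    return now_tb - boot_tb_at_init;
}

/* Same computation when the boot-time value is stored in a 32-bit variable,
 * as unsigned long is on 32-bit powerpc: the upper word (TBU) is dropped,
 * so the result is too large by TBU * 2^32 ticks. */
uint64_t elapsed_u32(uint64_t now_tb, uint64_t boot_tb_at_init)
{
    assert(now_tb >= boot_tb_at_init);
    uint32_t boot_tb = (uint32_t)boot_tb_at_init; /* truncating store */
    return now_tb - boot_tb;
}
```

When TBU is zero at boot both functions agree, which would explain why the problem only shows up sometimes.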
2010-10-22Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpcLinus Torvalds1-142/+133
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (71 commits)
  powerpc/44x: Update ppc44x_defconfig
  powerpc/watchdog: Make default timeout for Book-E watchdog a Kconfig option
  fsl_rio: Add comments for sRIO registers.
  powerpc/fsl-booke: Add e55xx (64-bit) smp defconfig
  powerpc/fsl-booke: Add p5020 DS board support
  powerpc/fsl-booke64: Use TLB CAMs to cover linear mapping on FSL 64-bit chips
  powerpc/fsl-booke: Add support for FSL Arch v1.0 MMU in setup_page_sizes
  powerpc/fsl-booke: Add support for FSL 64-bit e5500 core
  powerpc/85xx: add cache-sram support
  powerpc/85xx: add ngPIXIS FPGA device tree node to the P1022DS board
  powerpc: Fix compile error with paca code on ppc64e
  powerpc/fsl-booke: Add p3041 DS board support
  oprofile/fsl emb: Don't set MSR[PMM] until after clearing the interrupt.
  powerpc/fsl-booke: Add PCI device ids for P2040/P3041/P5010/P5020 QoirQ chips
  powerpc/mpc8xxx_gpio: Add support for 'qoriq-gpio' controllers
  powerpc/fsl_booke: Add support to boot from core other than 0
  powerpc/p1022: Add probing for individual DMA channels
  powerpc/fsl_soc: Search all global-utilities nodes for rstccr
  powerpc: Fix invalid page flags in create TLB CAM path for PTE_64BIT
  powerpc/mpc83xx: Support for MPC8308 P1M board
  ...

Fix up conflict with the generic irq_work changes in arch/powerpc/kernel/time.c
2010-10-18irq_work: Add generic hardirq context callbacksPeter Zijlstra1-21/+21
Provide a mechanism that allows running code in IRQ context. It is most useful for NMI code that needs to interact with the rest of the system, such as waking up a task to drain buffers. Perf currently has such a mechanism, so extract that and provide it as a generic feature, independent of perf so that others may also benefit. The IRQ context callback is generated through self-IPIs where possible, or on architectures like powerpc the decrementer (the built-in timer facility) is set to generate an interrupt immediately. Architectures that don't have anything like this have to make do with a callback from the timer tick. These architectures can call irq_work_run() at the tail of any IRQ handlers that might enqueue such work (like the perf IRQ handler) to avoid undue latencies in processing the work. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [ various fixes ] Signed-off-by: Huang Ying <ying.huang@intel.com> LKML-Reference: <1287036094.7768.291.camel@yhuang-dev> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-10-14powerpc: export ppc_proc_freq and ppc_tb_freq as GPL symbolsTimur Tabi1-1/+2
Export the global variable 'ppc_tb_freq', so that modules (like the Book-E watchdog driver) can use it. To maintain consistency, ppc_proc_freq is changed to a GPL-only export. This is okay, because any module that needs this symbol should be an actual Linux driver, which must be GPL-licensed. Signed-off-by: Timur Tabi <timur@freescale.com> Acked-by: Josh Boyer <jwboyer@linux.vnet.ibm.com> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
2010-09-02powerpc/pseries: Re-enable dispatch trace log userspace interfacePaul Mackerras1-1/+5
Since the cpu accounting code uses the hypervisor dispatch trace log now when CONFIG_VIRT_CPU_ACCOUNTING = y, the previous commit disabled access to it via files in the /sys/kernel/debug/powerpc/dtl/ directory in that case. This restores those files. To do this, we now have a hook that the cpu accounting code will call as it processes each entry from the hypervisor dispatch trace log. The code in dtl.c now uses that to fill up its ring buffer, rather than having the hypervisor fill the ring buffer directly. This also fixes dtl_file_read() to handle overflow conditions a bit better and adds a spinlock to ensure that race conditions (multiple processes opening or reading the file concurrently) are handled correctly. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-09-02powerpc: Account time using timebase rather than PURRPaul Mackerras1-141/+127
Currently, when CONFIG_VIRT_CPU_ACCOUNTING is enabled, we use the PURR register for measuring the user and system time used by processes, as well as other related times such as hardirq and softirq times. This turns out to be quite confusing for users because it means that a program will often be measured as taking less time when run on a multi-threaded processor (SMT2 or SMT4 mode) than it does when run on a single-threaded processor (ST mode), even though the program takes longer to finish. The discrepancy is accounted for as stolen time, which is also confusing, particularly when there are no other partitions running.

This changes the accounting to use the timebase instead, meaning that the reported user and system times are the actual number of real-time seconds that the program was executing on the processor thread, regardless of which SMT mode the processor is in. Thus a program will generally show greater user and system times when run on a multi-threaded processor than on a single-threaded processor.

On pSeries systems on POWER5 or later processors, we measure the stolen time (time when this partition wasn't running) using the hypervisor dispatch trace log. We check for new entries in the log on every entry from user mode and on every transition from kernel process context to soft or hard IRQ context (i.e. when account_system_vtime() gets called). So that we can correctly distinguish time stolen from user time and time stolen from system time, without having to check the log on every exit to user mode, we store separate timestamps for exit to user mode and entry from user mode.

On systems that have a SPURR (POWER6 and POWER7), we read the SPURR in account_system_vtime() (as before), and then apportion the SPURR ticks since the last time we read it between scaled user time and scaled system time according to the relative proportions of user time and system time over the same interval. This avoids having to read the SPURR on every kernel entry and exit. On systems that have PURR but not SPURR (i.e., POWER5), we do the same using the PURR rather than the SPURR.

This disables the DTL user interface in /sys/kernel/debug/powerpc/dtl for now since it conflicts with the use of the dispatch trace log by the time accounting code.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
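The proportional split of SPURR ticks described above might look roughly like the following. This is a sketch under stated assumptions, not the code in time.c: the struct and function names are hypothetical, the rounding policy (remainder goes to system time) is a choice made here, and overflow of the multiplication is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: scaled user and system time in SPURR ticks. */
struct scaled_split {
    uint64_t user;
    uint64_t system;
};

/* Split delta_spurr between user and system in the ratio utime : stime,
 * where utime and stime are the timebase ticks spent in user and system
 * mode over the same interval. The sketch assumes the product
 * delta_spurr * utime fits in 64 bits. */
struct scaled_split apportion_spurr(uint64_t delta_spurr,
                                    uint64_t utime, uint64_t stime)
{
    struct scaled_split out;
    uint64_t total = utime + stime;

    assert(total > 0); /* nothing to apportion over an empty interval */

    out.user = delta_spurr * utime / total;
    out.system = delta_spurr - out.user; /* remainder goes to system time */
    return out;
}
```

For example, 900 SPURR ticks over an interval with 300 ticks of user time and 100 ticks of system time would be split 675 to 225. Only one SPURR read per accounting call is needed, which is the saving the message describes.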
2010-08-31powerpc/perf_event: Reduce latency of calling perf_event_do_pendingPaul Mackerras1-12/+11
Commit 0fe1ac48 ("powerpc/perf_event: Fix oops due to perf_event_do_pending call") moved the call to perf_event_do_pending in timer_interrupt() down so that it was after the irq_enter() call. Unfortunately this moved it after the code that checks whether it is time for the next decrementer clock event. The result is that the call to perf_event_do_pending() won't happen until the next decrementer clock event is due. This was pointed out by Milton Miller. This fixes it by moving the check for whether it's time for the next decrementer clock event down to the point where we're about to call the event handler, after we've called perf_event_do_pending. This has the side effect that on old pre-Core99 Powermacs where we use the ppc_n_lost_interrupts mechanism to replay interrupts, a replayed interrupt will incur a little more latency since it will now do the code from the irq_enter down to the irq_exit, that it used to skip. However, these machines are now old and rare enough that this doesn't matter. To make it clear that ppc_n_lost_interrupts is only used on Powermacs, and to speed up the code slightly on non-Powermac ppc32 machines, the code that tests ppc_n_lost_interrupts is now conditional on CONFIG_PMAC as well as CONFIG_PPC32. Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: stable@kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-07-28Merge branch 'powerpc.cherry-picks' into timers/clocksourceThomas Gleixner1-133/+9
Conflicts:
	arch/powerpc/kernel/time.c

Reason: The powerpc next tree contains two commits which conflict with the timekeeping changes:

8fd63a9e powerpc: Rework VDSO gettimeofday to prevent time going backwards
c1aa687d powerpc: Clean up obsolete code relating to decrementer and timebase

John Stultz identified them and provided the conflict resolution.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-28powerpc: Clean up obsolete code relating to decrementer and timebasePaul Mackerras1-133/+3
Since the decrementer and timekeeping code was moved over to using the generic clockevents and timekeeping infrastructure, several variables and functions have been obsolete and effectively unused. This deletes them. In particular, wakeup_decrementer() is no longer needed since the generic code reprograms the decrementer as part of the process of resuming the timekeeping code, which happens during sysdev resume. Thus the wakeup_decrementer calls in the suspend_enter methods for 52xx platforms have been removed. The call in the powermac cpu frequency change code has been replaced by set_dec(1), which will cause a timer interrupt as soon as interrupts are enabled, and the generic code will then reprogram the decrementer with the correct value. This also simplifies the generic_suspend_en/disable_irqs functions and makes them static since they are not referenced outside time.c. The preempt_enable/disable calls are removed because the generic code has disabled all but the boot cpu at the point where these functions are called, so we can't be moved to another cpu. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-07-28powerpc: Rework VDSO gettimeofday to prevent time going backwardsPaul Mackerras1-27/+34
Currently it is possible for userspace to see the result of gettimeofday() going backwards by 1 microsecond, assuming that userspace is using the gettimeofday() in the VDSO. The VDSO gettimeofday() algorithm computes the time in "xsecs", which are units of 2^-20 seconds, or approximately 0.954 microseconds, using the algorithm

	now = (timebase - tb_orig_stamp) * tb_to_xs + stamp_xsec

and then converts the time in xsecs to seconds and microseconds.

The kernel updates the tb_orig_stamp and stamp_xsec values every tick in update_vsyscall(). If the length of the tick is not an integer number of xsecs, then some precision is lost in converting the current time to xsecs. For example, with CONFIG_HZ=1000, the tick is 1ms long, which is 1048.576 xsecs. That means that stamp_xsec will advance by either 1048 or 1049 on each tick. With the right conditions, it is possible for userspace to get (timebase - tb_orig_stamp) * tb_to_xs being 1049 if the kernel is slightly late in updating the vdso_datapage, and then for stamp_xsec to advance by 1048 when the kernel does update it, and for userspace to then see (timebase - tb_orig_stamp) * tb_to_xs being zero due to integer truncation. The result is that time appears to go backwards by 1 microsecond.

To fix this we change the VDSO gettimeofday to use a new field in the VDSO datapage which stores the nanoseconds part of the time as a fractional number of seconds in a 0.32 binary fraction format. (Or put another way, as a 32-bit number in units of 0.23283 ns.) This is convenient because we can use the mulhwu instruction to convert it to either microseconds or nanoseconds.

Since it turns out that computing the time of day using this new field is simpler than either using stamp_xsec (as gettimeofday does) or stamp_xtime.tv_nsec (as clock_gettime does), this converts both gettimeofday and clock_gettime to use the new field. The existing __do_get_tspec function is converted to use the new field and take a parameter in r7 that indicates the desired resolution, 1,000,000 for microseconds or 1,000,000,000 for nanoseconds. The __do_get_xsec function is then unused and is deleted.

The new algorithm is

	now = ((timebase - tb_orig_stamp) << 12) * tb_to_xs
		+ (stamp_xtime_seconds << 32) + stamp_sec_fraction

with 'now' in units of 2^-32 seconds. That is then converted to seconds and either microseconds or nanoseconds with

	seconds = now >> 32
	partseconds = ((now & 0xffffffff) * resolution) >> 32

The 32-bit VDSO code also makes a further simplification: it ignores the bottom 32 bits of the tb_to_xs value, which is a 0.64 format binary fraction. Doing so gets rid of 4 multiply instructions. Assuming a timebase frequency of 1GHz or less and an update interval of no more than 10ms, the upper 32 bits of tb_to_xs will be at least 4503599, so the error from ignoring the low 32 bits will be at most 2.2ns, which is more than an order of magnitude less than the time taken to do gettimeofday or clock_gettime on our fastest processors, so there is no possibility of seeing inconsistent values due to this.

This also moves update_gtod() down next to its only caller, and makes update_vsyscall use the time passed in via the wall_time argument rather than accessing xtime directly. At present, wall_time always points to xtime, but that could change in future.

Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
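The final conversion step of the new algorithm can be written out in portable C as a sketch. The struct and function names below are hypothetical. The VDSO itself does this in assembly (the message mentions mulhwu), so this only mirrors the arithmetic of the two formulas above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type for the split time value. */
struct timeparts {
    uint64_t seconds;
    uint32_t partseconds; /* microseconds or nanoseconds, per resolution */
};

/* 'now' is a time in units of 2^-32 seconds: the upper 32 bits are whole
 * seconds and the lower 32 bits are a 0.32 binary fraction of a second.
 * 'resolution' is 1000000 for microseconds or 1000000000 for nanoseconds. */
struct timeparts split_now(uint64_t now, uint32_t resolution)
{
    struct timeparts t;

    /* With resolution < 2^30 the product below stays under 2^62. */
    assert(resolution <= 1000000000u);

    t.seconds = now >> 32;
    t.partseconds =
        (uint32_t)(((now & 0xffffffffULL) * resolution) >> 32);
    return t;
}
```

A fraction of 0x80000000 is exactly half a second, so split_now() with a resolution of 1000000 yields 500000 microseconds, and with 1000000000 yields 500000000 nanoseconds. Because the same 0.32 fraction feeds both resolutions, gettimeofday and clock_gettime can share one code path, as the message notes.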
2010-07-27timekeeping: Fix update_vsyscall to provide wall_to_monotonic offsetJohn Stultz1-4/+4
update_vsyscall() did not provide the wall_to_monotonic offset, so arch specific implementations tend to reference wall_to_monotonic directly. This limits future cleanups in the timekeeping core, so this patch fixes the update_vsyscall interface to provide wall_to_monotonic, allowing wall_to_monotonic to be made static as planned in Documentation/feature-removal-schedule.txt Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Anton Blanchard <anton@samba.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Tony Luck <tony.luck@intel.com> LKML-Reference: <1279068988-21864-7-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-27powerpc: Cleanup xtime usageJohn Stultz1-4/+4
This removes powerpc's direct xtime usage, allowing for further generic timekeeping cleanups. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Anton Blanchard <anton@samba.org> LKML-Reference: <1279068988-21864-6-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-07-27powerpc: Simplify update_vsyscallJohn Stultz1-30/+25
Currently powerpc's update_vsyscall calls an inline update_gtod. However, both are straightforward, and there are no other users, so this patch merges update_gtod into update_vsyscall. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Anton Blanchard <anton@samba.org> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1279068988-21864-5-git-send-email-johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2010-05-12powerpc/perf_event: Fix oops due to perf_event_do_pending callPaul Mackerras1-12/+48
Anton Blanchard found that large POWER systems would occasionally crash in the exception exit path when profiling with perf_events. The symptom was that an interrupt would occur late in the exit path when the MSR[RI] (recoverable interrupt) bit was clear. Interrupts should be hard-disabled at this point but they were enabled. Because the interrupt was not recoverable the system panicked. The reason is that the exception exit path was calling perf_event_do_pending after hard-disabling interrupts, and perf_event_do_pending will re-enable interrupts. The simplest and cleanest fix for this is to use the same mechanism that 32-bit powerpc does, namely to cause a self-IPI by setting the decrementer to 1. This means we can remove the tests in the exception exit path and raw_local_irq_restore. This also makes sure that the call to perf_event_do_pending from timer_interrupt() happens within irq_enter/irq_exit. (Note that calling perf_event_do_pending from timer_interrupt does not mean that there is a possible 1/HZ latency; setting the decrementer to 1 ensures that the timer interrupt will happen immediately, i.e. within one timebase tick, which is a few nanoseconds or 10s of nanoseconds.) Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: stable@kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-17powerpc: Add timer, performance monitor and machine check counts to /proc/interruptsAnton Blanchard1-0/+2
With NO_HZ it is useful to know how often the decrementer is going off. The patch below adds an entry for it and also adds it into the /proc/stat summaries. While here, I added performance monitoring and machine check exceptions. I found it useful to keep an eye on the PMU exception rate when using the perf tool. Since it's possible to take a completely handled machine check on a System p box it also sounds like a good idea to keep a machine check summary. The event naming matches x86 to keep gratuitous differences to a minimum. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2010-02-09powerpc: Only print clockevent settings onceAnton Blanchard1-2/+2
The clockevent multiplier and shift is useful information, but we only need to print it once. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>