summaryrefslogtreecommitdiff
path: root/arch/sparc
AgeCommit message (Collapse)AuthorFilesLines
2014-10-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds11-62/+70
Pull two sparc fixes from David Miller: 1) Fix boots with gcc-4.9 compiled sparc64 kernels. 2) Add missing __get_user_pages_fast() on sparc64 to fix hangs on futexes used in transparent hugepage areas. It's really idiotic to have a weak symbolled fallback that just returns zero, and causes this kind of bug. There should be no backup implementation and the link should fail if the architecture fails to provide __get_user_pages_fast() and supports transparent hugepages. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc64: Implement __get_user_pages_fast(). sparc64: Fix register corruption in top-most kernel stack frame during boot.
2014-10-24sparc64: Implement __get_user_pages_fast().David S. Miller1-0/+30
It is not sufficient to only implement get_user_pages_fast(), you must also implement the atomic version __get_user_pages_fast() otherwise you end up using the weak symbol fallback implementation which simply returns zero. This is dangerous, because it causes the futex code to loop forever if transparent hugepages are supported (see get_futex_key()). Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-24sparc64: Fix register corruption in top-most kernel stack frame during boot.David S. Miller10-62/+40
Meelis Roos reported that kernels built with gcc-4.9 do not boot, we eventually narrowed this down to only impacting machines using UltraSPARC-III and derivitive cpus. The crash happens right when the first user process is spawned: [ 54.451346] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004 [ 54.451346] [ 54.571516] CPU: 1 PID: 1 Comm: init Not tainted 3.16.0-rc2-00211-gd7933ab #96 [ 54.666431] Call Trace: [ 54.698453] [0000000000762f8c] panic+0xb0/0x224 [ 54.759071] [000000000045cf68] do_exit+0x948/0x960 [ 54.823123] [000000000042cbc0] fault_in_user_windows+0xe0/0x100 [ 54.902036] [0000000000404ad0] __handle_user_windows+0x0/0x10 [ 54.978662] Press Stop-A (L1-A) to return to the boot prom [ 55.050713] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000004 Further investigation showed that compiling only per_cpu_patch() with an older compiler fixes the boot. Detailed analysis showed that the function is not being miscompiled by gcc-4.9, but it is using a different register allocation ordering. With the gcc-4.9 compiled function, something during the code patching causes some of the %i* input registers to get corrupted. Perhaps we have a TLB miss path into the firmware that is deep enough to cause a register window spill and subsequent restore when we get back from the TLB miss trap. Let's plug this up by doing two things: 1) Stop using the firmware stack for client interface calls into the firmware. Just use the kernel's stack. 2) As soon as we can, call into a new function "start_early_boot()" to put a one-register-window buffer between the firmware's deepest stack frame and the top-most initial kernel one. Reported-by: Meelis Roos <mroos@linux.ee> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-20Merge git://git.infradead.org/users/eparis/auditLinus Torvalds4-7/+13
Pull audit updates from Eric Paris: "So this change across a whole bunch of arches really solves one basic problem. We want to audit when seccomp is killing a process. seccomp hooks in before the audit syscall entry code. audit_syscall_entry took as an argument the arch of the given syscall. Since the arch is part of what makes a syscall number meaningful it's an important part of the record, but it isn't available when seccomp shoots the syscall... For most arch's we have a better way to get the arch (syscall_get_arch) So the solution was two fold: Implement syscall_get_arch() everywhere there is audit which didn't have it. Use syscall_get_arch() in the seccomp audit code. Having syscall_get_arch() everywhere meant it was a useless flag on the stack and we could get rid of it for the typical syscall entry. The other changes inside the audit system aren't grand, fixed some records that had invalid spaces. Better locking around the task comm field. Removing some dead functions and structs. Make some things static. Really minor stuff" * git://git.infradead.org/users/eparis/audit: (31 commits) audit: rename audit_log_remove_rule to disambiguate for trees audit: cull redundancy in audit_rule_change audit: WARN if audit_rule_change called illegally audit: put rule existence check in canonical order next: openrisc: Fix build audit: get comm using lock to avoid race in string printing audit: remove open_arg() function that is never used audit: correct AUDIT_GET_FEATURE return message type audit: set nlmsg_len for multicast messages. audit: use union for audit_field values since they are mutually exclusive audit: invalid op= values for rules audit: use atomic_t to simplify audit_serial() kernel/audit.c: use ARRAY_SIZE instead of sizeof/sizeof[0] audit: reduce scope of audit_log_fcaps audit: reduce scope of audit_net_id audit: arm64: Remove the audit arch argument to audit_syscall_entry arm64: audit: Add audit hook in syscall_trace_enter/exit() audit: x86: drop arch from __audit_syscall_entry() interface sparc: implement is_32bit_task sparc: properly conditionalize use of TIF_32BIT ...
2014-10-19sparc64: Do not define thread fpregs save area as zero-length array.David S. Miller1-1/+2
This breaks the stack end corruption detection facility. What that facility does it write a magic value to "end_of_stack()" and checking to see if it gets overwritten. "end_of_stack()" is "task_thread_info(p) + 1", which for sparc64 is the beginning of the FPU register save area. So once the user uses the FPU, the magic value is overwritten and the debug checks trigger. Fix this by making the size explicit. Due to the size we use for the fpsaved[], gsr[], and xfsr[] arrays we are limited to 7 levels of FPU state saves. So each FPU register set is 256 bytes, allocate 256 * 7 for the fpregs area. Reported-by: Meelis Roos <mroos@linux.ee> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-19sparc64: Fix corrupted thread fault code.David S. Miller2-6/+6
Every path that ends up at do_sparc64_fault() must install a valid FAULT_CODE_* bitmask in the per-thread fault code byte. Two paths leading to the label winfix_trampoline (which expects the FAULT_CODE_* mask in register %g4) were not doing so: 1) For pre-hypervisor TLB protection violation traps, if we took the 'winfix_trampoline' path we wouldn't have %g4 initialized with the FAULT_CODE_* value yet. Resulting in using the TLB_TAG_ACCESS register address value instead. 2) In the TSB miss path, when we notice that we are going to use a hugepage mapping, but we haven't allocated the hugepage TSB yet, we still have to take the window fixup case into consideration and in that particular path we leave %g4 not setup properly. Errors on this sort were largely invisible previously, but after commit 4ccb9272892c33ef1c19a783cfa87103b30c2784 ("sparc64: sun4v TLB error power off events") we now have a fault_code mask bit (FAULT_CODE_BAD_RA) that triggers due to this bug. FAULT_CODE_BAD_RA triggers because this bit is set in TLB_TAG_ACCESS (see #1 above) and thus we get seemingly random bus errors triggered for user processes. Fixes: 4ccb9272892c ("sparc64: sun4v TLB error power off events") Reported-by: Meelis Roos <mroos@linux.ee> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds2-1/+21
Pull Sparc bugfix from David Miller: "Sparc64 AES ctr mode bug fix" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: sparc64: Fix FPU register corruption with AES crypto offload.
2014-10-15Merge branch 'for-3.18-consistent-ops' of ↵Linus Torvalds10-35/+35
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu consistent-ops changes from Tejun Heo: "Way back, before the current percpu allocator was implemented, static and dynamic percpu memory areas were allocated and handled separately and had their own accessors. The distinction has been gone for many years now; however, the now duplicate two sets of accessors remained with the pointer based ones - this_cpu_*() - evolving various other operations over time. During the process, we also accumulated other inconsistent operations. This pull request contains Christoph's patches to clean up the duplicate accessor situation. __get_cpu_var() uses are replaced with with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr(). Unfortunately, the former sometimes is tricky thanks to C being a bit messy with the distinction between lvalues and pointers, which led to a rather ugly solution for cpumask_var_t involving the introduction of this_cpu_cpumask_var_ptr(). This converts most of the uses but not all. Christoph will follow up with the remaining conversions in this merge window and hopefully remove the obsolete accessors" * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits) irqchip: Properly fetch the per cpu offset percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write. percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t Revert "powerpc: Replace __get_cpu_var uses" percpu: Remove __this_cpu_ptr clocksource: Replace __this_cpu_ptr with raw_cpu_ptr sparc: Replace __get_cpu_var uses avr32: Replace __get_cpu_var with __this_cpu_write blackfin: Replace __get_cpu_var uses tile: Use this_cpu_ptr() for hardware counters tile: Replace __get_cpu_var uses powerpc: Replace __get_cpu_var uses alpha: Replace __get_cpu_var ia64: Replace __get_cpu_var uses s390: cio driver &__get_cpu_var replacements s390: Replace __get_cpu_var uses mips: Replace __get_cpu_var uses MIPS: Replace __get_cpu_var uses in FPU emulator. arm: Replace __this_cpu_ptr with raw_cpu_ptr ...
2014-10-15sparc64: Fix FPU register corruption with AES crypto offload.David S. Miller2-1/+21
The AES loops in arch/sparc/crypto/aes_glue.c use a scheme where the key material is preloaded into the FPU registers, and then we loop over and over doing the crypt operation, reusing those pre-cooked key registers. There are intervening blkcipher*() calls between the crypt operation calls. And those might perform memcpy() and thus also try to use the FPU. The sparc64 kernel FPU usage mechanism is designed to allow such recursive uses, but with a catch. There has to be a trap between the two FPU using threads of control. The mechanism works by, when the FPU is already in use by the kernel, allocating a slot for FPU saving at trap time. Then if, within the trap handler, we try to use the FPU registers, the pre-trap FPU register state is saved into the slot. Then at trap return time we notice this and restore the pre-trap FPU state. Over the long term there are various more involved ways we can make this work, but for a quick fix let's take advantage of the fact that the situation where this happens is very limited. All sparc64 chips that support the crypto instructiosn also are using the Niagara4 memcpy routine, and that routine only uses the FPU for large copies where we can't get the source aligned properly to a multiple of 8 bytes. We look to see if the FPU is already in use in this context, and if so we use the non-large copy path which only uses integer registers. Furthermore, we also limit this special logic to when we are doing kernel copy, rather than a user copy. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-13Merge branch 'locking-arch-for-linus' of ↵Linus Torvalds6-151/+136
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull arch atomic cleanups from Ingo Molnar: "This is a series kept separate from the main locking tree, which cleans up and improves various details in the atomics type handling: - Remove the unused atomic_or_long() method - Consolidate and compress atomic ops implementations between architectures, to reduce linecount and to make it easier to add new ops. - Rewrite generic atomic support to only require cmpxchg() from an architecture - generate all other methods from that" * 'locking-arch-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) locking,arch: Use ACCESS_ONCE() instead of cast to volatile in atomic_read() locking, mips: Fix atomics locking, sparc64: Fix atomics locking,arch: Rewrite generic atomic support locking,arch,xtensa: Fold atomic_ops locking,arch,sparc: Fold atomic_ops locking,arch,sh: Fold atomic_ops locking,arch,powerpc: Fold atomic_ops locking,arch,parisc: Fold atomic_ops locking,arch,mn10300: Fold atomic_ops locking,arch,mips: Fold atomic_ops locking,arch,metag: Fold atomic_ops locking,arch,m68k: Fold atomic_ops locking,arch,m32r: Fold atomic_ops locking,arch,ia64: Fold atomic_ops locking,arch,hexagon: Fold atomic_ops locking,arch,cris: Fold atomic_ops locking,arch,avr32: Fold atomic_ops locking,arch,arm64: Fold atomic_ops locking,arch,arm: Fold atomic_ops ...
2014-10-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparcLinus Torvalds39-754/+1117
Pull sparc updates from David Miller: 1) Move to 4-level page tables on sparc64 and support up to 53-bits of physical addressing. Kernel static image BSS size reduced by several megabytes. 2) M6/M7 cpu support, from Allan Pais. 3) Move to sparse IRQs, handle hypervisor TLB call errors more gracefully, and add T5 perf_event support. From Bob Picco. 4) Recognize cdroms and compute geometry from capacity in virtual disk driver, also from Allan Pais. 5) Fix memset() return value on sparc32, from Andreas Larsson. 6) Respect gfp flags in dma_alloc_coherent on sparc32, from Daniel Hellstrom. 7) Fix handling of compound pages in virtual disk driver, from Dwight Engen. 8) Fix lockdep warnings in LDC layer by moving IRQ requesting to ldc_alloc() from ldc_bind(). 9) Increase boot string length to 1024 bytes, from Dave Kleikamp. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc: (31 commits) sparc64: Fix lockdep warnings on reboot on Ultra-5 sparc64: Increase size of boot string to 1024 bytes sparc64: Kill unnecessary tables and increase MAX_BANKS. sparc64: sparse irq sparc64: Adjust vmalloc region size based upon available virtual address bits. sparc64: Increase MAX_PHYS_ADDRESS_BITS to 53. sparc64: Use kernel page tables for vmemmap. sparc64: Fix physical memory management regressions with large max_phys_bits. sparc64: Adjust KTSB assembler to support larger physical addresses. sparc64: Define VA hole at run time, rather than at compile time. sparc64: Switch to 4-level page tables. sparc64: Fix reversed start/end in flush_tlb_kernel_range() sparc64: Add vio_set_intr() to enable/disable Rx interrupts vio: fix reuse of vio_dring slot sunvdc: limit each sg segment to a page sunvdc: compute vdisk geometry from capacity sunvdc: add cdrom and v1.1 protocol support sparc: VIO protocol version 1.6 sparc64: Fix hibernation code refrence to PAGE_OFFSET. sparc64: Move request_irq() from ldc_bind() to ldc_alloc() ...
2014-10-10sparc64: Fix lockdep warnings on reboot on Ultra-5David S. Miller1-3/+4
Inconsistently, the raw_* IRQ routines do not interact with and update the irqflags tracing and lockdep state, whereas the raw_* spinlock interfaces do. This causes problems in p1275_cmd_direct() because we disable hardirqs by hand using raw_local_irq_restore() and then do a raw_spin_lock() which triggers a lockdep trace because the CPU's hw IRQ state doesn't match IRQ tracing's internal software copy of that state. The CPU's irqs are disabled, yet current->hardirqs_enabled is true. ==================== reboot: Restarting system ------------[ cut here ]------------ WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:3536 check_flags+0x7c/0x240() DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) Modules linked in: openpromfs CPU: 0 PID: 1 Comm: systemd-shutdow Tainted: G W 3.17.0-dirty #145 Call Trace: [000000000045919c] warn_slowpath_common+0x5c/0xa0 [0000000000459210] warn_slowpath_fmt+0x30/0x40 [000000000048f41c] check_flags+0x7c/0x240 [0000000000493280] lock_acquire+0x20/0x1c0 [0000000000832b70] _raw_spin_lock+0x30/0x60 [000000000068f2fc] p1275_cmd_direct+0x1c/0x60 [000000000068ed28] prom_reboot+0x28/0x40 [000000000043610c] machine_restart+0x4c/0x80 [000000000047d2d4] kernel_restart+0x54/0x80 [000000000047d618] SyS_reboot+0x138/0x200 [00000000004060b4] linux_sparc_syscall32+0x34/0x60 ---[ end trace 5c439fe81c05a100 ]--- possible reason: unannotated irqs-off. irq event stamp: 2010267 hardirqs last enabled at (2010267): [<000000000049a358>] vprintk_emit+0x4b8/0x580 hardirqs last disabled at (2010266): [<0000000000499f08>] vprintk_emit+0x68/0x580 softirqs last enabled at (2010046): [<000000000045d278>] __do_softirq+0x378/0x4a0 softirqs last disabled at (2010039): [<000000000042bf08>] do_softirq_own_stack+0x28/0x40 Resetting ... ==================== Use local_* variables of the hw IRQ interfaces so that IRQ tracing sees all of our changes. Reported-by: Meelis Roos <mroos@linux.ee> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-10nosave: consolidate __nosave_{begin,end} in <asm/sections.h>Geert Uytterhoeven1-3/+1
The different architectures used their own (and different) declarations: extern __visible const void __nosave_begin, __nosave_end; extern const void __nosave_begin, __nosave_end; extern long __nosave_begin, __nosave_end; Consolidate them using the first variant in <asm/sections.h>. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09Merge branch 'timers-nohz-for-linus' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "Main changes: - Fix the deadlock reported by Dave Jones et al - Clean up and fix nohz_full interaction with arch abilities - nohz init code consolidation/cleanup" * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: nohz: nohz full depends on irq work self IPI support nohz: Consolidate nohz full init code arm64: Tell irq work about self IPI support arm: Tell irq work about self IPI support x86: Tell irq work about self IPI support irq_work: Force raised irq work to run on irq work interrupt irq_work: Introduce arch_irq_work_has_interrupt() nohz: Move nohz full init call to tick init
2014-10-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds4-15/+66
Pull networking updates from David Miller: "Most notable changes in here: 1) By far the biggest accomplishment, thanks to a large range of contributors, is the addition of multi-send for transmit. This is the result of discussions back in Chicago, and the hard work of several individuals. Now, when the ->ndo_start_xmit() method of a driver sees skb->xmit_more as true, it can choose to defer the doorbell telling the driver to start processing the new TX queue entires. skb->xmit_more means that the generic networking is guaranteed to call the driver immediately with another SKB to send. There is logic added to the qdisc layer to dequeue multiple packets at a time, and the handling mis-predicted offloads in software is now done with no locks held. Finally, pktgen is extended to have a "burst" parameter that can be used to test a multi-send implementation. Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4, virtio_net Adding support is almost trivial, so export more drivers to support this optimization soon. I want to thank, in no particular or implied order, Jesper Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann, David Tat, Hannes Frederic Sowa, and Rusty Russell. 2) PTP and timestamping support in bnx2x, from Michal Kalderon. 3) Allow adjusting the rx_copybreak threshold for a driver via ethtool, and add rx_copybreak support to enic driver. From Govindarajulu Varadarajan. 4) Significant enhancements to the generic PHY layer and the bcm7xxx driver in particular (EEE support, auto power down, etc.) from Florian Fainelli. 5) Allow raw buffers to be used for flow dissection, allowing drivers to determine the optimal "linear pull" size for devices that DMA into pools of pages. The objective is to get exactly the necessary amount of headers into the linear SKB area pre-pulled, but no more. The new interface drivers use is eth_get_headlen(). From WANG Cong, with driver conversions (several had their own by-hand duplicated implementations) by Alexander Duyck and Eric Dumazet. 6) Support checksumming more smoothly and efficiently for encapsulations, and add "foo over UDP" facility. From Tom Herbert. 7) Add Broadcom SF2 switch driver to DSA layer, from Florian Fainelli. 8) eBPF now can load programs via a system call and has an extensive testsuite. Alexei Starovoitov and Daniel Borkmann. 9) Major overhaul of the packet scheduler to use RCU in several major areas such as the classifiers and rate estimators. From John Fastabend. 10) Add driver for Intel FM10000 Ethernet Switch, from Alexander Duyck. 11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric Dumazet. 12) Add Datacenter TCP congestion control algorithm support, From Florian Westphal. 13) Reorganize sk_buff so that __copy_skb_header() is significantly faster. From Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits) netlabel: directly return netlbl_unlabel_genl_init() net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers net: description of dma_cookie cause make xmldocs warning cxgb4: clean up a type issue cxgb4: potential shift wrapping bug i40e: skb->xmit_more support net: fs_enet: Add NAPI TX net: fs_enet: Remove non NAPI RX r8169:add support for RTL8168EP net_sched: copy exts->type in tcf_exts_change() wimax: convert printk to pr_foo() af_unix: remove 0 assignment on static ipv6: Do not warn for informational ICMP messages, regardless of type. Update Intel Ethernet Driver maintainers list bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING tipc: fix bug in multicast congestion handling net: better IFF_XMIT_DST_RELEASE support net/mlx4_en: remove NETDEV_TX_BUSY 3c59x: fix bad split of cpu_to_le32(pci_map_single()) net: bcmgenet: fix Tx ring priority programming ...
2014-10-08Merge tag 'tty-3.18-rc1' of ↵Linus Torvalds1-0/+2
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial driver updates from Greg KH: "Here's the big tty/serial driver patchset for 3.18-rc1. Lots of little things in here, some good work from Peter Hurley on the tty core, and in lots of drivers. There are also lots of other driver updates in here as well, full details in the changelogs. All have been in the linux-next tree for a while" * tag 'tty-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (99 commits) Revert "serial/core: Initialize the console pm state" tty: serial: 8250: use 32bit variable for rpm_tx_active tty: serial: msm: Add earlycon support serial/core: Initialize the console pm state serial: asc: Conditionally use readl_relaxed (COMPILE_TEST) serial: of-serial: add PM suspend/resume support m68k: AMIGA_BUILTIN_SERIAL should depend on TTY asm/uapi: Add definition of TIOC[SG]RS485 tty/metag_da: Add console_poll module parameter serial: 8250_pci: remove rts_n override from Baytrail quirk serial: cadence: Add generic earlycon support serial: imx: change the wait even to interruptiable serial: imx: terminate the RX DMA when the UART is suspending serial: imx: fix throttle/unthrottle callbacks for hardware assisted flow control serial: 8250: Add Quark X1000 to 8250_pci.c tty: omap-serial: pull out calculation from baud_is_mode16 tty: omap-serial: fix division by zero xen_hvc: no reason to write the type key on xenstore tty: serial: 8250_core: remove UART_IER_RDI in serial8250_stop_rx() tty: serial: 8250_core: use the ->line argument as a hint in serial8250_find_match_or_unused() ...
2014-10-07sparc64: Increase size of boot string to 1024 bytesDave Kleikamp1-1/+4
This is the longest boot string that silo supports. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Bob Picco <bob.picco@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06sparc64: Kill unnecessary tables and increase MAX_BANKS.David S. Miller3-26/+5
swapper_low_pmd_dir and swapper_pud_dir are actually completely useless and unnecessary. We just need swapper_pg_dir[]. Naturally the other page table chunks will be allocated on an as-needed basis. Since the kernel actually accesses these tables in the PAGE_OFFSET view, there is not even a TLB locality advantage of placing them in the kernel image. Use the hard coded vmlinux.ld.S slot for swapper_pg_dir which is naturally page aligned. Increase MAX_BANKS to 1024 in order to handle heavily fragmented virtual guests. Even with this MAX_BANKS increase, the kernel is 20K+ smaller. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-06sparc64: sparse irqbob picco3-174/+341
This patch attempts to do a few things. The highlights are: 1) enable SPARSE_IRQ unconditionally, 2) kills off !SPARSE_IRQ code 3) allocates ivector_table at boot time and 4) default to cookie only VIRQ mechanism for supported firmware. The first firmware with cookie only support for me appears on T5. You can optionally force the HV firmware to not cookie only mode which is the sysino support. The sysino is a deprecated HV mechanism according to the most recent SPARC Virtual Machine Specification. HV_GRP_INTR is what controls the cookie/sysino firmware versioning. The history of this interface is: 1) Major version 1.0 only supported sysino based interrupt interfaces. 2) Major version 2.0 added cookie based VIRQs, however due to the fact that OSs were using the VIRQs without negoatiating major version 2.0 (Linux and Solaris are both guilty), the VIRQs calls were allowed even with major version 1.0 To complicate things even further, the VIRQ interfaces were only actually hooked up in the hypervisor for LDC interrupt sources. VIRQ calls on other device types would result in HV_EINVAL errors. So effectively, major version 2.0 is unusable. 3) Major version 3.0 was created to signal use of VIRQs and the fact that the hypervisor has these calls hooked up for all interrupt sources, not just those for LDC devices. A new boot option is provided should cookie only HV support have issues. hvirq - this is the version for HV_GRP_INTR. This is related to HV API versioning. The code attempts major=3 first by default. The option can be used to override this default. I've tested with SPARSE_IRQ on T5-8, M7-4 and T4-X and Jalap?no. Signed-off-by: Bob Picco <bob.picco@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-06sparc64: Adjust vmalloc region size based upon available virtual address bits.David S. Miller4-20/+28
In order to accomodate embedded per-cpu allocation with large numbers of cpus and numa nodes, we have to use as much virtual address space as possible for the vmalloc region. Otherwise we can get things like: PERCPU: max_distance=0x380001c10000 too large for vmalloc space 0xff00000000 So, once we select a value for PAGE_OFFSET, derive the size of the vmalloc region based upon that. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-06sparc64: Increase MAX_PHYS_ADDRESS_BITS to 53.David S. Miller3-5/+16
Make sure, at compile time, that the kernel can properly support whatever MAX_PHYS_ADDRESS_BITS is defined to. On M7 chips, use a max_phys_bits value of 49. Based upon a patch by Bob Picco. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-06sparc64: Use kernel page tables for vmemmap.David S. Miller3-56/+36
For sparse memory configurations, the vmemmap array behaves terribly and it takes up an inordinate amount of space in the BSS section of the kernel image unconditionally. Just build huge PMDs and look them up just like we do for TLB misses in the vmalloc area. Kernel BSS shrinks by about 2MB. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-06sparc64: Fix physical memory management regressions with large max_phys_bits.David S. Miller7-374/+244
If max_phys_bits needs to be > 43 (f.e. for T4 chips), things like DEBUG_PAGEALLOC stop working because the 3-level page tables only can cover up to 43 bits. Another problem is that when we increased MAX_PHYS_ADDRESS_BITS up to 47, several statically allocated tables became enormous. Compounding this is that we will need to support up to 49 bits of physical addressing for M7 chips. The two tables in question are sparc64_valid_addr_bitmap and kpte_linear_bitmap. The first holds a bitmap, with 1 bit for each 4MB chunk of physical memory, indicating whether that chunk actually exists in the machine and is valid. The second table is a set of 2-bit values which tell how large of a mapping (4MB, 256MB, 2GB, 16GB, respectively) we can use at each 256MB chunk of ram in the system. These tables are huge and take up an enormous amount of the BSS section of the sparc64 kernel image. Specifically, the sparc64_valid_addr_bitmap is 4MB, and the kpte_linear_bitmap is 128K. So let's solve the space wastage and the DEBUG_PAGEALLOC problem at the same time, by using the kernel page tables (as designed) to manage this information. We have to keep using large mappings when DEBUG_PAGEALLOC is disabled, and we do this by encoding huge PMDs and PUDs. On a T4-2 with 256GB of ram the kernel page table takes up 16K with DEBUG_PAGEALLOC disabled and 256MB with it enabled. Furthermore, this memory is dynamically allocated at run time rather than coded statically into the kernel image. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-06sparc64: Adjust KTSB assembler to support larger physical addresses.David S. Miller2-21/+37
As currently coded the KTSB accesses in the kernel only support up to 47 bits of physical addressing. Adjust the instruction and patching sequence in order to support arbitrary 64 bits addresses. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-06sparc64: Define VA hole at run time, rather than at compile time.David S. Miller2-11/+25
Now that we use 4-level page tables, we can provide up to 53-bits of virtual address space to the user. Adjust the VA hole based upon the capabilities of the cpu type probed. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-06sparc64: Switch to 4-level page tables.David S. Miller6-10/+109
This has become necessary with chips that support more than 43-bits of physical addressing. Based almost entirely upon a patch by Bob Picco. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Bob Picco <bob.picco@oracle.com>
2014-10-05sparc64: Fix reversed start/end in flush_tlb_kernel_range()David S. Miller1-2/+2
When we have to split up a flush request into multiple pieces (in order to avoid the firmware range) we don't specify the arguments in the right order for the second piece. Fix the order, or else we get hangs as the code tries to flush "a lot" of entries and we get lockups like this: [ 4422.981276] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [expect:117032] [ 4422.996130] Modules linked in: ipv6 loop usb_storage igb ptp sg sr_mod ehci_pci ehci_hcd pps_core n2_rng rng_core [ 4423.016617] CPU: 12 PID: 117032 Comm: expect Not tainted 3.17.0-rc4+ #1608 [ 4423.030331] task: fff8003cc730e220 ti: fff8003d99d54000 task.ti: fff8003d99d54000 [ 4423.045282] TSTATE: 0000000011001602 TPC: 00000000004521e8 TNPC: 00000000004521ec Y: 00000000 Not tainted [ 4423.064905] TPC: <__flush_tlb_kernel_range+0x28/0x40> [ 4423.074964] g0: 000000000052fd10 g1: 00000001295a8000 g2: ffffff7176ffc000 g3: 0000000000002000 [ 4423.092324] g4: fff8003cc730e220 g5: fff8003dfedcc000 g6: fff8003d99d54000 g7: 0000000000000006 [ 4423.109687] o0: 0000000000000000 o1: 0000000000000000 o2: 0000000000000003 o3: 00000000f0000000 [ 4423.127058] o4: 0000000000000080 o5: 00000001295a8000 sp: fff8003d99d56d01 ret_pc: 000000000052ff54 [ 4423.145121] RPC: <__purge_vmap_area_lazy+0x314/0x3a0> [ 4423.155185] l0: 0000000000000000 l1: 0000000000000000 l2: 0000000000a38040 l3: 0000000000000000 [ 4423.172559] l4: fff8003dae8965e0 l5: ffffffffffffffff l6: 0000000000000000 l7: 00000000f7e2b138 [ 4423.189913] i0: fff8003d99d576a0 i1: fff8003d99d576a8 i2: fff8003d99d575e8 i3: 0000000000000000 [ 4423.207284] i4: 0000000000008008 i5: fff8003d99d575c8 i6: fff8003d99d56df1 i7: 0000000000530c24 [ 4423.224640] I7: <free_vmap_area_noflush+0x64/0x80> [ 4423.234193] Call Trace: [ 4423.239051] [0000000000530c24] free_vmap_area_noflush+0x64/0x80 [ 4423.251029] [0000000000531a7c] remove_vm_area+0x5c/0x80 [ 4423.261628] [0000000000531b80] __vunmap+0x20/0x120 [ 4423.271352] [000000000071cf18] n_tty_close+0x18/0x40 [ 4423.281423] [00000000007222b0] tty_ldisc_close+0x30/0x60 [ 4423.292183] [00000000007225a4] tty_ldisc_reinit+0x24/0xa0 [ 4423.303120] [0000000000722ab4] tty_ldisc_hangup+0xd4/0x1e0 [ 4423.314232] [0000000000719aa0] __tty_hangup+0x280/0x3c0 [ 4423.324835] [0000000000724cb4] pty_close+0x134/0x1a0 [ 4423.334905] [000000000071aa24] tty_release+0x104/0x500 [ 4423.345316] [00000000005511d0] __fput+0x90/0x1e0 [ 4423.354701] [000000000047fa54] task_work_run+0x94/0xe0 [ 4423.365126] [0000000000404b44] __handle_signal+0xc/0x2c Fixes: 4ca9a23765da ("sparc64: Guard against flushing openfirmware mappings.") Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-03locking,arch: Use ACCESS_ONCE() instead of cast to volatile in atomic_read()Pranith Kumar2-3/+3
Use the much more reader friendly ACCESS_ONCE() instead of the cast to volatile. This is purely a stylistic change. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: Hans-Christian Egtvedt <egtvedt@samfundet.no> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Link: http://lkml.kernel.org/r/1411482607-20948-1-git-send-email-bobby.prani@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-10-01sparc64: Add vio_set_intr() to enable/disable Rx interruptsSowmini Varadhan2-1/+14
The vio_set_intr() API should be used by VIO consumers to enable/disable Rx interrupts to facilitate deferred processing in softirq/bottom-half context. Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01vio: fix reuse of vio_dring slotDwight Engen1-1/+1
vio_dring_avail() will allow use of every dring entry, but when the last entry is allocated then dr->prod == dr->cons which is indistinguishable from the ring empty condition. This causes the next allocation to reuse an entry. When this happens in sunvdc, the server side vds driver begins nack'ing the messages and ends up resetting the ldc channel. This problem does not effect sunvnet since it checks for < 2. The fix here is to just never allocate the very last dring slot so that full and empty are not the same condition. The request start path was changed to check for the ring being full a bit earlier, and to stop the blk_queue if there is no space left. The blk_queue will be restarted once the ring is only half full again. The number of ring entries was increased to 512 which matches the sunvnet and Solaris vdc drivers, and greatly reduces the frequency of hitting the ring full condition and the associated blk_queue stop/starting. The checks in sunvent were adjusted to account for vio_dring_avail() returning 1 less. Orabug: 19441666 OraBZ: 14983 Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01sunvdc: add cdrom and v1.1 protocol supportAllen Pais1-3/+9
Interpret the media type from v1.1 protocol to support CDROM/DVD. For v1.0 protocol, a disk's size continues to be calculated from the geometry returned by the vdisk server. The geometry returned by the server can be less than the actual number of sectors available in the backing image/device due to the rounding in the division used to compute the geometry in the vdisk server. In v1.1 protocol a disk's actual size in sectors is returned during the handshake. Use this size when v1.1 protocol is negotiated. Since this size will always be larger than the former geometry computed size, disks created under v1.0 will be forwards compatible to v1.1, but not vice versa. Signed-off-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01sparc: VIO protocol version 1.6David L Stevens2-4/+54
Add VIO protocol version 1.6 interfaces. Signed-off-by: David L Stevens <david.stevens@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01sunvnet: allow admin to set sunvnet MTUDavid L Stevens1-1/+1
This patch allows an admin to set the MTU on a sunvnet device to arbitrary values between the minimum (68) and maximum (65535) IPv4 packet sizes. Signed-off-by: David L Stevens <david.stevens@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01sunvnet: upgrade to VIO protocol version 1.6David L Stevens2-4/+54
This patch upgrades the sunvnet driver to support VIO protocol version 1.6. In particular, it adds per-port MTU negotiation, allowing MTUs other than ETH_FRAMELEN with ports using newer VIO protocol versions. Signed-off-by: David L Stevens <david.stevens@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-27sparc64: Fix hibernation code refrence to PAGE_OFFSET.David S. Miller1-2/+2
We changed PAGE_OFFSET to be a variable rather than a constant, but this reference here in the hibernate assembler got missed. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-27sparc: bpf_jit: add support for BPF_LD(X) | BPF_LEN instructionsAlexei Starovoitov1-1/+6
BPF_LD | BPF_W | BPF_LEN instruction is occasionally used by tcpdump and present in 11 tests in lib/test_bpf.c Teach sparc JIT compiler to emit it. Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-1/+5
2014-09-24sparc: bpf_jit: fix loads from negative offsetsAlexei Starovoitov2-1/+4
- fix BPF_LD|ABS|IND from negative offsets: make sure to sign extend lower 32 bits in 64-bit register before calling C helpers from JITed code, otherwise 'int k' argument of bpf_internal_load_pointer_neg_helper() function will be added as large unsigned integer, causing packet size check to trigger and abort the program. It's worth noting that JITed code for 'A = A op K' will affect upper 32 bits differently depending whether K is simm13 or not. Since small constants are sign extended, whereas large constants are stored in temp register and zero extended. That is ok and we don't have to pay a penalty of sign extension for every sethi, since all classic BPF instructions have 32-bit semantics and we only need to set correct upper bits when transitioning from JITed code into C. - though instructions 'A &= 0' and 'A *= 0' are odd, JIT compiler should not optimize them out Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-24sparc: Set CONFIG_NET=y in defconfigsMichal Marek1-0/+1
Commit 5d6be6a5 ("scsi_netlink : Make SCSI_NETLINK dependent on NET instead of selecting NET") removed what happened to be the only instance of 'select NET'. Defconfigs that were relying on the select now lack networking support. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: sparclinux@vger.kernel.org Signed-off-by: Michal Marek <mmarek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-24sparc: implement is_32bit_taskEric Paris3-6/+5
We are currently embedding the same check from thread_info.h into syscall.h thanks to the way syscall_get_arch() was implemented in the audit tree. Instead create a new function, is_32bit_task() which is similar to that found on the powerpc arch. This simplifies the syscall.h code and makes the build/Kconfig requirements much easier to understand. Signed-off-by: Eric Paris <eparis@redhat.com Acked-by: David S. Miller <davem@davemloft.net> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: sparclinux@vger.kernel.org
2014-09-24sparc: properly conditionalize use of TIF_32BITStephen Rothwell1-0/+4
After merging the audit tree, today's linux-next build (sparc defconfig) failed like this: In file included from include/linux/audit.h:29:0, from mm/mmap.c:33: arch/sparc/include/asm/syscall.h: In function 'syscall_get_arch': arch/sparc/include/asm/syscall.h:131:9: error: 'TIF_32BIT' undeclared (first use in this function) arch/sparc/include/asm/syscall.h:131:9: note: each undeclared identifier is reported only once for each function it appears in And many more ... Caused by commit 374c0c054122 ("ARCH: AUDIT: implement syscall_get_arch for all arches"). This patch wraps the usage of TIF_32BIT in: if defined(__sparc__) && defined(__arch64__) Which solves the build problem. Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
2014-09-24ARCH: AUDIT: audit_syscall_entry() should not require the archEric Paris1-7/+2
We have a function where the arch can be queried, syscall_get_arch(). So rather than have every single piece of arch specific code use and/or duplicate syscall_get_arch(), just have the audit code use the syscall_get_arch() code. Based-on-patch-by: Richard Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Cc: linux-alpha@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-ia64@vger.kernel.org Cc: microblaze-uclinux@itee.uq.edu.au Cc: linux-mips@linux-mips.org Cc: linux@lists.openrisc.net Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-s390@vger.kernel.org Cc: linux-sh@vger.kernel.org Cc: sparclinux@vger.kernel.org Cc: user-mode-linux-devel@lists.sourceforge.net Cc: linux-xtensa@linux-xtensa.org Cc: x86@kernel.org
2014-09-24ARCH: AUDIT: implement syscall_get_arch for all archesEric Paris1-0/+8
For all arches which support audit implement syscall_get_arch() They are all pretty easy and straight forward, stolen from how the call to audit_syscall_entry() determines the arch. Based-on-patch-by: Richard Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Cc: linux-ia64@vger.kernel.org Cc: microblaze-uclinux@itee.uq.edu.au Cc: linux-mips@linux-mips.org Cc: linux@lists.openrisc.net Cc: linux-parisc@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: sparclinux@vger.kernel.org
2014-09-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-7/+18
Conflicts: arch/mips/net/bpf_jit.c drivers/net/can/flexcan.c Both the flexcan and MIPS bpf_jit conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-20sparc: bpf_jit: fix support for ldx/stx mem and SKF_AD_VLAN_TAGAlexei Starovoitov1-7/+18
fix several issues in sparc BPF JIT compiler. ldx/stx related: . classic BPF instructions that access mem[] slots were not setting SEEN_MEM flag, so stack wasn't allocated. Fix that by advertising correct flags . LDX/STX instructions were missing SEEN_XREG, so register value could have leaked to user space. Fix it. . since stack for mem[] slots is allocated with 'sub %sp' instead of 'save %sp', use %sp as base register instead of %fp. . ldx mem[0] means first slot in classic BPF which should have -4 offset instead of 0. . sparc64 needs 2047 stack bias as per ABI to access stack . emit_stmem() was using LD32I macro instead of ST32I SKF_AD_VLAN_TAG* related: . SKF_AD_VLAN_TAG_PRESENT must return 1 or 0 instead of '> 0' or 0 as per classic BPF de facto standard . SKF_AD_VLAN_TAG needs to mask the field correctly Fixes: 2809a2087cc4 ("net: filter: Just In Time compiler for sparc") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-19sparc: bpf_jit: add SKF_AD_PKTTYPE support to JITAlexei Starovoitov1-7/+2
commit 233577a22089 ("net: filter: constify detection of pkt_type_offset") allows us to implement simple PKTTYPE support in sparc JIT Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-17sparc64: Move request_irq() from ldc_bind() to ldc_alloc()Sowmini Varadhan4-26/+28
The request_irq() needs to be done from ldc_alloc() to avoid the following (caught by lockdep) [00000000004a0738] __might_sleep+0xf8/0x120 [000000000058bea4] kmem_cache_alloc_trace+0x184/0x2c0 [00000000004faf80] request_threaded_irq+0x80/0x160 [000000000044f71c] ldc_bind+0x7c/0x220 [0000000000452454] vio_port_up+0x54/0xe0 [00000000101f6778] probe_disk+0x38/0x220 [sunvdc] [00000000101f6b8c] vdc_port_probe+0x22c/0x300 [sunvdc] [0000000000451a88] vio_device_probe+0x48/0x60 [000000000074c56c] really_probe+0x6c/0x300 [000000000074c83c] driver_probe_device+0x3c/0xa0 [000000000074c92c] __driver_attach+0x8c/0xa0 [000000000074a6ec] bus_for_each_dev+0x6c/0xa0 [000000000074c1dc] driver_attach+0x1c/0x40 [000000000074b0fc] bus_add_driver+0xbc/0x280 Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Dwight Engen <dwight.engen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-17sparc64: T5 PMUbob picco5-5/+73
The T5 (niagara5) has different PCR related HV fast trap values and a new HV API Group. This patch utilizes these and shares when possible with niagara4. We use the same sparc_pmu niagara4_pmu. Should there be new effort to obtain the MCU perf statistics then this would have to be changed. Cc: sparclinux@vger.kernel.org Signed-off-by: Bob Picco <bob.picco@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-17sparc64: mem boot option correctionbob picco2-16/+51
The "mem" boot option can result in many unexpected consequences. This patch attempts to prevent boot hangs which have been experienced on T4-4 and T5-8. Basically the boot loader allocates vmlinuz and initrd higher in available OBP physical memory. For example, on a 2Tb T5-8 it isn't possible to boot with mem=20G. The patch utilizes memblock to avoid reserved regions and trim memory which is only free. Other improvements are possible for a multi-node machine. This is a snippet of the boot log with mem=20G on T5-8 with the patch applied: MEMBLOCK configuration: <- before memory reduction memory size = 0x1ffad6ce000 reserved size = 0xa1adf44 memory.cnt = 0xb memory[0x0] [0x00000030400000-0x00003fdde47fff], 0x3fada48000 bytes memory[0x1] [0x00003fdde4e000-0x00003fdde4ffff], 0x2000 bytes memory[0x2] [0x00080000000000-0x00083fffffffff], 0x4000000000 bytes memory[0x3] [0x00100000000000-0x00103fffffffff], 0x4000000000 bytes memory[0x4] [0x00180000000000-0x00183fffffffff], 0x4000000000 bytes memory[0x5] [0x00200000000000-0x00203fffffffff], 0x4000000000 bytes memory[0x6] [0x00280000000000-0x00283fffffffff], 0x4000000000 bytes memory[0x7] [0x00300000000000-0x00303fffffffff], 0x4000000000 bytes memory[0x8] [0x00380000000000-0x00383fffc71fff], 0x3fffc72000 bytes memory[0x9] [0x00383fffc92000-0x00383fffca1fff], 0x10000 bytes memory[0xa] [0x00383fffcb4000-0x00383fffcb5fff], 0x2000 bytes reserved.cnt = 0x2 reserved[0x0] [0x00380000000000-0x0038000117e7f8], 0x117e7f9 bytes reserved[0x1] [0x00380004000000-0x0038000d02f74a], 0x902f74b bytes ... MEMBLOCK configuration: <- after reduction of memory memory size = 0x50a1adf44 reserved size = 0xa1adf44 memory.cnt = 0x4 memory[0x0] [0x00380000000000-0x0038000117e7f8], 0x117e7f9 bytes memory[0x1] [0x00380004000000-0x0038050d01d74a], 0x50901d74b bytes memory[0x2] [0x00383fffc92000-0x00383fffca1fff], 0x10000 bytes memory[0x3] [0x00383fffcb4000-0x00383fffcb5fff], 0x2000 bytes reserved.cnt = 0x2 reserved[0x0] [0x00380000000000-0x0038000117e7f8], 0x117e7f9 bytes reserved[0x1] [0x00380004000000-0x0038000d02f74a], 0x902f74b bytes ... Early memory node ranges node 7: [mem 0x380000000000-0x38000117dfff] node 7: [mem 0x380004000000-0x380f0d01bfff] node 7: [mem 0x383fffc92000-0x383fffca1fff] node 7: [mem 0x383fffcb4000-0x383fffcb5fff] Could not find start_pfn for node 0 Could not find start_pfn for node 1 Could not find start_pfn for node 2 Could not find start_pfn for node 3 Could not find start_pfn for node 4 Could not find start_pfn for node 5 Could not find start_pfn for node 6 . The patch was tested on T4-1, T5-8 and Jalap?no. Cc: sparclinux@vger.kernel.org Signed-off-by: Bob Picco <bob.picco@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-17sparc64: find_node adjustmentbob picco1-1/+4
We have seen an issue with guest boot into LDOM that causes early boot failures because of no matching rules for node identitity of the memory. I analyzed this on my T4 and concluded there might not be a solution. I saw the issue in mainline too when booting into the control/primary domain - with guests configured. Note, this could be a firmware bug on some older machines. I'll provide a full explanation of the issues below. Should we not find a matching BEST latency group for a real address (RA) then we will assume node 0. On the T4-2 here with the information provided I can't see an alternative. Technically the LDOM shown below should match the MBLOCK to the favorable latency group. However other factors must be considered too. Were the memory controllers configured "fine" grained interleave or "coarse" grain interleaved - T4. Also should a "group" MD node be considered a NUMA node? There has to be at least one Machine Description (MD) "group" and hence one NUMA node. The group can have one or more latency groups (lg) - more than one memory controller. The current code chooses the smallest latency as the most favorable per group. The latency and lg information is in MLGROUP below. MBLOCK is the base and size of the RAs for the machine as fetched from OBP /memory "available" property. My machine has one MBLOCK but more would be possible - with holes? For a T4-2 the following information has been gathered: with LDOM guest MEMBLOCK configuration: memory size = 0x27f870000 memory.cnt = 0x3 memory[0x0] [0x00000020400000-0x0000029fc67fff], 0x27f868000 bytes memory[0x1] [0x0000029fd8a000-0x0000029fd8bfff], 0x2000 bytes memory[0x2] [0x0000029fd92000-0x0000029fd97fff], 0x6000 bytes reserved.cnt = 0x2 reserved[0x0] [0x00000020800000-0x000000216c15c0], 0xec15c1 bytes reserved[0x1] [0x00000024800000-0x0000002c180c1e], 0x7980c1f bytes MBLOCK[0]: base[20000000] size[280000000] offset[0] (note: "base" and "size" reported in "MBLOCK" encompass the "memory[X]" values) (note: (RA + offset) & mask = val is the formula to detect a match for the memory controller. should there be no match for find_node node, a return value of -1 resulted for the node - BAD) There is one group. It has these forward links MLGROUP[1]: node[545] latency[1f7e8] match[200000000] mask[200000000] MLGROUP[2]: node[54d] latency[2de60] match[0] mask[200000000] NUMA NODE[0]: node[545] mask[200000000] val[200000000] (latency[1f7e8]) (note: "val" is the best lg's (smallest latency) "match") no LDOM guest - bare metal MEMBLOCK configuration: memory size = 0xfdf2d0000 memory.cnt = 0x3 memory[0x0] [0x00000020400000-0x00000fff6adfff], 0xfdf2ae000 bytes memory[0x1] [0x00000fff6d2000-0x00000fff6e7fff], 0x16000 bytes memory[0x2] [0x00000fff766000-0x00000fff771fff], 0xc000 bytes reserved.cnt = 0x2 reserved[0x0] [0x00000020800000-0x00000021a04580], 0x1204581 bytes reserved[0x1] [0x00000024800000-0x0000002c7d29fc], 0x7fd29fd bytes MBLOCK[0]: base[20000000] size[fe0000000] offset[0] there are two groups group node[16d5] MLGROUP[0]: node[1765] latency[1f7e8] match[0] mask[200000000] MLGROUP[3]: node[177d] latency[2de60] match[200000000] mask[200000000] NUMA NODE[0]: node[1765] mask[200000000] val[0] (latency[1f7e8]) group node[171d] MLGROUP[2]: node[1775] latency[2de60] match[0] mask[200000000] MLGROUP[1]: node[176d] latency[1f7e8] match[200000000] mask[200000000] NUMA NODE[1]: node[176d] mask[200000000] val[200000000] (latency[1f7e8]) (note: for this two "group" bare metal machine, 1/2 memory is in group one's lg and 1/2 memory is in group two's lg). Cc: sparclinux@vger.kernel.org Signed-off-by: Bob Picco <bob.picco@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>