Age | Commit message (Collapse) | Author | Files | Lines |
|
We don't need the NULL check of np, the result is the same because the
OF helpers cope with NULL, of_node_to_nid(NULL) == NUMA_NO_NODE (-1).
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200313112020.28235-1-mpe@ellerman.id.au
|
|
The reason debuggers add an ASCII dump to other types of memory dumps
is to give the user visual reference points in the case that ASCII
strings are adjacent to other structures or element. For example,
when examining the task_struct structure one can look for the comm[]
string and use it to locate other important elements.
ASCII strings do not have endianess, they exist in memory in the same
order regardless of CPU endianess. ASCII strings are, by definition,
human readable and so should be presented in a human readable format.
For these reasons, the supplemental ASCII dump does not re-order
the strings from memory to match the endianess of the corresponding
16, 32, or 64 bit words. That would make the ASCII dump much less
useful.
Signed-off-by: Douglas Miller <dougmill@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/1488205694-13337-1-git-send-email-dougmill@linux.vnet.ibm.com
|
|
As does XMON, the debugfs file /sys/kernel/debug/powerpc/xive exposes
the XIVE internal state of the machine CPUs and interrupts. Available
on the PowerNV and sPAPR platforms.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
[mpe: Make the debugfs file 0400]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200306150143.5551-5-clg@kaod.org
|
|
Some firmwares or hypervisors can advertise different source
characteristics. Track their value under XMON. What we are mostly
interested in is the StoreEOI flag.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200306150143.5551-4-clg@kaod.org
|
|
The PowerNV platform has multiple IRQ chips and the xmon command
dumping the state of the XIVE interrupt should only operate on the
XIVE IRQ chip.
Fixes: 5896163f7f91 ("powerpc/xmon: Improve output of XIVE interrupts")
Cc: stable@vger.kernel.org # v5.4+
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200306150143.5551-3-clg@kaod.org
|
|
When a CPU is brought up, an IPI number is allocated and recorded
under the XIVE CPU structure. Invalid IPI numbers are tracked with
interrupt number 0x0.
On the PowerNV platform, the interrupt number space starts at 0x10 and
this works fine. However, on the sPAPR platform, it is possible to
allocate the interrupt number 0x0 and this raises an issue when CPU 0
is unplugged. The XIVE spapr driver tracks allocated interrupt numbers
in a bitmask and it is not correctly updated when interrupt number 0x0
is freed. It stays allocated and it is then impossible to reallocate.
Fix by using the XIVE_BAD_IRQ value instead of zero on both platforms.
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Fixes: eac1e731b59e ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200306150143.5551-2-clg@kaod.org
|
|
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
[mpe: Drop changes to a/p/boot which doesn't use linux headers]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190812215052.71840-10-ndesaulniers@google.com
|
|
At present, on Power systems with Protected Execution Facility
hardware and an ultravisor, a KVM guest can transition to being a
secure guest at will. Userspace (QEMU) has no way of knowing
whether a host system is capable of running secure guests. This
will present a problem in future when the ultravisor is capable of
migrating secure guests from one host to another, because
virtualization management software will have no way to ensure that
secure guests only run in domains where all of the hosts can
support secure guests.
This adds a VM capability which has two functions: (a) userspace
can query it to find out whether the host can support secure guests,
and (b) userspace can enable it for a guest, which allows that
guest to become a secure guest. If userspace does not enable it,
KVM will return an error when the ultravisor does the hypercall
that indicates that the guest is starting to transition to a
secure guest. The ultravisor will then abort the transition and
the guest will terminate.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
|
|
The core device API performs extra housekeeping bits that are missing
from directly calling cpu_up/down.
See commit a6717c01ddc2 ("powerpc/rtas: use device model APIs and
serialization during LPM") for an example description of what might go
wrong.
This also prepares to make cpu_up/down() a private interface of the CPU
subsystem.
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lkml.kernel.org/r/20200323135110.30522-11-qais.yousef@arm.com
|
|
Add SPDX License Identifier to all .gitignore files.
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The if statement that this comment referred to was removed in
commit 11fdb309341c ("powerpc/prom_init: Remove support for OPAL v2").
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200324182912.1048906-1-farosas@linux.ibm.com
|
|
When a program check exception happens while MMU translation is
disabled, following Oops happens in kprobe_handler() in the following
code:
} else if (*addr != BREAKPOINT_INSTRUCTION) {
BUG: Unable to handle kernel data access on read at 0x0000e268
Faulting instruction address: 0xc000ec34
Oops: Kernel access of bad area, sig: 11 [#1]
BE PAGE_SIZE=16K PREEMPT CMPC885
Modules linked in:
CPU: 0 PID: 429 Comm: cat Not tainted 5.6.0-rc1-s3k-dev-00824-g84195dc6c58a #3267
NIP: c000ec34 LR: c000ecd8 CTR: c019cab8
REGS: ca4d3b58 TRAP: 0300 Not tainted (5.6.0-rc1-s3k-dev-00824-g84195dc6c58a)
MSR: 00001032 <ME,IR,DR,RI> CR: 2a4d3c52 XER: 00000000
DAR: 0000e268 DSISR: c0000000
GPR00: c000b09c ca4d3c10 c66d0620 00000000 ca4d3c60 00000000 00009032 00000000
GPR08: 00020000 00000000 c087de44 c000afe0 c66d0ad0 100d3dd6 fffffff3 00000000
GPR16: 00000000 00000041 00000000 ca4d3d70 00000000 00000000 0000416d 00000000
GPR24: 00000004 c53b6128 00000000 0000e268 00000000 c07c0000 c07bb6fc ca4d3c60
NIP [c000ec34] kprobe_handler+0x128/0x290
LR [c000ecd8] kprobe_handler+0x1cc/0x290
Call Trace:
[ca4d3c30] [c000b09c] program_check_exception+0xbc/0x6fc
[ca4d3c50] [c000e43c] ret_from_except_full+0x0/0x4
--- interrupt: 700 at 0xe268
Instruction dump:
913e0008 81220000 38600001 3929ffff 91220000 80010024 bb410008 7c0803a6
38210020 4e800020 38600000 4e800020 <813b0000> 6d2a7fe0 2f8a0008 419e0154
---[ end trace 5b9152d4cdadd06d ]---
kprobe is not prepared to handle events in real mode and functions
running in real mode should have been blacklisted, so kprobe_handler()
can safely bail out telling 'this trap is not mine' for any trap that
happened while in real-mode.
If the trap happened with MSR_IR or MSR_DR cleared, return 0
immediately.
Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
Fixes: 6cc89bad60a6 ("powerpc/kprobes: Invoke handlers directly")
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/424331e2006e7291a1bfe40e7f3fa58825f565e1.1582054578.git.christophe.leroy@c-s.fr
|
|
When building ppc64 defconfig, Clang errors (trimmed for brevity):
arch/powerpc/platforms/maple/setup.c:365:1: error: attribute declaration
must precede definition [-Werror,-Wignored-attributes]
machine_device_initcall(maple, maple_cpc925_edac_setup);
^
machine_device_initcall expands to __define_machine_initcall, which in
turn has the macro machine_is used in it, which declares mach_##name
with an __attribute__((weak)). define_machine actually defines
mach_##name, which in this file happens before the declaration, hence
the warning.
To fix this, move define_machine after machine_device_initcall so that
the declaration occurs before the definition, which matches how
machine_device_initcall and define_machine work throughout
arch/powerpc.
While we're here, remove some spaces before tabs.
Fixes: 8f101a051ef0 ("edac: cpc925 MC platform device setup")
Reported-by: Nick Desaulniers <ndesaulniers@google.com>
Suggested-by: Ilie Halip <ilie.halip@gmail.com>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200323222729.15365-1-natechancellor@gmail.com
|
|
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200320152436.1468651-1-npiggin@gmail.com
|
|
With the EEH early probe now being pseries specific there's no need for
eeh_ops->probe() to take a pci_dn. Instead, we can make it take a pci_dev
and use the probe function to map a pci_dev to an eeh_dev. This allows
the platform to implement it's own method for finding (or creating) an
eeh_dev for a given pci_dev which also removes a use of pci_dn in
generic EEH code.
This patch also renames eeh_device_add_late() to eeh_device_probe(). This
better reflects what it does does and removes the last vestiges of the
early/late EEH probe split.
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200306073904.4737-6-oohall@gmail.com
|
|
The eeh_ops->probe() function is called from two different contexts:
1. On pseries, where we set EEH_PROBE_MODE_DEVTREE, it's called in
eeh_add_device_early() which is supposed to run before we create
a pci_dev.
2. On PowerNV, where we set EEH_PROBE_MODE_DEV, it's called in
eeh_device_add_late() which is supposed to run *after* the
pci_dev is created.
The "early" probe is required because PAPR requires that we perform an RTAS
call to enable EEH support on a device before we start interacting with it
via config space or MMIO. This requirement doesn't exist on PowerNV and
shoehorning two completely separate initialisation paths into a common
interface just results in a convoluted code everywhere.
Additionally the early probe requires the probe function to take an pci_dn
rather than a pci_dev argument. We'd like to make pci_dn a pseries specific
data structure since there's no real requirement for them on PowerNV. To
help both goals move the early probe into the pseries containment zone
so the platform depedence is more explicit.
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200306073904.4737-5-oohall@gmail.com
|
|
This check for a missing PHB has existing in various forms since the
initial PPC64 port was upstreamed in 2002. The idea seems to be that we
need to guard against creating pci-specific data structures for the non-pci
children of a PCI device tree node (e.g. USB devices). However, we only
create pci_dn structures for DT nodes that correspond to PCI devices so
there's not much point in doing this check in the eeh_probe path.
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200306073904.4737-4-oohall@gmail.com
|
|
The pci hotplug helper (pci_hp_add_devices()) calls
eeh_add_device_tree_early() to scan the device-tree for new PCI devices and
do the early EEH probe before the device is scanned. This early probe is a
no-op in a lot of cases because:
a) The early init is only required to satisfy a PAPR requirement that EEH
be configured before we start doing config accesses. On PowerNV it is
a no-op.
b) It's a no-op for devices that have already had their eeh_dev
initialised.
There are four callers of pci_hp_add_devices():
1. arch/powerpc/kernel/eeh_driver.c
Here the hotplug helper is called when re-scanning pci_devs that
were removed during an EEH recovery pass. The EEH stat for each
removed device (the eeh_dev) is retained across a recovery pass
so the early init is a no-op in this case.
2. drivers/pci/hotplug/pnv_php.c
This is also a no-op since the PowerNV hotplug driver is, suprisingly,
PowerNV specific.
3. drivers/pci/hotplug/rpaphp_core.c
4. drivers/pci/hotplug/rpaphp_pci.c
In these two cases new devices have been hotplugged and FW has
provided new DT nodes for each. These are the only two cases where
the EEH we might have new PCI device nodes in the DT so these are
the only two cases where the early EEH probe needs to be done.
We can move the calls to eeh_add_device_tree_early() to the locations where
it's needed and remove it from the generic path. This is preparation for
making the early EEH probe pseries specific.
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200306073904.4737-3-oohall@gmail.com
|
|
On pseries and PowerNV pcibios_bus_add_device() calls eeh_add_device_late()
so there's no need to do a separate tree traversal to bind the eeh_dev and
pci_dev together setting up the PHB at boot. As a result we can remove
eeh_add_device_tree_late().
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200306073904.4737-2-oohall@gmail.com
|
|
Move creating the EEH specific sysfs files into eeh_add_device_late()
rather than being open-coded all over the place. Calling the function is
generally done immediately after calling eeh_add_device_late() anyway. This
is also a correctness fix since currently the sysfs files will be added
even if the EEH probe happens to fail.
Similarly, on pseries we currently add the sysfs files before calling
eeh_add_device_late(). This is flat-out broken since the sysfs files
require the pci_dev->dev.archdata.edev pointer to be set, and that is done
in eeh_add_device_late().
Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200306073904.4737-1-oohall@gmail.com
|
|
The previous commit reduced the amount of code that is run before we
setup a paca. However there are still a few remaining functions that
run with no paca, or worse, with an arbitrary value in r13 that will
be used as a paca pointer.
In particular the stack protector canary is stored in the paca, so if
stack protector is activated for any of these functions we will read
the stack canary from wherever r13 points. If r13 happens to point
outside of memory we will get a machine check / checkstop.
For example if we modify initialise_paca() to trigger stack
protection, and then boot in the mambo simulator with r13 poisoned in
skiboot before calling the kernel:
DEBUG: 19952232: (19952232): INSTRUCTION: PC=0xC0000000191FC1E8: [0x3C4C006D]: addis r2,r12,0x6D [fetch]
DEBUG: 19952236: (19952236): INSTRUCTION: PC=0xC00000001807EAD8: [0x7D8802A6]: mflr r12 [fetch]
FATAL ERROR: 19952276: (19952276): Check Stop for 0:0: Machine Check with ME bit of MSR off
DEBUG: 19952276: (19952276): INSTRUCTION: PC=0xC0000000191FCA7C: [0xE90D0CF8]: ld r8,0xCF8(r13) [Instruction Failed]
INFO: 19952276: (19952277): ** Execution stopped: Mambo Error, Machine Check Stop, **
systemsim % bt
pc: 0xC0000000191FCA7C initialise_paca+0x54
lr: 0xC0000000191FC22C early_setup+0x44
stack:0x00000000198CBED0 0x0 +0x0
stack:0x00000000198CBF00 0xC0000000191FC22C early_setup+0x44
stack:0x00000000198CBF90 0x1801C968 +0x1801C968
So annotate the relevant functions to ensure stack protection is never
enabled for them.
Fixes: 06ec27aea9fc ("powerpc/64: add stack protector support")
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200320032116.1024773-2-mpe@ellerman.id.au
|
|
Currently we set up the paca after parsing the device tree for CPU
features. Prior to that, r13 contains random data, which means there
is random data in r13 while we're running the generic dt parsing code.
This random data varies depending on whether we boot through a vmlinux
or a zImage: for the vmlinux case it's usually around zero, but for
zImages we see random values like 912a72603d420015.
This is poor practice, and can also lead to difficult-to-debug
crashes. For example, when kcov is enabled, the kcov instrumentation
attempts to read preempt_count out of the current task, which goes via
the paca. This then crashes in the zImage case.
Similarly stack protector can cause crashes if r13 is bogus, by
reading from the stack canary in the paca.
To resolve this:
- move the paca setup to before the CPU feature parsing.
- because we no longer have access to CPU feature flags in paca
setup, change the HV feature test in the paca setup path to consider
the actual value of the MSR rather than the CPU feature.
Translations get switched on once we leave early_setup, so I think
we'd already catch any other cases where the paca or task aren't set
up.
Boot tested on a P9 guest and host.
Fixes: fb0b0a73b223 ("powerpc: Enable kcov")
Fixes: 06ec27aea9fc ("powerpc/64: add stack protector support")
Cc: stable@vger.kernel.org # v4.20+
Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Daniel Axtens <dja@axtens.net>
[mpe: Reword comments & change log a bit to mention stack protector]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200320032116.1024773-1-mpe@ellerman.id.au
|
|
entries
H_PAGE_THP_HUGE is used to differentiate between a THP hugepage and
hugetlb hugepage entries. The difference is WRT how we handle hash
fault on these address. THP address enables MPSS in segments. We want
to manage devmap hugepage entries similar to THP pt entries. Hence use
H_PAGE_THP_HUGE for devmap huge PTE entries.
With current code while handling hash PTE fault, we do set is_thp =
true when finding devmap PTE huge PTE entries.
Current code also does the below sequence we setting up huge devmap
entries.
entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
if (pfn_t_devmap(pfn))
entry = pmd_mkdevmap(entry);
In that case we would find both H_PAGE_THP_HUGE and PAGE_DEVMAP set
for huge devmap PTE entries. This results in false positive error like
below.
kernel BUG at /home/kvaneesh/src/linux/mm/memory.c:4321!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 56 PID: 67996 Comm: t_mmap_dio Not tainted 5.6.0-rc4-59640-g371c804dedbc #128
....
NIP [c00000000044c9e4] __follow_pte_pmd+0x264/0x900
LR [c0000000005d45f8] dax_writeback_one+0x1a8/0x740
Call Trace:
str_spec.74809+0x22ffb4/0x2d116c (unreliable)
dax_writeback_one+0x1a8/0x740
dax_writeback_mapping_range+0x26c/0x700
ext4_dax_writepages+0x150/0x5a0
do_writepages+0x68/0x180
__filemap_fdatawrite_range+0x138/0x180
file_write_and_wait_range+0xa4/0x110
ext4_sync_file+0x370/0x6e0
vfs_fsync_range+0x70/0xf0
sys_msync+0x220/0x2e0
system_call+0x5c/0x68
This is because our pmd_trans_huge check doesn't exclude _PAGE_DEVMAP.
To make this all consistent, update pmd_mkdevmap to set
H_PAGE_THP_HUGE and pmd_trans_huge check now excludes _PAGE_DEVMAP
correctly.
Fixes: ebd31197931d ("powerpc/mm: Add devmap support for ppc64")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200313094842.351830-1-aneesh.kumar@linux.ibm.com
|
|
Reorder Linux PTE bits to (almost) match Hash PTE bits.
RW Kernel : PP = 00
RO Kernel : PP = 00
RW User : PP = 01
RO User : PP = 11
So naturally, we should have
_PAGE_USER = 0x001
_PAGE_RW = 0x002
Today 0x001 and 0x002 and _PAGE_PRESENT and _PAGE_HASHPTE which
both are software only bits.
Switch _PAGE_USER and _PAGE_PRESET
Switch _PAGE_RW and _PAGE_HASHPTE
This allows to remove a few insns.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/c4d6c18a7f8d9d3b899bc492f55fbc40ef38896a.1583861325.git.christophe.leroy@c-s.fr
|
|
At the moment kasan_remap_early_shadow_ro() does nothing, because
k_end is 0 and k_cur < 0 is always true.
Change the test to k_cur != k_end, as done in
kasan_init_shadow_page_tables()
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Fixes: cbd18991e24f ("powerpc/mm: Fix an Oops in kasan_mmu_init()")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/4e7b56865e01569058914c991143f5961b5d4719.1583507333.git.christophe.leroy@c-s.fr
|
|
At the time being we have something like
if (something) {
p = get();
if (p) {
if (something_wrong)
goto out;
...
return;
} else if (a != b) {
if (some_error)
goto out;
...
}
goto out;
}
p = get();
if (!p) {
if (a != b) {
if (some_error)
goto out;
...
}
goto out;
}
This is similar to
p = get();
if (!p) {
if (a != b) {
if (some_error)
goto out;
...
}
goto out;
}
if (something) {
if (something_wrong)
goto out;
...
return;
}
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: Reflow the comment that was moved]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/07a17425743600460ce35fa9432d42487a825583.1582099499.git.christophe.leroy@c-s.fr
|
|
We currently have two section mismatch warnings:
The function __boot_from_prom() references
the function __init prom_init().
The function start_here_common() references
the function __init start_kernel().
The warnings are correct, we do have branches from non-init code into
init code, which is freed after boot. But we don't expect to ever
execute any of that early boot code after boot, if we did that would
be a bug. In particular calling into OF after boot would be fatal
because OF is no longer resident.
So for now fix the warnings by marking the relevant functions as
__REF, which puts them in the ".ref.text" section.
This causes some reordering of the functions in the final link:
@@ -217,10 +217,9 @@
c00000000000b088 t generic_secondary_common_init
c00000000000b124 t __mmu_off
c00000000000b14c t __start_initialization_multiplatform
-c00000000000b1ac t __boot_from_prom
-c00000000000b1ec t __after_prom_start
-c00000000000b260 t p_end
-c00000000000b27c T copy_and_flush
+c00000000000b1ac t __after_prom_start
+c00000000000b220 t p_end
+c00000000000b23c T copy_and_flush
c00000000000b300 T __secondary_start
c00000000000b300 t copy_to_here
c00000000000b344 t start_secondary_prolog
@@ -228,8 +227,9 @@
c00000000000b36c t enable_64b_mode
c00000000000b388 T relative_toc
c00000000000b3a8 t p_toc
-c00000000000b3b0 t start_here_common
-c00000000000b3d0 t start_here_multiplatform
+c00000000000b3b0 t __boot_from_prom
+c00000000000b3f0 t start_here_multiplatform
+c00000000000b480 t start_here_common
c00000000000b880 T system_call_common
c00000000000b974 t system_call
c00000000000b9dc t system_call_exit
In particular __boot_from_prom moves after copy_to_here, which means
it's not copied to zero in the first stage of copy of the kernel to
zero.
But that's OK, because we only call __boot_from_prom before we do the
copy, so it makes no difference when it's copied. The call sequence
is:
__start
-> __start_initialization_multiplatform
-> __boot_from_prom
-> __start
-> __start_initialization_multiplatform
-> __after_prom_start
-> copy_and_flush
-> copy_and_flush (relocated to 0)
-> start_here_multiplatform
-> early_setup
Reported-by: Mauricio Faria de Oliveira <mauricfo@linux.ibm.com>
Reported-by: Roman Bolshakov <r.bolshakov@yadro.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200225031328.14676-1-mpe@ellerman.id.au
|
|
In xmon we have two variables that are used by the dump commands.
There's ndump which is the number of bytes to dump using 'd', and
nidump which is the number of instructions to dump using 'di'.
ndump starts as 64 and nidump starts as 16, but both can be set by the
user.
It's fairly common to be pasting addresses into xmon when trying to
debug something, and if you inadvertently double paste an address like
so:
0:mon> di c000000002101f6c c000000002101f6c
The second value is interpreted as the number of instructions to dump.
Luckily it doesn't dump 13 quintrillion instructions, the value is
limited to MAX_DUMP (128K). But as each instruction is dumped on a
single line, that's still a lot of output. If you're on a slow console
that can take multiple minutes to print. If you were "just popping in
and out of xmon quickly before the RCU/hardlockup detector fires" you
are now having a bad day.
Things are not as bad with 'd' because we print 16 bytes per line, so
it's fewer lines. But it's still quite a lot.
So shrink the maximum for 'd' to 64K (one page), which is 4096 lines.
For 'di' add a new limit which is the above / 4 - because instructions
are 4 bytes, meaning again we can dump one page.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200219110007.31195-1-mpe@ellerman.id.au
|
|
The "os-term" RTAS calls has one argument with a message address of OS
termination cause. rtas_os_term() already passes it but the recently
added prom_init's version of that missed it; it also does not fill
args correctly.
This passes the message address and initializes the number of arguments.
Fixes: 6a9c930bd775 ("powerpc/prom_init: Add the ESM call to prom_init")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200312074404.87293-1-aik@ozlabs.ru
|
|
request_irq() is preferred over setup_irq(). Invocations of setup_irq()
occur after memory allocators are ready.
Per tglx[1], setup_irq() existed in olden days when allocators were not
ready by the time early interrupts were initialized.
Hence replace setup_irq() by request_irq().
[1] https://lkml.kernel.org/r/alpine.DEB.2.20.1710191609480.1971@nanos
Signed-off-by: afzal mohammed <afzal.mohd.ma@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200312064256.18735-1-afzal.mohd.ma@gmail.com
|
|
Convert the various uses of fallthrough comments to fallthrough;
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/03073a9a269010ca439e9e658629c44602b0cc9f.1583896348.git.joe@perches.com
|
|
ld instruction should have 14 bit immediate field (DS) concatenated
with 0b00 on the right, encode it accordingly. Introduce macro
`IMM_DS()` to encode DS form instructions with 14 bit immediate field.
Fixes: 4ceae137bdab ("powerpc: emulate_step() tests for load/store instructions")
Reviewed-by: Sandipan Das <sandipan@linux.ibm.com>
Signed-off-by: Balamuruhan S <bala24@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200311102405.392263-1-bala24@linux.ibm.com
|
|
The expectation is that when calling of_read_drc_info_cell()
repeatedly to parse multiple drc-info records that the in/out curval
parameter points at the start of the next record on return. However,
the current behavior has curval still pointing at the final value of
the record just parsed. The result of which is that if the
ibm,drc-info property contains multiple properties the parsed value
of the drc_type for any record after the first has the power_domain
value of the previous record appended to the type string.
eg: observed the following 0xffffffff prepended to PHB
drc-info: type: \xff\xff\xff\xffPHB, prefix: PHB , index_start: 0x20000001
drc-info: suffix_start: 1, sequential_elems: 3072, sequential_inc: 1
drc-info: power-domain: 0xffffffff, last_index: 0x20000c00
In practice PHBs are the only type of connector in the ibm,drc-info
property that has multiple records. So, it breaks PHB hotplug, but by
chance not PCI, CPU, slot, or memory because they happen to only ever
be a single record.
Fix by incrementing curval past the power_domain value to point at
drc_type string of next record.
Fixes: e83636ac3334 ("pseries/drc-info: Search DRC properties for CPU indexes")
Signed-off-by: Tyrel Datwyler <tyreld@linux.ibm.com>
Acked-by: Nathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200307024547.5748-1-tyreld@linux.ibm.com
|
|
When the call to UV_REGISTER_MEM_SLOT is failing, for instance because
there is not enough free secured memory, the Hypervisor (HV) has to call
UV_RETURN to report the error to the Ultravisor (UV). Then the UV will call
H_SVM_INIT_ABORT to abort the securing phase and go back to the calling VM.
If the kvm->arch.secure_guest is not set, in the return path rfid is called
but there is no valid context to get back to the SVM since the Hcall has
been routed by the Ultravisor.
Move the setting of kvm->arch.secure_guest earlier in
kvmppc_h_svm_init_start() so in the return path, UV_RETURN will be called
instead of rfid.
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Tested-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
The Hcall named H_SVM_* are reserved to the Ultravisor. However, nothing
prevent a malicious VM or SVM to call them. This could lead to weird result
and should be filtered out.
Checking the Secure bit of the calling MSR ensure that the call is coming
from either the Ultravisor or a SVM. But any system call made from a SVM
are going through the Ultravisor, and the Ultravisor should filter out
these malicious call. This way, only the Ultravisor is able to make such a
Hcall.
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Ram Pai <linuxram@us.ibnm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
kvmppc_uvmem_init checks for Ultravisor support and returns early if
it is not present. Calling kvmppc_uvmem_free at module exit will cause
an Oops:
$ modprobe -r kvm-hv
Oops: Kernel access of bad area, sig: 11 [#1]
<snip>
NIP: c000000000789e90 LR: c000000000789e8c CTR: c000000000401030
REGS: c000003fa7bab9a0 TRAP: 0300 Not tainted (5.6.0-rc6-00033-g6c90b86a745a-dirty)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002282 XER: 00000000
CFAR: c000000000dae880 DAR: 0000000000000008 DSISR: 40000000 IRQMASK: 1
GPR00: c000000000789e8c c000003fa7babc30 c0000000016fe500 0000000000000000
GPR04: 0000000000000000 0000000000000006 0000000000000000 c000003faf205c00
GPR08: 0000000000000000 0000000000000001 000000008000002d c00800000ddde140
GPR12: c000000000401030 c000003ffffd9080 0000000000000001 0000000000000000
GPR16: 0000000000000000 0000000000000000 000000013aad0074 000000013aaac978
GPR20: 000000013aad0070 0000000000000000 00007fffd1b37158 0000000000000000
GPR24: 000000014fef0d58 0000000000000000 000000014fef0cf0 0000000000000001
GPR28: 0000000000000000 0000000000000000 c0000000018b2a60 0000000000000000
NIP [c000000000789e90] percpu_ref_kill_and_confirm+0x40/0x170
LR [c000000000789e8c] percpu_ref_kill_and_confirm+0x3c/0x170
Call Trace:
[c000003fa7babc30] [c000003faf2064d4] 0xc000003faf2064d4 (unreliable)
[c000003fa7babcb0] [c000000000400e8c] dev_pagemap_kill+0x6c/0x80
[c000003fa7babcd0] [c000000000401064] memunmap_pages+0x34/0x2f0
[c000003fa7babd50] [c00800000dddd548] kvmppc_uvmem_free+0x30/0x80 [kvm_hv]
[c000003fa7babd80] [c00800000ddcef18] kvmppc_book3s_exit_hv+0x20/0x78 [kvm_hv]
[c000003fa7babda0] [c0000000002084d0] sys_delete_module+0x1d0/0x2c0
[c000003fa7babe20] [c00000000000b9d0] system_call+0x5c/0x68
Instruction dump:
3fc2001b fb81ffe0 fba1ffe8 fbe1fff8 7c7f1b78 7c9c2378 3bde4560 7fc3f378
f8010010 f821ff81 486249a1 60000000 <e93f0008> 7c7d1b78 712a0002 40820084
---[ end trace 5774ef4dc2c98279 ]---
So this patch checks if kvmppc_uvmem_init actually allocated anything
before running kvmppc_uvmem_free.
Fixes: ca9f4942670c ("KVM: PPC: Book3S HV: Support for running secure guests")
Cc: stable@vger.kernel.org # v5.5+
Reported-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Tested-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
The PS3 notification interrupt and kthread use a hacked up completion to
communicate. Since we're wanting to change the completion implementation and
this is abuse anyway, replace it with a simple rcuwait since there is only ever
the one waiter.
AFAICT the kthread uses TASK_INTERRUPTIBLE to not increase loadavg, kthreads
cannot receive signals by default and this one doesn't look different. Use
TASK_IDLE instead.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200321113241.930037873@linutronix.de
|
|
With PR KVM, shutting down a VM causes the host kernel to crash:
[ 314.219284] BUG: Unable to handle kernel data access on read at 0xc00800000176c638
[ 314.219299] Faulting instruction address: 0xc008000000d4ddb0
cpu 0x0: Vector: 300 (Data Access) at [c00000036da077a0]
pc: c008000000d4ddb0: kvmppc_mmu_pte_flush_all+0x68/0xd0 [kvm_pr]
lr: c008000000d4dd94: kvmppc_mmu_pte_flush_all+0x4c/0xd0 [kvm_pr]
sp: c00000036da07a30
msr: 900000010280b033
dar: c00800000176c638
dsisr: 40000000
current = 0xc00000036d4c0000
paca = 0xc000000001a00000 irqmask: 0x03 irq_happened: 0x01
pid = 1992, comm = qemu-system-ppc
Linux version 5.6.0-master-gku+ (greg@palmb) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #17 SMP Wed Mar 18 13:49:29 CET 2020
enter ? for help
[c00000036da07ab0] c008000000d4fbe0 kvmppc_mmu_destroy_pr+0x28/0x60 [kvm_pr]
[c00000036da07ae0] c0080000009eab8c kvmppc_mmu_destroy+0x34/0x50 [kvm]
[c00000036da07b00] c0080000009e50c0 kvm_arch_vcpu_destroy+0x108/0x140 [kvm]
[c00000036da07b30] c0080000009d1b50 kvm_vcpu_destroy+0x28/0x80 [kvm]
[c00000036da07b60] c0080000009e4434 kvm_arch_destroy_vm+0xbc/0x190 [kvm]
[c00000036da07ba0] c0080000009d9c2c kvm_put_kvm+0x1d4/0x3f0 [kvm]
[c00000036da07c00] c0080000009da760 kvm_vm_release+0x38/0x60 [kvm]
[c00000036da07c30] c000000000420be0 __fput+0xe0/0x310
[c00000036da07c90] c0000000001747a0 task_work_run+0x150/0x1c0
[c00000036da07cf0] c00000000014896c do_exit+0x44c/0xd00
[c00000036da07dc0] c0000000001492f4 do_group_exit+0x64/0xd0
[c00000036da07e00] c000000000149384 sys_exit_group+0x24/0x30
[c00000036da07e20] c00000000000b9d0 system_call+0x5c/0x68
This is caused by a use-after-free in kvmppc_mmu_pte_flush_all()
which dereferences vcpu->arch.book3s which was previously freed by
kvmppc_core_vcpu_free_pr(). This happens because kvmppc_mmu_destroy()
is called after kvmppc_core_vcpu_free() since commit ff030fdf5573
("KVM: PPC: Move kvm_vcpu_init() invocation to common code").
The kvmppc_mmu_destroy() helper calls one of the following depending
on the KVM backend:
- kvmppc_mmu_destroy_hv() which does nothing (Book3s HV)
- kvmppc_mmu_destroy_pr() which undoes the effects of
kvmppc_mmu_init() (Book3s PR 32-bit)
- kvmppc_mmu_destroy_pr() which undoes the effects of
kvmppc_mmu_init() (Book3s PR 64-bit)
- kvmppc_mmu_destroy_e500() which does nothing (BookE e500/e500mc)
It turns out that this is only relevant to PR KVM actually. And both
32 and 64 backends need vcpu->arch.book3s to be valid when calling
kvmppc_mmu_destroy_pr(). So instead of calling kvmppc_mmu_destroy()
from kvm_arch_vcpu_destroy(), call kvmppc_mmu_destroy_pr() at the
beginning of kvmppc_core_vcpu_free_pr(). This is consistent with
kvmppc_mmu_init() being the last call in kvmppc_core_vcpu_create_pr().
For the same reason, if kvmppc_core_vcpu_create_pr() returns an
error then this means that kvmppc_mmu_init() was either not called
or failed, in which case kvmppc_mmu_destroy() should not be called.
Drop the line in the error path of kvm_arch_vcpu_create().
Fixes: ff030fdf5573 ("KVM: PPC: Move kvm_vcpu_init() invocation to common code")
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/158455341029.178873.15248663726399374882.stgit@bahia.lan
|
|
Currently you can enable PPC_KUAP_DEBUG when PPC_KUAP is disabled,
even though the former has not effect without the latter.
Fix it so that PPC_KUAP_DEBUG can only be enabled when PPC_KUAP is
enabled, not when the platform could support KUAP (PPC_HAVE_KUAP).
Fixes: 890274c2dc4c ("powerpc/64s: Implement KUAP for Radix MMU")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20200301111738.22497-1-mpe@ellerman.id.au
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Simplify gen_btf logic to make it work with llvm-objcopy. The existing
'file format' and 'architecture' parsing logic is brittle and does not
work with llvm-objcopy/llvm-objdump.
'file format' output of llvm-objdump>=11 will match GNU objdump, but
'architecture' (bfdarch) may not.
.BTF in .tmp_vmlinux.btf is non-SHF_ALLOC. Add the SHF_ALLOC flag
because it is part of vmlinux image used for introspection. C code
can reference the section via linker script defined __start_BTF and
__stop_BTF. This fixes a small problem that previous .BTF had the
SHF_WRITE flag (objcopy -I binary -O elf* synthesized .data).
Additionally, `objcopy -I binary` synthesized symbols
_binary__btf_vmlinux_bin_start and _binary__btf_vmlinux_bin_stop (not
used elsewhere) are replaced with more commonplace __start_BTF and
__stop_BTF.
Add 2>/dev/null because GNU objcopy (but not llvm-objcopy) warns
"empty loadable segment detected at vaddr=0xffffffff81000000, is this intentional?"
We use a dd command to change the e_type field in the ELF header from
ET_EXEC to ET_REL so that lld will accept .btf.vmlinux.bin.o. Accepting
ET_EXEC as an input file is an extremely rare GNU ld feature that lld
does not intend to support, because this is error-prone.
The output section description .BTF in include/asm-generic/vmlinux.lds.h
avoids potential subtle orphan section placement issues and suppresses
--orphan-handling=warn warnings.
Fixes: df786c9b9476 ("bpf: Force .BTF section start to zero when dumping from vmlinux")
Fixes: cb0cc635c7a9 ("powerpc: Include .BTF section")
Reported-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Fangrui Song <maskray@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Stanislav Fomichev <sdf@google.com>
Tested-by: Andrii Nakryiko <andriin@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://github.com/ClangBuiltLinux/linux/issues/871
Link: https://lore.kernel.org/bpf/20200318222746.173648-1-maskray@google.com
|
|
These are only used by HV KVM and BookE, and in both cases they are
nops.
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
This is only relevant to PR KVM. Make it obvious by moving the
function declaration to the Book3s header and rename it with
a _pr suffix.
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
With PR KVM, shutting down a VM causes the host kernel to crash:
[ 314.219284] BUG: Unable to handle kernel data access on read at 0xc00800000176c638
[ 314.219299] Faulting instruction address: 0xc008000000d4ddb0
cpu 0x0: Vector: 300 (Data Access) at [c00000036da077a0]
pc: c008000000d4ddb0: kvmppc_mmu_pte_flush_all+0x68/0xd0 [kvm_pr]
lr: c008000000d4dd94: kvmppc_mmu_pte_flush_all+0x4c/0xd0 [kvm_pr]
sp: c00000036da07a30
msr: 900000010280b033
dar: c00800000176c638
dsisr: 40000000
current = 0xc00000036d4c0000
paca = 0xc000000001a00000 irqmask: 0x03 irq_happened: 0x01
pid = 1992, comm = qemu-system-ppc
Linux version 5.6.0-master-gku+ (greg@palmb) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #17 SMP Wed Mar 18 13:49:29 CET 2020
enter ? for help
[c00000036da07ab0] c008000000d4fbe0 kvmppc_mmu_destroy_pr+0x28/0x60 [kvm_pr]
[c00000036da07ae0] c0080000009eab8c kvmppc_mmu_destroy+0x34/0x50 [kvm]
[c00000036da07b00] c0080000009e50c0 kvm_arch_vcpu_destroy+0x108/0x140 [kvm]
[c00000036da07b30] c0080000009d1b50 kvm_vcpu_destroy+0x28/0x80 [kvm]
[c00000036da07b60] c0080000009e4434 kvm_arch_destroy_vm+0xbc/0x190 [kvm]
[c00000036da07ba0] c0080000009d9c2c kvm_put_kvm+0x1d4/0x3f0 [kvm]
[c00000036da07c00] c0080000009da760 kvm_vm_release+0x38/0x60 [kvm]
[c00000036da07c30] c000000000420be0 __fput+0xe0/0x310
[c00000036da07c90] c0000000001747a0 task_work_run+0x150/0x1c0
[c00000036da07cf0] c00000000014896c do_exit+0x44c/0xd00
[c00000036da07dc0] c0000000001492f4 do_group_exit+0x64/0xd0
[c00000036da07e00] c000000000149384 sys_exit_group+0x24/0x30
[c00000036da07e20] c00000000000b9d0 system_call+0x5c/0x68
This is caused by a use-after-free in kvmppc_mmu_pte_flush_all()
which dereferences vcpu->arch.book3s which was previously freed by
kvmppc_core_vcpu_free_pr(). This happens because kvmppc_mmu_destroy()
is called after kvmppc_core_vcpu_free() since commit ff030fdf5573
("KVM: PPC: Move kvm_vcpu_init() invocation to common code").
The kvmppc_mmu_destroy() helper calls one of the following depending
on the KVM backend:
- kvmppc_mmu_destroy_hv() which does nothing (Book3s HV)
- kvmppc_mmu_destroy_pr() which undoes the effects of
kvmppc_mmu_init() (Book3s PR 32-bit)
- kvmppc_mmu_destroy_pr() which undoes the effects of
kvmppc_mmu_init() (Book3s PR 64-bit)
- kvmppc_mmu_destroy_e500() which does nothing (BookE e500/e500mc)
It turns out that this is only relevant to PR KVM actually. And both
32 and 64 backends need vcpu->arch.book3s to be valid when calling
kvmppc_mmu_destroy_pr(). So instead of calling kvmppc_mmu_destroy()
from kvm_arch_vcpu_destroy(), call kvmppc_mmu_destroy_pr() at the
beginning of kvmppc_core_vcpu_free_pr(). This is consistent with
kvmppc_mmu_init() being the last call in kvmppc_core_vcpu_create_pr().
For the same reason, if kvmppc_core_vcpu_create_pr() returns an
error then this means that kvmppc_mmu_init() was either not called
or failed, in which case kvmppc_mmu_destroy() should not be called.
Drop the line in the error path of kvm_arch_vcpu_create().
Fixes: ff030fdf5573 ("KVM: PPC: Move kvm_vcpu_init() invocation to common code")
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
Convert the various uses of fallthrough comments to fallthrough;
Done via script
Link: https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe.com/
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
The h_cede_tm kvm-unit-test currently fails when run inside an L1 guest
via the guest/nested hypervisor.
./run-tests.sh -v
...
TESTNAME=h_cede_tm TIMEOUT=90s ACCEL= ./powerpc/run powerpc/tm.elf -smp 2,threads=2 -machine cap-htm=on -append "h_cede_tm"
FAIL h_cede_tm (2 tests, 1 unexpected failures)
While the test relates to transactional memory instructions, the actual
failure is due to the return code of the H_CEDE hypercall, which is
reported as 224 instead of 0. This happens even when no TM instructions
are issued.
224 is the value placed in r3 to execute a hypercall for H_CEDE, and r3
is where the caller expects the return code to be placed upon return.
In the case of guest running under a nested hypervisor, issuing H_CEDE
causes a return from H_ENTER_NESTED. In this case H_CEDE is
specially-handled immediately rather than later in
kvmppc_pseries_do_hcall() as with most other hcalls, but we forget to
set the return code for the caller, hence why kvm-unit-test sees the
224 return code and reports an error.
Guest kernels generally don't check the return value of H_CEDE, so
that likely explains why this hasn't caused issues outside of
kvm-unit-tests so far.
Fix this by setting r3 to 0 after we finish processing the H_CEDE.
RHBZ: 1778556
Fixes: 4bad77799fed ("KVM: PPC: Book3S HV: Handle hypercalls correctly when nested")
Cc: linuxppc-dev@ozlabs.org
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
the valid ones
On P9 DD2.2 due to a CPU defect some TM instructions need to be emulated by
KVM. This is handled at first by the hardware raising a softpatch interrupt
when certain TM instructions that need KVM assistance are executed in the
guest. Althought some TM instructions per Power ISA are invalid forms they
can raise a softpatch interrupt too. For instance, 'tresume.' instruction
as defined in the ISA must have bit 31 set (1), but an instruction that
matches 'tresume.' PO and XO opcode fields but has bit 31 not set (0), like
0x7cfe9ddc, also raises a softpatch interrupt. Similarly for 'treclaim.'
and 'trechkpt.' instructions with bit 31 = 0, i.e. 0x7c00075c and
0x7c0007dc, respectively. Hence, if a code like the following is executed
in the guest it will raise a softpatch interrupt just like a 'tresume.'
when the TM facility is enabled ('tabort. 0' in the example is used only
to enable the TM facility):
int main() { asm("tabort. 0; .long 0x7cfe9ddc;"); }
Currently in such a case KVM throws a complete trace like:
[345523.705984] WARNING: CPU: 24 PID: 64413 at arch/powerpc/kvm/book3s_hv_tm.c:211 kvmhv_p9_tm_emulation+0x68/0x620 [kvm_hv]
[345523.705985] Modules linked in: kvm_hv(E) xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat
iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ebtable_filter ebtables ip6table_filter
ip6_tables iptable_filter bridge stp llc sch_fq_codel ipmi_powernv at24 vmx_crypto ipmi_devintf ipmi_msghandler
ibmpowernv uio_pdrv_genirq kvm opal_prd uio leds_powernv ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp
libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456
async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear tg3
crct10dif_vpmsum crc32c_vpmsum ipr [last unloaded: kvm_hv]
[345523.706030] CPU: 24 PID: 64413 Comm: CPU 0/KVM Tainted: G W E 5.5.0+ #1
[345523.706031] NIP: c0080000072cb9c0 LR: c0080000072b5e80 CTR: c0080000085c7850
[345523.706034] REGS: c000000399467680 TRAP: 0700 Tainted: G W E (5.5.0+)
[345523.706034] MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 24022428 XER: 00000000
[345523.706042] CFAR: c0080000072b5e7c IRQMASK: 0
GPR00: c0080000072b5e80 c000000399467910 c0080000072db500 c000000375ccc720
GPR04: c000000375ccc720 00000003fbec0000 0000a10395dda5a6 0000000000000000
GPR08: 000000007cfe9ddc 7cfe9ddc000005dc 7cfe9ddc7c0005dc c0080000072cd530
GPR12: c0080000085c7850 c0000003fffeb800 0000000000000001 00007dfb737f0000
GPR16: c0002001edcca558 0000000000000000 0000000000000000 0000000000000001
GPR20: c000000001b21258 c0002001edcca558 0000000000000018 0000000000000000
GPR24: 0000000001000000 ffffffffffffffff 0000000000000001 0000000000001500
GPR28: c0002001edcc4278 c00000037dd80000 800000050280f033 c000000375ccc720
[345523.706062] NIP [c0080000072cb9c0] kvmhv_p9_tm_emulation+0x68/0x620 [kvm_hv]
[345523.706065] LR [c0080000072b5e80] kvmppc_handle_exit_hv.isra.53+0x3e8/0x798 [kvm_hv]
[345523.706066] Call Trace:
[345523.706069] [c000000399467910] [c000000399467940] 0xc000000399467940 (unreliable)
[345523.706071] [c000000399467950] [c000000399467980] 0xc000000399467980
[345523.706075] [c0000003994679f0] [c0080000072bd1c4] kvmhv_run_single_vcpu+0xa1c/0xb80 [kvm_hv]
[345523.706079] [c000000399467ac0] [c0080000072bd8e0] kvmppc_vcpu_run_hv+0x5b8/0xb00 [kvm_hv]
[345523.706087] [c000000399467b90] [c0080000085c93cc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[345523.706095] [c000000399467bb0] [c0080000085c582c] kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
[345523.706101] [c000000399467c40] [c0080000085b7498] kvm_vcpu_ioctl+0x3d0/0x7b0 [kvm]
[345523.706105] [c000000399467db0] [c0000000004adf9c] ksys_ioctl+0x13c/0x170
[345523.706107] [c000000399467e00] [c0000000004adff8] sys_ioctl+0x28/0x80
[345523.706111] [c000000399467e20] [c00000000000b278] system_call+0x5c/0x68
[345523.706112] Instruction dump:
[345523.706114] 419e0390 7f8a4840 409d0048 6d497c00 2f89075d 419e021c 6d497c00 2f8907dd
[345523.706119] 419e01c0 6d497c00 2f8905dd 419e00a4 <0fe00000> 38210040 38600000 ebc1fff0
and then treats the executed instruction as a 'nop'.
However the POWER9 User's Manual, in section "4.6.10 Book II Invalid
Forms", informs that for TM instructions bit 31 is in fact ignored, thus
for the TM-related invalid forms ignoring bit 31 and handling them like the
valid forms is an acceptable way to handle them. POWER8 behaves the same
way too.
This commit changes the handling of the cases here described by treating
the TM-related invalid forms that can generate a softpatch interrupt
just like their valid forms (w/ bit 31 = 1) instead of as a 'nop' and by
gently reporting any other unrecognized case to the host and treating it as
illegal instruction instead of throwing a trace and treating it as a 'nop'.
Signed-off-by: Gustavo Romero <gromero@linux.ibm.com>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
Acked-By: Michael Neuling <mikey@neuling.org>
Reviewed-by: Leonardo Bras <leonardo@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
In kvmppc_unmap_free_pte() in book3s_64_mmu_radix.c, we use the
non-constant value PTE_INDEX_SIZE to clear a PTE page.
We can instead use the constant RADIX_PTE_INDEX_SIZE, because we know
this code will only be running when the Radix MMU is active.
Note that we already use RADIX_PTE_INDEX_SIZE for the allocation of
kvm_pte_cache.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Leonardo Bras <leonardo@linux.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
This makes the same changes in the page fault handler for HPT guests
that commits 31c8b0d0694a ("KVM: PPC: Book3S HV: Use __gfn_to_pfn_memslot()
in page fault handler", 2018-03-01), 71d29f43b633 ("KVM: PPC: Book3S HV:
Don't use compound_order to determine host mapping size", 2018-09-11)
and 6579804c4317 ("KVM: PPC: Book3S HV: Avoid crash from THP collapse
during radix page fault", 2018-10-04) made for the page fault handler
for radix guests.
In summary, where we used to call get_user_pages_fast() and then do
special handling for VM_PFNMAP vmas, we now call __get_user_pages_fast()
and then __gfn_to_pfn_memslot() if that fails, followed by reading the
Linux PTE to get the host PFN, host page size and mapping attributes.
This also brings in the change from SetPageDirty() to set_page_dirty_lock()
which was done for the radix page fault handler in commit c3856aeb2940
("KVM: PPC: Book3S HV: Fix handling of large pages in radix page fault
handler", 2018-02-23).
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
The NDD_ALIASING flag is used to indicate where pmem capacity might
alias with blk capacity and require labeling. It is also used to
indicate whether the DIMM supports labeling. Separate this latter
capability into its own flag so that the NDD_ALIASING flag is scoped to
true aliased configurations.
To my knowledge aliased configurations only exist in the ACPI spec,
there are no known platforms that ship this support in production.
This clarity allows namespace-capacity alignment constraints around
interleave-ways to be relaxed.
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Link: https://lore.kernel.org/r/158041477856.3889308.4212605617834097674.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|