Age | Commit message (Collapse) | Author | Files | Lines |
|
The powernv PCI code stores NPU data in the pnv_phb struct. The latter
is referenced by pci_controller::private_data. We are going to have NPU2
support in the pseries platform as well but it does not store any
private_data in in the pci_controller struct; and even if it did,
it would be a different data structure.
This makes npu a pointer and stores it one level higher in
the pci_controller struct.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This new memory does not have page structs as it is not plugged to
the host so gup() will fail anyway.
This adds 2 helpers:
- mm_iommu_newdev() to preregister the "memory device" memory so
the rest of API can still be used;
- mm_iommu_is_devmem() to know if the physical address is one of thise
new regions which we must avoid unpinning of.
This adds @mm to tce_page_is_contained() and iommu_tce_xchg() to test
if the memory is device memory to avoid pfn_to_page().
This adds a check for device memory in mm_iommu_ua_mark_dirty_rm() which
does delayed pages dirtying.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Normally mm_iommu_get() should add a reference and mm_iommu_put() should
remove it. However historically mm_iommu_find() does the referencing and
mm_iommu_get() is doing allocation and referencing.
We are going to add another helper to preregister device memory so
instead of having mm_iommu_new() (which pre-registers the normal memory
and references the region), we need separate helpers for pre-registering
and referencing.
This renames:
- mm_iommu_get to mm_iommu_new;
- mm_iommu_find to mm_iommu_get.
This changes mm_iommu_get() to reference the region so the name now
reflects what it does.
This removes the check for exact match from mm_iommu_new() as we want it
to fail on existing regions; mm_iommu_get() should be used instead.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The skiboot firmware has a hot reset handler which fences the NVIDIA V100
GPU RAM on Witherspoons and makes accesses no-op instead of throwing HMIs:
https://github.com/open-power/skiboot/commit/fca2b2b839a67
Now we are going to pass V100 via VFIO which most certainly involves
KVM guests which are often terminated without getting a chance to offline
GPU RAM so we end up with a running machine with misconfigured memory.
Accessing this memory produces hardware management interrupts (HMI)
which bring the host down.
To suppress HMIs, this wires up this hot reset hook to vfio_pci_disable()
via pci_disable_device() which switches NPU2 to a safe mode and prevents
HMIs.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
On the 8xx, no-execute is set via PPP bits in the PTE. Therefore
a no-exec fault generates DSISR_PROTFAULT error bits,
not DSISR_NOEXEC_OR_G.
This patch adds DSISR_PROTFAULT in the test mask.
Fixes: d3ca587404b3 ("powerpc/mm: Fix reporting of kernel execute faults")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
System call table generation script must be run to gener-
ate unistd_32/64.h and syscall_table_32/64/c32/spu.h files.
This patch will have changes which will invokes the script.
This patch will generate unistd_32/64.h and syscall_table-
_32/64/c32/spu.h files by the syscall table generation
script invoked by parisc/Makefile and the generated files
against the removed files must be identical.
The generated uapi header file will be included in uapi/-
asm/unistd.h and generated system call table header file
will be included by kernel/systbl.S file.
Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The system call tables are in different format in all
architecture and it will be difficult to manually add or
modify the system calls in the respective files. To make
it easy by keeping a script and which will generate the
uapi header and syscall table file. This change will also
help to unify the implementation across all architectures.
The system call table generation script is added in
syscalls directory which contain the script to generate
both uapi header file and system call table files.
The syscall.tbl file will be the input for the scripts.
syscall.tbl contains the list of available system calls
along with system call number and corresponding entry point.
Add a new system call in this architecture will be possible
by adding new entry in the syscall.tbl file.
Adding a new table entry consisting of:
- System call number.
- ABI.
- System call name.
- Entry point name.
- Compat entry name, if required.
syscallhdr.sh and syscalltbl.sh will generate uapi header-
unistd_32/64.h and syscall_table_32/64/c32/spu.h files
respectively. File syscall_table_32/64/c32/spu.h is incl-
uded by syscall.S - the real system call table. Both *.sh
files will parse the content syscall.tbl to generate the
header and table files.
ARM, s390 and x86 architecuture does have similar support.
I leverage their implementation to come up with a generic
solution.
Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
PowerPC uses a syscall table with native and compat calls
interleaved, which is a slightly simpler way to define two
matching tables.
As we move to having the tables generated, that advantage
is no longer important, but the interleaved table gets in
the way of using the same scripts as on the other archit-
ectures.
Split out a new compat_sys_call_table symbol that contains
all the compat calls, and leave the main table for the nat-
ive calls, to more closely match the method we use every-
where else.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Move the macro definition for compat_sys_sigsuspend from
asm/systbl.h to the file which it is getting included.
One of the patch in this patch series is generating uapi
header and syscall table files. In order to come up with
a common implimentation across all architecture, we need
to do this change.
This change will simplify the implementation of system
call table generation script and help to come up a common
implementation across all architecture.
Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
NR_syscalls macro holds the number of system call exist
in powerpc architecture. We have to change the value of
NR_syscalls, if we add or delete a system call.
One of the patch in this patch series has a script which
will generate a uapi header based on syscall.tbl file.
The syscall.tbl file contains the number of system call
information. So we have two option to update NR_syscalls
value.
1. Update NR_syscalls in asm/unistd.h manually by count-
ing the no.of system calls. No need to update NR_sys-
calls until we either add a new system call or delete
existing system call.
2. We can keep this feature in above mentioned script,
that will count the number of syscalls and keep it in
a generated file. In this case we don't need to expli-
citly update NR_syscalls in asm/unistd.h file.
The 2nd option will be the recommended one. For that, I
added the __NR_syscalls macro in uapi/asm/unistd.h along
with NR_syscalls asm/unistd.h. The macro __NR_syscalls
also added for making the name convention same across all
architecture. While __NR_syscalls isn't strictly part of
the uapi, having it as part of the generated header to
simplifies the implementation. We also need to enclose
this macro with #ifdef __KERNEL__ to avoid side effects.
Signed-off-by: Firoz Khan <firoz.khan@linaro.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Protection key tracking information is not copied over to the
mm_struct of the child during fork(). This can cause the child to
erroneously allocate keys that were already allocated. Any allocated
execute-only key is lost aswell.
Add code; called by dup_mmap(), to copy the pkey state from parent to
child explicitly.
This problem was originally found by Dave Hansen on x86, which turns
out to be a problem on powerpc aswell.
Fixes: cf43d3b26452 ("powerpc: Enable pkey subsystem")
Cc: stable@vger.kernel.org # v4.16+
Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The AFU Descriptor Template in the PCI config space has a Name Space
field which is a 24 Byte ASCII character string of descriptive name
space for the AFU. The OCXL driver read the string four characters at
a time with pci_read_config_dword().
This optimization is valid on a little-endian system since this is PCI,
but a big-endian system ends up with each subset of four characters in
reverse order.
This could be fixed by switching to read characters one by one. Another
option is to swap the bytes if we're big-endian.
Go for the latter with le32_to_cpu().
Cc: stable@vger.kernel.org # v4.16
Signed-off-by: Greg Kurz <groug@kaod.org>
Acked-by: Frederic Barrat <fbarrat@linux.ibm.com>
Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
This is a new test case that creates a signal and starts a suspended
transaction inside the signal handler.
It returns from the signal handler with the CPU at suspended state, but
without setting user context MSR Transaction State (TS) field.
The kernel signal handler code should be able to handle this discrepancy
instead of crashing.
This code could be compiled and used to test 32 and 64-bits signal
handlers.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
There is a TM Bad Thing bug that can be caused when you return from a
signal context in a suspended transaction but with ucontext MSR[TS] unset.
This forces regs->msr[TS] to be set at syscall entrance (since the CPU
state is transactional). It also calls treclaim() to flush the transaction
state, which is done based on the live (mfmsr) MSR state.
Since user context MSR[TS] is not set, then restore_tm_sigcontexts() is not
called, thus, not executing recheckpoint, keeping the CPU state as not
transactional. When calling rfid, SRR1 will have MSR[TS] set, but the CPU
state is non transactional, causing the TM Bad Thing with the following
stack:
[ 33.862316] Bad kernel stack pointer 3fffd9dce3e0 at c00000000000c47c
cpu 0x8: Vector: 700 (Program Check) at [c00000003ff7fd40]
pc: c00000000000c47c: fast_exception_return+0xac/0xb4
lr: 00003fff865f442c
sp: 3fffd9dce3e0
msr: 8000000102a03031
current = 0xc00000041f68b700
paca = 0xc00000000fb84800 softe: 0 irq_happened: 0x01
pid = 1721, comm = tm-signal-sigre
Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
WARNING: exception is not recoverable, can't continue
The same problem happens on 32-bits signal handler, and the fix is very
similar, if tm_recheckpoint() is not executed, then regs->msr[TS] should be
zeroed.
This patch also fixes a sparse warning related to lack of indentation when
CONFIG_PPC_TRANSACTIONAL_MEM is set.
Fixes: 2b0a576d15e0e ("powerpc: Add new transactional memory state to the signal context")
CC: Stable <stable@vger.kernel.org> # 3.10+
Signed-off-by: Breno Leitao <leitao@debian.org>
Tested-by: Michal Suchánek <msuchanek@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Usually a TM Bad Thing exception is raised due to three different problems.
a) touching SPRs in an active transaction; b) using TM instruction with the
facility disabled and c) setting a wrong MSR/SRR1 at RFID.
The two initial cases are easy to identify by looking at the instructions.
The latter case is harder, because the MSR is masked after RFID, so, it is
very useful to look at the previous MSR (SRR1) before RFID as also the
current and masked MSR.
Since MSR is saved at paca just before RFID, this patch prints it if a TM
Bad thing happen, helping to understand what is the invalid TM transition
that is causing the exception.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
As other exit points, move SRR1 (MSR) into paca->tm_scratch, so, if
there is a TM Bad Thing in RFID, it is easy to understand what was the
SRR1 value being used.
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
On a signal handler return, the user could set a context with MSR[TS] bits
set, and these bits would be copied to task regs->msr.
At restore_tm_sigcontexts(), after current task regs->msr[TS] bits are set,
several __get_user() are called and then a recheckpoint is executed.
This is a problem since a page fault (in kernel space) could happen when
calling __get_user(). If it happens, the process MSR[TS] bits were
already set, but recheckpoint was not executed, and SPRs are still invalid.
The page fault can cause the current process to be de-scheduled, with
MSR[TS] active and without tm_recheckpoint() being called. More
importantly, without TEXASR[FS] bit set also.
Since TEXASR might not have the FS bit set, and when the process is
scheduled back, it will try to reclaim, which will be aborted because of
the CPU is not in the suspended state, and, then, recheckpoint. This
recheckpoint will restore thread->texasr into TEXASR SPR, which might be
zero, hitting a BUG_ON().
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
cpu 0xb: Vector: 700 (Program Check) at [c00000041f1576d0]
pc: c000000000054550: restore_gprs+0xb0/0x180
lr: 0000000000000000
sp: c00000041f157950
msr: 8000000100021033
current = 0xc00000041f143000
paca = 0xc00000000fb86300 softe: 0 irq_happened: 0x01
pid = 1021, comm = kworker/11:1
kernel BUG at /build/linux-sf3Co9/linux-4.9.30/arch/powerpc/kernel/tm.S:434!
Linux version 4.9.0-3-powerpc64le (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.30-2+deb9u2 (2017-06-26)
enter ? for help
[c00000041f157b30] c00000000001bc3c tm_recheckpoint.part.11+0x6c/0xa0
[c00000041f157b70] c00000000001d184 __switch_to+0x1e4/0x4c0
[c00000041f157bd0] c00000000082eeb8 __schedule+0x2f8/0x990
[c00000041f157cb0] c00000000082f598 schedule+0x48/0xc0
[c00000041f157ce0] c0000000000f0d28 worker_thread+0x148/0x610
[c00000041f157d80] c0000000000f96b0 kthread+0x120/0x140
[c00000041f157e30] c00000000000c0e0 ret_from_kernel_thread+0x5c/0x7c
This patch simply delays the MSR[TS] set, so, if there is any page fault in
the __get_user() section, it does not have regs->msr[TS] set, since the TM
structures are still invalid, thus avoiding doing TM operations for
in-kernel exceptions and possible process reschedule.
With this patch, the MSR[TS] will only be set just before recheckpointing
and setting TEXASR[FS] = 1, thus avoiding an interrupt with TM registers in
invalid state.
Other than that, if CONFIG_PREEMPT is set, there might be a preemption just
after setting MSR[TS] and before tm_recheckpoint(), thus, this block must
be atomic from a preemption perspective, thus, calling
preempt_disable/enable() on this code.
It is not possible to move tm_recheckpoint to happen earlier, because it is
required to get the checkpointed registers from userspace, with
__get_user(), thus, the only way to avoid this undesired behavior is
delaying the MSR[TS] set.
The 32-bits signal handler seems to be safe this current issue, but, it
might be exposed to the preemption issue, thus, disabling preemption in
this chunk of code.
Changes from v2:
* Run the critical section with preempt_disable.
Fixes: 87b4e5393af7 ("powerpc/tm: Fix return of active 64bit signals")
Cc: stable@vger.kernel.org (v3.9+)
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The rc bits contained in ptes are used to track whether a page has been
accessed and whether it is dirty. The accessed bit is used to age a page
and the dirty bit to track whether a page is dirty or not.
Now that we support nested guests there are three ptes which track the
state of the same page:
- The partition-scoped page table in the L1 guest, mapping L2->L1 address
- The partition-scoped page table in the host for the L1 guest, mapping
L1->L0 address
- The shadow partition-scoped page table for the nested guest in the host,
mapping L2->L0 address
The idea is to attempt to keep the rc state of these three ptes in sync,
both when setting and when clearing rc bits.
When setting the bits we achieve consistency by:
- Initially setting the bits in the shadow page table as the 'and' of the
other two.
- When updating in software the rc bits in the shadow page table we
ensure the state is consistent with the other two locations first, and
update these before reflecting the change into the shadow page table.
i.e. only set the bits in the L2->L0 pte if also set in both the
L2->L1 and the L1->L0 pte.
When clearing the bits we achieve consistency by:
- The rc bits in the shadow page table are only cleared when discarding
a pte, and we don't need to record this as if either bit is set then
it must also be set in the pte mapping L1->L0.
- When L1 clears an rc bit in the L2->L1 mapping it __should__ issue a
tlbie instruction
- This means we will discard the pte from the shadow page table
meaning the mapping will have to be setup again.
- When setup the pte again in the shadow page table we will ensure
consistency with the L2->L1 pte.
- When the host clears an rc bit in the L1->L0 mapping we need to also
clear the bit in any ptes in the shadow page table which map the same
gfn so we will be notified if a nested guest accesses the page.
This case is what this patch specifically concerns.
- We can search the nest_rmap list for that given gfn and clear the
same bit from all corresponding ptes in shadow page tables.
- If a nested guest causes either of the rc bits to be set by software
in future then we will update the L1->L0 pte and maintain consistency.
With the process outlined above we aim to maintain consistency of the 3
pte locations where we track rc for a given guest page.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
Introduce a function kvmhv_update_nest_rmap_rc_list() which for a given
nest_rmap list will traverse it, find the corresponding pte in the shadow
page tables, and if it still maps the same host page update the rc bits
accordingly.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
The shadow page table contains ptes for translations from nested guest
address to host address. Currently when creating these ptes we take the
rc bits from the pte for the L1 guest address to host address
translation. This is incorrect as we must also factor in the rc bits
from the pte for the nested guest address to L1 guest address
translation (as contained in the L1 guest partition table for the nested
guest).
By not calculating these bits correctly L1 may not have been correctly
notified when it needed to update its rc bits in the partition table it
maintains for its nested guest.
Modify the code so that the rc bits in the resultant pte for the L2->L0
translation are the 'and' of the rc bits in the L2->L1 pte and the L1->L0
pte, also accounting for whether this was a write access when setting
the dirty bit.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
Nested rmap entries are used to store the translation from L1 gpa to L2
gpa when entries are inserted into the shadow (nested) page tables. This
rmap list is located by indexing the rmap array in the memslot by L1
gfn. When we come to search for these entries we only know the L1 page size
(which could be PAGE_SIZE, 2M or a 1G page) and so can only select a gfn
aligned to that size. This means that when we insert the entry, so we can
find it later, we need to align the gfn we use to select the rmap list
in which to insert the entry to L1 page size as well.
By not doing this we were missing nested rmap entries when modifying L1
ptes which were for a page also passed through to an L2 guest.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
We already hold the kvm->mmu_lock spin lock across updating the rc bits
in the pte for the L1 guest. Continue to hold the lock across updating
the rc bits in the pte for the nested guest as well to prevent
invalidations from occurring.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
Alexei reported use after frees in inet_diag_dump_icsk() [1]
Because we use refcount_set() when various sockets are setup and
inserted into ehash, we also need to make sure inet_diag_dump_icsk()
wont race with the refcount_set() operations.
Jonathan Lemon sent a patch changing net_twsk_hashdance() but
other spots would need risky changes.
Instead, fix inet_diag_dump_icsk() as this bug came with
linux-4.10 only.
[1] Quoting Alexei :
First something iterating over sockets finds already freed tw socket:
refcount_t: increment on 0; use-after-free.
WARNING: CPU: 2 PID: 2738 at lib/refcount.c:153 refcount_inc+0x26/0x30
RIP: 0010:refcount_inc+0x26/0x30
RSP: 0018:ffffc90004c8fbc0 EFLAGS: 00010282
RAX: 000000000000002b RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff88085ee9d680 RSI: ffff88085ee954c8 RDI: ffff88085ee954c8
RBP: ffff88010ecbd2c0 R08: 0000000000000000 R09: 000000000000174c
R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8806ba9bf210 R14: ffffffff82304600 R15: ffff88010ecbd328
FS: 00007f81f5a7d700(0000) GS:ffff88085ee80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f81e2a95000 CR3: 000000069b2eb006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
inet_diag_dump_icsk+0x2b3/0x4e0 [inet_diag] // sock_hold(sk); in net/ipv4/inet_diag.c:1002
? kmalloc_large_node+0x37/0x70
? __kmalloc_node_track_caller+0x1cb/0x260
? __alloc_skb+0x72/0x1b0
? __kmalloc_reserve.isra.40+0x2e/0x80
__inet_diag_dump+0x3b/0x80 [inet_diag]
netlink_dump+0x116/0x2a0
netlink_recvmsg+0x205/0x3c0
sock_read_iter+0x89/0xd0
__vfs_read+0xf7/0x140
vfs_read+0x8a/0x140
SyS_read+0x3f/0xa0
do_syscall_64+0x5a/0x100
then a minute later twsk timer fires and hits two bad refcnts
for this freed socket:
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 31 PID: 0 at lib/refcount.c:228 refcount_dec+0x2e/0x40
Modules linked in:
RIP: 0010:refcount_dec+0x2e/0x40
RSP: 0018:ffff88085f5c3ea8 EFLAGS: 00010296
RAX: 000000000000002c RBX: ffff88010ecbd2c0 RCX: 000000000000083f
RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
RBP: ffffc90003c77280 R08: 0000000000000000 R09: 00000000000017d3
R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffffffff82ad2d80
R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
inet_twsk_kill+0x9d/0xc0 // inet_twsk_bind_unhash(tw, hashinfo);
call_timer_fn+0x29/0x110
run_timer_softirq+0x36b/0x3a0
refcount_t: underflow; use-after-free.
WARNING: CPU: 31 PID: 0 at lib/refcount.c:187 refcount_sub_and_test+0x46/0x50
RIP: 0010:refcount_sub_and_test+0x46/0x50
RSP: 0018:ffff88085f5c3eb8 EFLAGS: 00010296
RAX: 0000000000000026 RBX: ffff88010ecbd2c0 RCX: 000000000000083f
RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000003f
RBP: ffff88010ecbd358 R08: 0000000000000000 R09: 000000000000185b
R10: ffffffff81e7c5a0 R11: 0000000000000000 R12: ffff88010ecbd358
R13: ffffffff8182de00 R14: ffff88085f5c3ef8 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88085f5c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbe42685250 CR3: 0000000002209001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
inet_twsk_put+0x12/0x20 // inet_twsk_put(tw);
call_timer_fn+0x29/0x110
run_timer_softirq+0x36b/0x3a0
Fixes: 67db3e4bfbc9 ("tcp: no longer hold ehash lock while calling tcp_get_info()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexei Starovoitov <ast@kernel.org>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Arjun Vynipadath will be taking over as maintainer from now.
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
For fadump to work successfully there should not be any holes in reserved
memory ranges where kernel has asked firmware to move the content of old
kernel memory in event of crash. Now that fadump uses CMA for reserved
area, this memory area is now not protected from hot-remove operations
unless it is cma allocated. Hence, fadump service can fail to re-register
after the hot-remove operation, if hot-removed memory belongs to fadump
reserved region. To avoid this make sure that memory from fadump reserved
area is not hot-removable if fadump is registered.
However, if user still wants to remove that memory, he can do so by
manually stopping fadump service before hot-remove operation.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
fadump fails to register when there are holes in reserved memory area.
This can happen if user has hot-removed a memory that falls in the
fadump reserved memory area. Throw a meaningful error message to the
user in such case.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
[mpe: is_reserved_memory_area_contiguous() returns bool, unsplit string]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
One of the primary issues with Firmware Assisted Dump (fadump) on Power
is that it needs a large amount of memory to be reserved. On large
systems with TeraBytes of memory, this reservation can be quite
significant.
In some cases, fadump fails if the memory reserved is insufficient, or
if the reserved memory was DLPAR hot-removed.
In the normal case, post reboot, the preserved memory is filtered to
extract only relevant areas of interest using the makedumpfile tool.
While the tool provides flexibility to determine what needs to be part
of the dump and what memory to filter out, all supported distributions
default this to "Capture only kernel data and nothing else".
We take advantage of this default and the Linux kernel's Contiguous
Memory Allocator (CMA) to fundamentally change the memory reservation
model for fadump.
Instead of setting aside a significant chunk of memory nobody can use,
this patch uses CMA instead, to reserve a significant chunk of memory
that the kernel is prevented from using (due to MIGRATE_CMA), but
applications are free to use it. With this fadump will still be able
to capture all of the kernel memory and most of the user space memory
except the user pages that were present in CMA region.
Essentially, on a P9 LPAR with 2 cores, 8GB RAM and current upstream:
[root@zzxx-yy10 ~]# free -m
total used free shared buff/cache available
Mem: 7557 193 6822 12 541 6725
Swap: 4095 0 4095
With this patch:
[root@zzxx-yy10 ~]# free -m
total used free shared buff/cache available
Mem: 8133 194 7464 12 475 7338
Swap: 4095 0 4095
Changes made here are completely transparent to how fadump has
traditionally worked.
Thanks to Aneesh Kumar and Anshuman Khandual for helping us understand
CMA and its usage.
TODO:
- Handle case where CMA reservation spans nodes.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
opal_power_control_init() depends on opal message notifier to be
initialized, which is done in opal_init()->opal_message_init(). But both
these initialization are called through machine initcalls and it all
depends on in which order they being called. So far these are called in
correct order (may be we got lucky) and never saw any issue. But it is
clearer to control initialization order explicitly by moving
opal_power_control_init() into opal_init().
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
The script "checkpatch.pl" pointed information out like the following.
WARNING: void function return statements are not generally useful
Thus remove such a statement in the affected functions.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Omit an extra message for a memory allocation failure in these
functions.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
A single character (line break) should be put into a sequence.
Thus use the corresponding function "seq_putc".
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Some data were printed into a sequence by four separate function calls.
Print the same data by two single function calls instead.
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
CONFIG_EARLY_DEBUG_CPM requires IMMR area TLB to be pinned
otherwise it doesn't survive MMU_init, and the boot fails.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
CONFIG_PCI_MSI was made mandatory by commit a311e738b6d8
("powerpc/powernv: Make PCI non-optional") so the #ifdef
checks around CONFIG_PCI_MSI here can be removed entirely.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Fix a spelling mistake in a register description.
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
Commit 14c63f17b1fde ("perf: Drop sample rate when sampling is too
slow") introduced a way to throttle PMU interrupts if we're spending
too much time just processing those. Wire up powerpc PMI handler to
use this infrastructure.
We have throttling of the *rate* of interrupts, but this adds
throttling based on the *time taken* to process the interrupts.
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|
|
It was reported that IPsec would crash when it encounters an IPv6
reassembled packet because skb->sk is non-zero and not a valid
pointer.
This is because skb->sk is now a union with ip_defrag_offset.
This patch fixes this by resetting skb->sk when exiting from
the reassembly code.
Reported-by: Xiumei Mu <xmu@redhat.com>
Fixes: 219badfaade9 ("ipv6: frags: get rid of ip6frag_skb_cb/...")
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The MAC table in Ocelot supports auto aging (normal) and static entries.
MAC entries that is manually configured should be static and not subject
to aging.
Fixes: a556c76adc05 ("net: mscc: Add initial Ocelot switch support")
Signed-off-by: Allan Nielsen <allan.nielsen@microchip.com>
Reviewed-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"I2C has a MAINTAINERS update for you, so people will be immediately
pointed to the right person for this previously orphaned driver.
And one of Arnd's build warning fixes for a new driver added this
cycle"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: nvidia-gpu: mark resume function as __maybe_unused
MAINTAINERS: add entry for i2c-axxia driver
|
|
Pull UBI/UBIFS fixes from Richard Weinberger:
- Kconfig dependency fixes for our new auth feature
- Fix for selecting the right compressor when creating a fs
- Bugfix for a bug in UBIFS's O_TMPFILE implementation
- Refcounting fixes for UBI
* tag 'upstream-4.20-rc7' of git://git.infradead.org/linux-ubifs:
ubifs: Handle re-linking of inodes correctly while recovery
ubi: Do not drop UBI device reference before using
ubi: Put MTD device after it is not used
ubifs: Fix default compression selection in ubifs
ubifs: Fix memory leak on error condition
ubifs: auth: Add CONFIG_KEYS dependency
ubifs: CONFIG_UBIFS_FS_AUTHENTICATION should depend on UBIFS_FS
ubifs: replay: Fix high stack usage
|
|
Clang warns:
drivers/acpi/tables.c:715:14: warning: unused variable 'amlcode'
[-Wunused-variable]
static void *amlcode __attribute__ ((weakref("AmlCode")));
^
drivers/acpi/tables.c:716:14: warning: unused variable 'dsdt_amlcode'
[-Wunused-variable]
static void *dsdt_amlcode __attribute__ ((weakref("dsdt_aml_code")));
^
2 warnings generated.
The only uses of these variables are hiddem behind CONFIG_ACPI_CUSTOM_DSDT
so do the same thing here.
Fixes: 82e4eb4e9653 (ACPI / tables: add DSDT AmlCode new declaration name support)
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
In __ghes_panic() clear the block status in the APEI generic
error status block for that generic hardware error source before
calling panic() to prevent a second panic() in the crash kernel
for exactly the same fatal error.
Otherwise ghes_probe(), running in the crash kernel, would see
an unhandled error in the APEI generic error status block and
panic again, thereby precluding any crash dump.
Signed-off-by: Lenny Szubowicz <lszubowi@redhat.com>
Signed-off-by: David Arcari <darcari@redhat.com>
Tested-by: Tyler Baicar <baicar.tyler@gmail.com>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Mapping the delay slot emulation page as both writeable & executable
presents a security risk, in that if an exploit can write to & jump into
the page then it can be used as an easy way to execute arbitrary code.
Prevent this by mapping the page read-only for userland, and using
access_process_vm() with the FOLL_FORCE flag to write to it from
mips_dsemul().
This will likely be less efficient due to copy_to_user_page() performing
cache maintenance on a whole page, rather than a single line as in the
previous use of flush_cache_sigtramp(). However this delay slot
emulation code ought not to be running in any performance critical paths
anyway so this isn't really a problem, and we can probably do better in
copy_to_user_page() anyway in future.
A major advantage of this approach is that the fix is small & simple to
backport to stable kernels.
Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paul Burton <paul.burton@mips.com>
Fixes: 432c6bacbd0c ("MIPS: Use per-mm page to execute branch delay slot instructions")
Cc: stable@vger.kernel.org # v4.8+
Cc: linux-mips@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Rich Felker <dalias@libc.org>
Cc: David Daney <david.daney@cavium.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
- Implement BPF based syscall filtering in 'perf trace', using BPF maps and
the augmented_raw_syscalls.c BPF proggie (Arnaldo Carvalho de Melo)
- Allow specifying in .perfconfig a set of events use in 'perf trace' in
addition to any other specified from the command line. This initially
will be used to always use the augmented_raw_syscalls.o precompiled
BPF program for getting pointer contents. (Arnaldo Carvalho de Melo)
- Allow fine grained control about how the syscall output should be
formatted. This will be used to allow producing the same output produced
by the 'strace' tool, to then use in regression tests comparing the
output of 'perf trace' with the one produced from 'strace' (Arnaldo Carvalho de Melo)
- Beautify the renameat2 olddirfd, newdirfd and flags arguments (Arnaldo Carvalho de Melo)
- Beautify arch_prctl 'code' syscall arg (Arnaldo Carvalho de Melo)
- Beautify fadvise64 'advice' syscall arg (Arnaldo Carvalho de Melo)
- Relax checks on perf-PID.map ownership, resulting in symbols in
executable anonymous maps setup by JITs in things like node.js to
be resolved in a 'perf top' session run by root without the need
for --force to be used (Arnaldo Carvalho de Melo)
- Update asm-generic/unistd.h copy (Arnaldo Carvalho de Melo)
- Do not use the first and last symbols when setting up address filters in
auxtrace, this fails when we don't have a symbol table, filter the entire
area based on the dso size. (Adrian Hunter)
- Do not use kernel headers to build libsubcmd, we shouldn't use
anything from outside tools/, fixes the build with the Android NDK (Arnaldo Carvalho de Melo)
- Add several prototypes for systems lacking those, such as open_memstream(),
sigqueue(), fixing warnings building with Android's bionic libc that were
preventing the use of -Werror there (Arnaldo Carvalho de Melo)
- Use LDFLAGS in the libtraceevent build commands, allowing developers
to override its values (Jiri Olsa)
- Link libperf-jvmti.so with LDFLAGS variable, allowing distro
packages to propagate its settings when building this library (Jiri Olsa)
- cs-etm (ARM CoreSight) fixes: (Leo Yan)
- Correct packets swapping in cs_etm__flush()
- Avoid stale branch samples when flush packet
- Remove unused 'trace_on' in cs_etm_decoder
- Refactor enumeration cs_etm_sample_type
- Rename CS_ETM_TRACE_ON to CS_ETM_DISCONTINUITY
- Treat NO_SYNC element as trace discontinuity
- Treat EO_TRACE element as trace discontinuity
- Generate branch sample for exception packet
- Use shebangs in the 'perf test' shell scripts, making them identifiable as
shell scripts (Michael Petlan)
- Avoid segfaults caused by negated options in 'perf stat' (Michael Petlan)
- Fix processing of dereferenced args in bprintk events in libtracevent (Steven Rostedt)
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
Fix spectre v1 vuln in drm_ioctl
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
From: Sean Paul <sean@poorly.run>
Link: https://patchwork.freedesktop.org/patch/msgid/20181220165740.GA42344@art_vandelay
|
|
|
|
|
|
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k fix from Geert Uytterhoeven:
"Fix memblock-related crashes"
* tag 'm68k-for-v4.20-tag2' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k: Fix memblock-related crashes
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fix from Masahiro Yamada:
"Fix false positive warning/error about missing library for objtool"
* tag 'kbuild-fixes-v4.20-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: fix false positive warning/error about missing libelf
|