Age | Commit message (Collapse) | Author | Files | Lines |
|
Hyper-V TLB flush hypercalls definitions will be required for KVM so move
them hyperv-tlfs.h. Structures also need to be renamed as '_pcpu' suffix is
irrelevant for a general-purpose definition.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
To resolve conflicts with the PV TLB flush series.
|
|
The CLDEMOTE instruction hints to hardware that the cache line that
contains the linear address should be moved("demoted") from
the cache(s) closest to the processor core to a level more distant
from the processor core. This may accelerate subsequent accesses
to the line by other cores in the same coherence domain,
especially if the line was written by the core that demotes the line.
This patch exposes the cldemote feature to the guest.
The release document ref below link:
https://software.intel.com/sites/default/files/managed/c5/15/\
architecture-instruction-set-extensions-programming-reference.pdf
This patch has a dependency on https://lkml.org/lkml/2018/4/23/928
Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
invvpid
When vmcs12 uses VPID, all TLB entries populated by L2 are tagged with
vmx->nested.vpid02. Currently, INVVPID executed by L1 is emulated by L0
by using INVVPID single/global-context to flush all TLB entries
tagged with vmx->nested.vpid02 regardless of INVVPID type executed by
L1.
However, we can easily optimize the case of L1 INVVPID on an
individual-address. Just INVVPID given individual-address tagged with
vmx->nested.vpid02.
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
[Squashed with a preparatory patch that added the !operand.vpid line.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Since commit 5c614b3583e7 ("KVM: nVMX: nested VPID emulation"),
vmcs01 and vmcs02 don't share the same VPID. vmcs01 uses vmx->vpid
while vmcs02 uses vmx->nested.vpid02. This was done such that TLB
flush could be avoided when switching between L1 and L2.
However, the above mentioned commit only changed L2 VMEntry logic to
not flush TLB when switching from L1 to L2. It forgot to also remove
the TLB flush which is done when simulating a VMExit from L2 to L1.
To fix this issue, on VMExit from L2 to L1 we flush TLB only in case
vmcs01 enables VPID and vmcs01->vpid==vmcs02->vpid. This happens when
vmcs01 enables VPID and vmcs12 does not.
Fixes: 5c614b3583e7 ("KVM: nVMX: nested VPID emulation")
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
This is a fix from reviewing the code, but it looks like it might be
able to lead to an Oops. It affects 32bit systems.
The KVM_MEMORY_ENCRYPT_REG_REGION ioctl uses a u64 for range->addr and
range->size but the high 32 bits would be truncated away on a 32 bit
system. This is harmless but it's also harmless to prevent it.
Then in sev_pin_memory() the "uaddr + ulen" calculation can wrap around.
The wrap around can happen on 32 bit or 64 bit systems, but I was only
able to figure out a problem for 32 bit systems. We would pick a number
which results in "npages" being zero. The sev_pin_memory() would then
return ZERO_SIZE_PTR without allocating anything.
I made it illegal to call sev_pin_memory() with "ulen" set to zero.
Hopefully, that doesn't cause any problems. I also changed the type of
"first" and "last" to long, just for cosmetic reasons. Otherwise on a
64 bit system you're saving "uaddr >> 12" in an int and it truncates the
high 20 bits away. The math works in the current code so far as I can
see but it's just weird.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
[Brijesh noted that the code is only reachable on X86_64.]
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
handle_mmio_page_fault() was recently moved to be an internal-only
MMU function, i.e. it's static and no longer defined in kvm_host.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Enforce the invariant that existing VMCS12 field offsets must not
change. Experience has shown that without strict enforcement, this
invariant will not be maintained.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Changed the code to use BUILD_BUG_ON_MSG instead of better, but GCC 4.6
requiring _Static_assert. - Radim.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Changing the VMCS12 layout will break save/restore compatibility with
older kvm releases once the KVM_{GET,SET}_NESTED_STATE ioctls are
accepted upstream. Google has already been using these ioctls for some
time, and we implore the community not to disturb the existing layout.
Move the four most recently added fields to preserve the offsets of
the previously defined fields and reserve locations for the vmread and
vmwrite bitmaps, which will be used in the virtualization of VMCS
shadowing (to improve the performance of double-nesting).
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Kept the SDM order in vmcs_field_to_offset_table. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
The hypercall was added using a struct timespec based implementation,
but we should not use timespec in new code.
This changes it to timespec64. There is no functional change
here since the implementation is only used in 64-bit kernels
that use the same definition for timespec and timespec64.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
When saving a vCPU's nested state, the vmcs02 is discarded. Only the
shadow vmcs12 is saved. The shadow vmcs12 contains all of the
information needed to reconstruct an equivalent vmcs02 on restore, but
we have to be able to deal with two contexts:
1. The nested state was saved immediately after an emulated VM-entry,
before the vmcs02 was ever launched.
2. The nested state was saved some time after the first successful
launch of the vmcs02.
Though it's an implementation detail rather than an architected bit,
vmx->nested_run_pending serves to distinguish between these two
cases. Hence, we save it as part of the vCPU's nested state. (Yes,
this is ugly.)
Even when restoring from a checkpoint, it may be necessary to build
the vmcs02 as if prepare_vmcs02 was called from nested_vmx_run. So,
the 'from_vmentry' argument should be dropped, and
vmx->nested_run_pending should be consulted instead. The nested state
restoration code then has to set vmx->nested_run_pending prior to
calling prepare_vmcs02. It's important that the restoration code set
vmx->nested_run_pending anyway, since the flag impacts things like
interrupt delivery as well.
Fixes: cf8b84f48a59 ("kvm: nVMX: Prepare for checkpointing L2 state")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
The Hyper-V APIC code is built when CONFIG_HYPERV is enabled but the actual
code in that file is guarded with CONFIG_X86_64. There is no point in doing
this. Neither is there a point in having the CONFIG_HYPERV guard in there
because the containing directory is not built when CONFIG_HYPERV=n.
Further for the hv_init_apic() function a stub is provided only for
CONFIG_HYPERV=n, which is pointless as the callsite is not compiled at
all. But for X86_32 the stub is missing and the build fails.
Clean that up:
- Compile hv_apic.c only when CONFIG_X86_64=y
- Make the stub for hv_init_apic() available when CONFG_X86_64=n
Fixes: 6b48cb5f8347 ("X86/Hyper-V: Enlighten APIC access")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Michael Kelley <mikelley@microsoft.com>
|
|
Not all configurations magically include asm/apic.h, but the Hyper-V code
requires it. Include it explicitely.
Fixes: 6b48cb5f8347 ("X86/Hyper-V: Enlighten APIC access")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Michael Kelley <mikelley@microsoft.com>
|
|
Consolidate the allocation of the hypercall input page.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: olaf@aepfle.de
Cc: sthemmin@microsoft.com
Cc: gregkh@linuxfoundation.org
Cc: jasowang@redhat.com
Cc: Michael.H.Kelley@microsoft.com
Cc: hpa@zytor.com
Cc: apw@canonical.com
Cc: devel@linuxdriverproject.org
Cc: vkuznets@redhat.com
Link: https://lkml.kernel.org/r/20180516215334.6547-5-kys@linuxonhyperv.com
|
|
Consolidate code for converting cpumask to vpset.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: olaf@aepfle.de
Cc: sthemmin@microsoft.com
Cc: gregkh@linuxfoundation.org
Cc: jasowang@redhat.com
Cc: Michael.H.Kelley@microsoft.com
Cc: hpa@zytor.com
Cc: apw@canonical.com
Cc: devel@linuxdriverproject.org
Cc: vkuznets@redhat.com
Link: https://lkml.kernel.org/r/20180516215334.6547-4-kys@linuxonhyperv.com
|
|
Support enhanced IPI enlightenments (to target more than 64 CPUs).
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: olaf@aepfle.de
Cc: sthemmin@microsoft.com
Cc: gregkh@linuxfoundation.org
Cc: jasowang@redhat.com
Cc: Michael.H.Kelley@microsoft.com
Cc: hpa@zytor.com
Cc: apw@canonical.com
Cc: devel@linuxdriverproject.org
Cc: vkuznets@redhat.com
Link: https://lkml.kernel.org/r/20180516215334.6547-3-kys@linuxonhyperv.com
|
|
Hyper-V supports hypercalls to implement IPI; use them.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: olaf@aepfle.de
Cc: sthemmin@microsoft.com
Cc: gregkh@linuxfoundation.org
Cc: jasowang@redhat.com
Cc: Michael.H.Kelley@microsoft.com
Cc: hpa@zytor.com
Cc: apw@canonical.com
Cc: devel@linuxdriverproject.org
Cc: vkuznets@redhat.com
Link: https://lkml.kernel.org/r/20180516215334.6547-2-kys@linuxonhyperv.com
|
|
Hyper-V supports MSR based APIC access; implement
the enlightenment.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Cc: olaf@aepfle.de
Cc: sthemmin@microsoft.com
Cc: gregkh@linuxfoundation.org
Cc: jasowang@redhat.com
Cc: Michael.H.Kelley@microsoft.com
Cc: hpa@zytor.com
Cc: apw@canonical.com
Cc: devel@linuxdriverproject.org
Cc: vkuznets@redhat.com
Link: https://lkml.kernel.org/r/20180516215334.6547-1-kys@linuxonhyperv.com
|
|
Merge misc fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
hfsplus: stop workqueue when fill_super() failed
mm: don't allow deferred pages with NEED_PER_CPU_KM
MAINTAINERS: add Q: entry to kselftest for patchwork project
radix tree: fix multi-order iteration race
radix tree test suite: multi-order iteration race
radix tree test suite: add item_delete_rcu()
radix tree test suite: fix compilation issue
radix tree test suite: fix mapshift build target
include/linux/mm.h: add new inline function vmf_error()
lib/test_bitmap.c: fix bitmap optimisation tests to report errors correctly
|
|
git://git.infradead.org/linux-platform-drivers-x86
Pull x86 platform driver fix from Darren Hart:
"Remove the last of the "select DELL_SMBIOS" references in the Kconfig"
* tag 'platform-drivers-x86-v4.17-3' of git://git.infradead.org/linux-platform-drivers-x86:
platform/x86: DELL_WMI use depends on instead of select for DELL_SMBIOS
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
- a modified revert of a patch that made new choices come out for a
couple stm32 clk drivers that really always need to be there when
that particular machine is compiled in
- boot fix on i.MX for Stefan who noticed odd behavior from the
critical flag patch that came in during the merge window
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: stm32: fix: stm32 clock drivers are not compiled by default
clk: imx6ull: use OSC clock during AXI rate change
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"A bunch of driver bugfixes and a MAINTAINERS addition"
* 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
MAINTAINERS: add entry for STM32 I2C driver
i2c: viperboard: return message count on master_xfer success
i2c: pmcmsp: fix error return from master_xfer
i2c: pmcmsp: return message count on master_xfer success
i2c: designware: fix poll-after-enable regression
eeprom: at24: fix retrieving the at24_chip_data structure
i2c: core: ACPI: Log device not acking errors at dbg loglevel
i2c: core: ACPI: Improve OpRegion read errors
|
|
syzbot is reporting ODEBUG messages at hfsplus_fill_super() [1]. This
is because hfsplus_fill_super() forgot to call cancel_delayed_work_sync().
As far as I can see, it is hfsplus_mark_mdb_dirty() from
hfsplus_new_inode() in hfsplus_fill_super() that calls
queue_delayed_work(). Therefore, I assume that hfsplus_new_inode() does
not fail if queue_delayed_work() was called, and the out_put_hidden_dir
label is the appropriate location to call cancel_delayed_work_sync().
[1] https://syzkaller.appspot.com/bug?id=a66f45e96fdbeb76b796bf46eb25ea878c42a6c9
Link: http://lkml.kernel.org/r/964a8b27-cd69-357c-fe78-76b066056201@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+4f2e5f086147d543ab03@syzkaller.appspotmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
It is unsafe to do virtual to physical translations before mm_init() is
called if struct page is needed in order to determine the memory section
number (see SECTION_IN_PAGE_FLAGS). This is because only in mm_init()
we initialize struct pages for all the allocated memory when deferred
struct pages are used.
My recent fix in commit c9e97a1997 ("mm: initialize pages on demand
during boot") exposed this problem, because it greatly reduced number of
pages that are initialized before mm_init(), but the problem existed
even before my fix, as Fengguang Wu found.
Below is a more detailed explanation of the problem.
We initialize struct pages in four places:
1. Early in boot a small set of struct pages is initialized to fill the
first section, and lower zones.
2. During mm_init() we initialize "struct pages" for all the memory that
is allocated, i.e reserved in memblock.
3. Using on-demand logic when pages are allocated after mm_init call
(when memblock is finished)
4. After smp_init() when the rest free deferred pages are initialized.
The problem occurs if we try to do va to phys translation of a memory
between steps 1 and 2. Because we have not yet initialized struct pages
for all the reserved pages, it is inherently unsafe to do va to phys if
the translation itself requires access of "struct page" as in case of
this combination: CONFIG_SPARSE && !CONFIG_SPARSE_VMEMMAP
The following path exposes the problem:
start_kernel()
trap_init()
setup_cpu_entry_areas()
setup_cpu_entry_area(cpu)
get_cpu_gdt_paddr(cpu)
per_cpu_ptr_to_phys(addr)
pcpu_addr_to_page(addr)
virt_to_page(addr)
pfn_to_page(__pa(addr) >> PAGE_SHIFT)
We disable this path by not allowing NEED_PER_CPU_KM with deferred
struct pages feature.
The problems are discussed in these threads:
http://lkml.kernel.org/r/20180418135300.inazvpxjxowogyge@wfg-t540p.sh.intel.com
http://lkml.kernel.org/r/20180419013128.iurzouiqxvcnpbvz@wfg-t540p.sh.intel.com
http://lkml.kernel.org/r/20180426202619.2768-1-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180515175124.1770-1-pasha.tatashin@oracle.com
Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Dennis Zhou <dennisszhou@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
A new patchwork project is created to track kselftest patches. Update
the kselftest entry in the MAINTAINERS file adding 'Q:' entry:
https://patchwork.kernel.org/project/linux-kselftest/list/
Link: http://lkml.kernel.org/r/20180515164427.12201-1-shuah@kernel.org
Signed-off-by: Shuah Khan (Samsung OSG) <shuah@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fix a race in the multi-order iteration code which causes the kernel to
hit a GP fault. This was first seen with a production v4.15 based
kernel (4.15.6-300.fc27.x86_64) utilizing a DAX workload which used
order 9 PMD DAX entries.
The race has to do with how we tear down multi-order sibling entries
when we are removing an item from the tree. Remember for example that
an order 2 entry looks like this:
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
where 'entry' is in some slot in the struct radix_tree_node, and the
three slots following 'entry' contain sibling pointers which point back
to 'entry.'
When we delete 'entry' from the tree, we call :
radix_tree_delete()
radix_tree_delete_item()
__radix_tree_delete()
replace_slot()
replace_slot() first removes the siblings in order from the first to the
last, then at then replaces 'entry' with NULL. This means that for a
brief period of time we end up with one or more of the siblings removed,
so:
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
This causes an issue if you have a reader iterating over the slots in
the tree via radix_tree_for_each_slot() while only under
rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
mm/filemap.c.
The issue is that when __radix_tree_next_slot() => skip_siblings() tries
to skip over the sibling entries in the slots, it currently does so with
an exact match on the slot directly preceding our current slot.
Normally this works:
V preceding slot
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
^ current slot
This lets you find the first sibling, and you skip them all in order.
But in the case where one of the siblings is NULL, that slot is skipped
and then our sibling detection is interrupted:
V preceding slot
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
^ current slot
This means that the sibling pointers aren't recognized since they point
all the way back to 'entry', so we think that they are normal internal
radix tree pointers. This causes us to think we need to walk down to a
struct radix_tree_node starting at the address of 'entry'.
In a real running kernel this will crash the thread with a GP fault when
you try and dereference the slots in your broken node starting at
'entry'.
We fix this race by fixing the way that skip_siblings() detects sibling
nodes. Instead of testing against the preceding slot we instead look
for siblings via is_sibling_entry() which compares against the position
of the struct radix_tree_node.slots[] array. This ensures that sibling
entries are properly identified, even if they are no longer contiguous
with the 'entry' they point to.
Link: http://lkml.kernel.org/r/20180503192430.7582-6-ross.zwisler@linux.intel.com
Fixes: 148deab223b2 ("radix-tree: improve multiorder iterators")
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: CR, Sapthagirish <sapthagirish.cr@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add a test which shows a race in the multi-order iteration code. This
test reliably hits the race in under a second on my machine, and is the
result of a real bug report against kernel a production v4.15 based
kernel (4.15.6-300.fc27.x86_64). With a real kernel this issue is hit
when using order 9 PMD DAX radix tree entries.
The race has to do with how we tear down multi-order sibling entries
when we are removing an item from the tree. Remember that an order 2
entry looks like this:
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
where 'entry' is in some slot in the struct radix_tree_node, and the
three slots following 'entry' contain sibling pointers which point back
to 'entry.'
When we delete 'entry' from the tree, we call :
radix_tree_delete()
radix_tree_delete_item()
__radix_tree_delete()
replace_slot()
replace_slot() first removes the siblings in order from the first to the
last, then at then replaces 'entry' with NULL. This means that for a
brief period of time we end up with one or more of the siblings removed,
so:
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
This causes an issue if you have a reader iterating over the slots in
the tree via radix_tree_for_each_slot() while only under
rcu_read_lock()/rcu_read_unlock() protection. This is a common case in
mm/filemap.c.
The issue is that when __radix_tree_next_slot() => skip_siblings() tries
to skip over the sibling entries in the slots, it currently does so with
an exact match on the slot directly preceding our current slot.
Normally this works:
V preceding slot
struct radix_tree_node.slots[] = [entry][sibling][sibling][sibling]
^ current slot
This lets you find the first sibling, and you skip them all in order.
But in the case where one of the siblings is NULL, that slot is skipped
and then our sibling detection is interrupted:
V preceding slot
struct radix_tree_node.slots[] = [entry][NULL][sibling][sibling]
^ current slot
This means that the sibling pointers aren't recognized since they point
all the way back to 'entry', so we think that they are normal internal
radix tree pointers. This causes us to think we need to walk down to a
struct radix_tree_node starting at the address of 'entry'.
In a real running kernel this will crash the thread with a GP fault when
you try and dereference the slots in your broken node starting at
'entry'.
In the radix tree test suite this will be caught by the address
sanitizer:
==27063==ERROR: AddressSanitizer: heap-buffer-overflow on address
0x60c0008ae400 at pc 0x00000040ce4f bp 0x7fa89b8fcad0 sp 0x7fa89b8fcac0
READ of size 8 at 0x60c0008ae400 thread T3
#0 0x40ce4e in __radix_tree_next_slot /home/rzwisler/project/linux/tools/testing/radix-tree/radix-tree.c:1660
#1 0x4022cc in radix_tree_next_slot linux/../../../../include/linux/radix-tree.h:567
#2 0x4022cc in iterator_func /home/rzwisler/project/linux/tools/testing/radix-tree/multiorder.c:655
#3 0x7fa8a088d50a in start_thread (/lib64/libpthread.so.0+0x750a)
#4 0x7fa8a03bd16e in clone (/lib64/libc.so.6+0xf516e)
Link: http://lkml.kernel.org/r/20180503192430.7582-5-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently the lifetime of "struct item" entries in the radix tree are
not controlled by RCU, but are instead deleted inline as they are
removed from the tree.
In the following patches we add a test which has threads iterating over
items pulled from the tree and verifying them in an
rcu_read_lock()/rcu_read_unlock() section. This means that though an
item has been removed from the tree it could still be being worked on by
other threads until the RCU grace period expires. So, we need to
actually free the "struct item" structures at the end of the grace
period, just as we do with "struct radix_tree_node" items.
Link: http://lkml.kernel.org/r/20180503192430.7582-4-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pulled from a patch from Matthew Wilcox entitled "xarray: Add definition
of struct xarray":
> From: Matthew Wilcox <mawilcox@microsoft.com>
> Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
https://patchwork.kernel.org/patch/10341249/
These defines fix this compilation error:
In file included from ./linux/radix-tree.h:6:0,
from ./linux/../../../../include/linux/idr.h:15,
from ./linux/idr.h:1,
from idr.c:4:
./linux/../../../../include/linux/idr.h: In function `idr_init_base':
./linux/../../../../include/linux/radix-tree.h:129:2: warning: implicit declaration of function `spin_lock_init'; did you mean `spinlock_t'? [-Wimplicit-function-declaration]
spin_lock_init(&(root)->xa_lock); \
^
./linux/../../../../include/linux/idr.h:126:2: note: in expansion of macro `INIT_RADIX_TREE'
INIT_RADIX_TREE(&idr->idr_rt, IDR_RT_MARKER);
^~~~~~~~~~~~~~~
by providing a spin_lock_init() wrapper for the v4.17-rc* version of the
radix tree test suite.
Link: http://lkml.kernel.org/r/20180503192430.7582-3-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit c6ce3e2fe3da ("radix tree test suite: Add config option for map
shift") introduced a phony makefile target called 'mapshift' that ends
up generating the file generated/map-shift.h. This phony target was
then added as a dependency of the top level 'targets' build target,
which is what is run when you go to tools/testing/radix-tree and just
type 'make'.
Unfortunately, this phony target doesn't actually work as a dependency,
so you end up getting:
$ make
make: *** No rule to make target 'generated/map-shift.h', needed by 'main.o'. Stop.
make: *** Waiting for unfinished jobs....
Fix this by making the file generated/map-shift.h our real makefile
target, and add this a dependency of the top level build target.
Link: http://lkml.kernel.org/r/20180503192430.7582-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: CR, Sapthagirish <sapthagirish.cr@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Many places in drivers/ file systems, error was handled in a common way
like below:
ret = (ret == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
vmf_error() will replace this and return vm_fault_t type err.
A lot of drivers and filesystems currently have a rather complex mapping
of errno-to-VM_FAULT code. We have been able to eliminate a lot of it
by just returning VM_FAULT codes directly from functions which are
called exclusively from the fault handling path.
Some functions can be called both from the fault handler and other
context which are expecting an errno, so they have to continue to return
an errno. Some users still need to choose different behaviour for
different errnos, but vmf_error() captures the essential error
translation that's common to all users, and those that need to handle
additional errors can handle them first.
Link: http://lkml.kernel.org/r/20180510174826.GA14268@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
I had neglected to increment the error counter when the tests failed,
which made the tests noisy when they fail, but not actually return an
error code.
Link: http://lkml.kernel.org/r/20180509114328.9887-1-mpe@ellerman.id.au
Fixes: 3cc78125a081 ("lib/test_bitmap.c: add optimisation tests")
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Yury Norov <ynorov@caviumnetworks.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: <stable@vger.kernel.org> [4.13+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If DELL_WMI "select"s DELL_SMBIOS, the DELL_SMBIOS dependencies are
ignored and it is still possible to end up with unmet direct
dependencies.
Change the select to a depends on.
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Darren Hart (VMware) <dvhart@infradead.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Just three commits.
The two cxl ones are not fixes per se, but they modify code that was
added this cycle so that it will work with a recent firmware change.
And then a fix for a recent commit that added sleeps in the NVRAM
code, which needs to be more careful and not sleep if eg. we're called
in the panic() path.
Thanks to Nicholas Piggin, Philippe Bergheaud, Christophe Lombard"
* tag 'powerpc-4.17-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/powernv: Fix NVRAM sleep in invalid context when crashing
cxl: Report the tunneled operations status
cxl: Set the PBCQ Tunnel BAR register when enabling capi mode
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"Fix an ACPICA regression introduced in this cycle and related to the
handling of package objects loaded by the Load and loadTable AML
operators that are not initialized properly after recent changes (Bob
Moore)"
* tag 'acpi-4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPICA: Add deferred package support for the Load and loadTable operators
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fix from Rafael Wysocki:
"Fix Kconfig dependencies of the armada-37xx cpufreq driver (Miquel
Raynal)"
* tag 'pm-4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: armada-37xx: driver relies on cpufreq-dt
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB driver fixes fro 4.17-rc6.
They resolve some reported bugs in the musb driver, the xhci driver,
and a number of small fixes for the usbip driver.
All of these have been in linux-next with no reported issues"
* tag 'usb-4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usbip: usbip_host: fix bad unlock balance during stub_probe()
usbip: usbip_host: fix NULL-ptr deref and use-after-free errors
usbip: usbip_host: run rebind from exit when module is removed
usbip: usbip_host: delete device from busid_table after rebind
usbip: usbip_host: refine probe and disconnect debug msgs to be useful
usb: musb: fix remote wakeup racing with suspend
xhci: Fix USB3 NULL pointer dereference at logical disconnect.
|
|
Pull block fix from Jens Axboe:
"Single fix this time, from Coly, fixing a failure case when
CONFIG_DEBUGFS isn't enabled"
* tag 'for-linus-20180518' of git://git.kernel.dk/linux-block:
bcache: return 0 from bch_debug_init() if CONFIG_DEBUG_FS=n
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of fixes accumilated since the merge window, all
fairly small and driver specific"
* tag 'spi-fix-v4.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: bcm2835aux: ensure interrupts are enabled for shared handler
spi: bcm-qspi: Always read and set BSPI_MAST_N_BOOT_CTRL
spi: bcm-qspi: Avoid setting MSPI_CDRAM_PCS for spi-nor master
spi: pxa2xx: Allow 64-bit DMA
spi: cadence: Add usleep_range() for cdns_spi_fill_tx_fifo()
spi: sh-msiof: Fix bit field overflow writes to TSCR/RSCR
spi: imx: Update MODULE_DESCRIPTION to "SPI Controller driver"
|
|
Pull mtd fixes from Boris Brezillon:
"NAND fixes:
- Fix read path of the Marvell NAND driver
- Make sure we don't pass a u64 to ndelay()
CFI fix:
- Fix the map_word_andequal() implementation"
* tag 'mtd/fixes-for-4.17-rc6' of git://git.infradead.org/linux-mtd:
mtd: rawnand: Fix return type of __DIVIDE() when called with 32-bit
mtd: rawnand: marvell: Fix read logic for layouts with ->nchunks > 2
mtd: Fix comparison in map_word_andequal()
|
|
git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Pretty quiet week again: one vmwgfx regression fix, one core buffer
overflow fix, one vc4 leak fix and three i915 fixes"
* tag 'drm-fixes-for-v4.17-rc6' of git://people.freedesktop.org/~airlied/linux:
drm/dumb-buffers: Integer overflow in drm_mode_create_ioctl()
drm/i915/gen9: Add WaClearHIZ_WM_CHICKEN3 for bxt and glk
drm/vmwgfx: Set dmabuf_size when vmw_dmabuf_init is successful
drm/vc4: Fix leak of the file_priv that stored the perfmon.
drm/i915/execlists: Use rmb() to order CSB reads
drm/i915/userptr: reject zero user_size
drm: Match sysfs name in link removal to link creation
|
|
git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
- Userptr IOCTL zero size check (Matt)
- Two hardware quirk fixes (Michel & Chris)
* tag 'drm-intel-fixes-2018-05-17' of git://anongit.freedesktop.org/drm/drm-intel:
drm/i915/gen9: Add WaClearHIZ_WM_CHICKEN3 for bxt and glk
drm/i915/execlists: Use rmb() to order CSB reads
drm/i915/userptr: reject zero user_size
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
"Two k10temp fixes:
- fix race condition when accessing System Management Network
registers
- fix reading critical temperatures on F15h M60h and M70h
Also add PCI ID's for the AMD Raven Ridge root bridge"
* tag 'hwmon-for-linus-v4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (k10temp) Use API function to access System Management Network
x86/amd_nb: Add support for Raven Ridge CPUs
hwmon: (k10temp) Fix reading critical temperature register
|
|
Pull kvm fixes from Paolo Bonzini:
- ARM/ARM64 locking fixes
- x86 fixes: PCID, UMIP, locking
- improved support for recent Windows version that have a 2048 Hz APIC
timer
- rename KVM_HINTS_DEDICATED CPUID bit to KVM_HINTS_REALTIME
- better behaved selftests
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: rename KVM_HINTS_DEDICATED to KVM_HINTS_REALTIME
KVM: arm/arm64: VGIC/ITS save/restore: protect kvm_read_guest() calls
KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock
KVM: arm/arm64: VGIC/ITS: Promote irq_lock() in update_affinity
KVM: arm/arm64: Properly protect VGIC locks from IRQs
KVM: X86: Lower the default timer frequency limit to 200us
KVM: vmx: update sec exec controls for UMIP iff emulating UMIP
kvm: x86: Suppress CR3_PCID_INVD bit only when PCIDs are enabled
KVM: selftests: exit with 0 status code when tests cannot be run
KVM: hyperv: idr_find needs RCU protection
x86: Delay skip of emulated hypercall instruction
KVM: Extend MAX_IRQ_ROUTES to 4096 for all archs
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"We have a core fix in the compat code for covering a potential race
(double references), but it's a very minor change.
The rest are all small device-specific quirks, as well as a correction
of the new UAC3 support code"
* tag 'sound-4.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Use Class Specific EP for UAC3 devices.
ALSA: hda/realtek - Clevo P950ER ALC1220 Fixup
ALSA: usb: mixer: volume quirk for CM102-A+/102S+
ALSA: hda: Add Lenovo C50 All in one to the power_save blacklist
ALSA: control: fix a redundant-copy issue
|
|
KVM_HINTS_DEDICATED seems to be somewhat confusing:
Guest doesn't really care whether it's the only task running on a host
CPU as long as it's not preempted.
And there are more reasons for Guest to be preempted than host CPU
sharing, for example, with memory overcommit it can get preempted on a
memory access, post copy migration can cause preemption, etc.
Let's call it KVM_HINTS_REALTIME which seems to better
match what guests expect.
Also, the flag most be set on all vCPUs - current guests assume this.
Note so in the documentation.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Martin Schwidefsky:
- a fix for the vfio ccw translation code
- update an incorrect email address in the MAINTAINERS file
- fix a division by zero oops in the cpum_sf code found by trinity
- two fixes for the error handling of the qdio code
- several spectre related patches to convert all left-over indirect
branches in the kernel to expoline branches
- update defconfigs to avoid warnings due to the netfilter Kconfig
changes
- avoid several compiler warnings in the kexec_file code for s390
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/qdio: don't release memory in qdio_setup_irq()
s390/qdio: fix access to uninitialized qdio_q fields
s390/cpum_sf: ensure sample frequency of perf event attributes is non-zero
s390: use expoline thunks in the BPF JIT
s390: extend expoline to BC instructions
s390: remove indirect branch from do_softirq_own_stack
s390: move spectre sysfs attribute code
s390/kernel: use expoline for indirect branches
s390/ftrace: use expoline for indirect branches
s390/lib: use expoline for indirect branches
s390/crc32-vx: use expoline for indirect branches
s390: move expoline assembler macros to a header
vfio: ccw: fix cleanup if cp_prefetch fails
s390/kexec_file: add declaration of purgatory related globals
s390: update defconfigs
MAINTAINERS: update s390 zcrypt maintainers email address
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux
Pull SELinux fixes from Paul Moore:
"A small pull request to fix a few regressions in the SELinux/SCTP code
with applications that call bind() with AF_UNSPEC/INADDR_ANY.
The individual commit descriptions have more information, but the
commits themselves should be self explanatory"
* tag 'selinux-pr-20180516' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
selinux: correctly handle sa_family cases in selinux_sctp_bind_connect()
selinux: fix address family in bind() and connect() to match address/port
selinux: add AF_UNSPEC and INADDR_ANY checks to selinux_socket_bind()
|
|
proc_pid_cmdline_read() and environ_read() directly access the target
process' VM to retrieve the command line and environment. If this
process remaps these areas onto a file via mmap(), the requesting
process may experience various issues such as extra delays if the
underlying device is slow to respond.
Let's simply refuse to access file-backed areas in these functions.
For this we add a new FOLL_ANON gup flag that is passed to all calls
to access_remote_vm(). The code already takes care of such failures
(including unmapped areas). Accesses via /proc/pid/mem were not
changed though.
This was assigned CVE-2018-1120.
Note for stable backports: the patch may apply to kernels prior to 4.11
but silently miss one location; it must be checked that no call to
access_remote_vm() keeps zero as the last argument.
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|