summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2023-01-12x86/mm: fix poking_init() for Xen PV guestsJuergen Gross1-0/+4
Commit 3f4c8211d982 ("x86/mm: Use mm_alloc() in poking_init()") broke the kernel for running as Xen PV guest. It seems as if the new address space is never activated before being used, resulting in Xen rejecting to accept the new CR3 value (the PGD isn't pinned). Fix that by adding the now missing call of paravirt_arch_dup_mmap() to poking_init(). That call was previously done by dup_mm()->dup_mmap() and it is a NOP for all cases but for Xen PV, where it is just doing the pinning of the PGD. Fixes: 3f4c8211d982 ("x86/mm: Use mm_alloc() in poking_init()") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230109150922.10578-1-jgross@suse.com
2023-01-12KVM: x86/xen: Avoid deadlock by adding kvm->arch.xen.xen_lock leaf node lockDavid Woodhouse2-37/+31
In commit 14243b387137a ("KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery") the clever version of me left some helpful notes for those who would come after him: /* * For the irqfd workqueue, using the main kvm->lock mutex is * fine since this function is invoked from kvm_set_irq() with * no other lock held, no srcu. In future if it will be called * directly from a vCPU thread (e.g. on hypercall for an IPI) * then it may need to switch to using a leaf-node mutex for * serializing the shared_info mapping. */ mutex_lock(&kvm->lock); In commit 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") the other version of me ran straight past that comment without reading it, and introduced a potential deadlock by taking vcpu->mutex and kvm->lock in the wrong order. Solve this as originally suggested, by adding a leaf-node lock in the Xen state rather than using kvm->lock for it. Fixes: 2fd6df2f2b47 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20230111180651.14394-4-dwmw2@infradead.org> [Rebase, add docs. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-12x86/pci: Simplify is_mmconf_reserved() messagesBjorn Helgaas1-6/+7
is_mmconf_reserved() takes a "with_e820" parameter that only determines the message logged if it finds the MMCONFIG region is reserved. Pass the message directly, which will simplify a future patch that adds a new way of looking for that reservation. No functional change intended. Link: https://lore.kernel.org/r/20230110180243.1590045-2-helgaas@kernel.org Tested-by: Tony Luck <tony.luck@intel.com> Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com> Tested-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
2023-01-11KVM: x86/xen: Fix potential deadlock in kvm_xen_update_runstate_guest()David Woodhouse1-2/+17
The kvm_xen_update_runstate_guest() function can be called when the vCPU is being scheduled out, from a preempt notifier. It *opportunistically* updates the runstate area in the guest memory, if the gfn_to_pfn_cache which caches the appropriate address is still valid. If there is *contention* when it attempts to obtain gpc->lock, then locking inside the priority inheritance checks may cause a deadlock. Lockdep reports: [13890.148997] Chain exists of: &gpc->lock --> &p->pi_lock --> &rq->__lock [13890.149002] Possible unsafe locking scenario: [13890.149003] CPU0 CPU1 [13890.149004] ---- ---- [13890.149005] lock(&rq->__lock); [13890.149007] lock(&p->pi_lock); [13890.149009] lock(&rq->__lock); [13890.149011] lock(&gpc->lock); [13890.149013] *** DEADLOCK *** In the general case, if there's contention for a read lock on gpc->lock, that's going to be because something else is either invalidating or revalidating the cache. Either way, we've raced with seeing it in an invalid state, in which case we would have aborted the opportunistic update anyway. So in the 'atomic' case when called from the preempt notifier, just switch to using read_trylock() and avoid the PI handling altogether. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20230111180651.14394-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-11KVM: x86/xen: Fix lockdep warning on "recursive" gpc lockingDavid Woodhouse1-1/+3
In commit 5ec3289b31 ("KVM: x86/xen: Compatibility fixes for shared runstate area") we declared it safe to obtain two gfn_to_pfn_cache locks at the same time: /* * The guest's runstate_info is split across two pages and we * need to hold and validate both GPCs simultaneously. We can * declare a lock ordering GPC1 > GPC2 because nothing else * takes them more than one at a time. */ However, we forgot to tell lockdep. Do so, by setting a subclass on the first lock before taking the second. Fixes: 5ec3289b31 ("KVM: x86/xen: Compatibility fixes for shared runstate area") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20230111180651.14394-1-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-10x86/resctrl: Fix event counts regression in reused RMIDsPeter Newman1-16/+33
When creating a new monitoring group, the RMID allocated for it may have been used by a group which was previously removed. In this case, the hardware counters will have non-zero values which should be deducted from what is reported in the new group's counts. resctrl_arch_reset_rmid() initializes the prev_msr value for counters to 0, causing the initial count to be charged to the new group. Resurrect __rmid_read() and use it to initialize prev_msr correctly. Unlike before, __rmid_read() checks for error bits in the MSR read so that callers don't need to. Fixes: 1d81d15db39c ("x86/resctrl: Move mbm_overflow_count() into resctrl_arch_rmid_read()") Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20221220164132.443083-1-peternewman@google.com
2023-01-10x86/resctrl: Fix task CLOSID/RMID update racePeter Newman1-1/+11
When the user moves a running task to a new rdtgroup using the task's file interface or by deleting its rdtgroup, the resulting change in CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the task(s) CPUs. x86 allows reordering loads with prior stores, so if the task starts running between a task_curr() check that the CPU hoisted before the stores in the CLOSID/RMID update then it can start running with the old CLOSID/RMID until it is switched again because __rdtgroup_move_task() failed to determine that it needs to be interrupted to obtain the new CLOSID/RMID. Refer to the diagram below: CPU 0 CPU 1 ----- ----- __rdtgroup_move_task(): curr <- t1->cpu->rq->curr __schedule(): rq->curr <- t1 resctrl_sched_in(): t1->{closid,rmid} -> {1,1} t1->{closid,rmid} <- {2,2} if (curr == t1) // false IPI(t1->cpu) A similar race impacts rdt_move_group_tasks(), which updates tasks in a deleted rdtgroup. In both cases, use smp_mb() to order the task_struct::{closid,rmid} stores before the loads in task_curr(). In particular, in the rdt_move_group_tasks() case, simply execute an smp_mb() on every iteration with a matching task. It is possible to use a single smp_mb() in rdt_move_group_tasks(), but this would require two passes and a means of remembering which task_structs were updated in the first loop. However, benchmarking results below showed too little performance impact in the simple approach to justify implementing the two-pass approach. Times below were collected using `perf stat` to measure the time to remove a group containing a 1600-task, parallel workload. CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads) # mkdir /sys/fs/resctrl/test # echo $$ > /sys/fs/resctrl/test/tasks # perf bench sched messaging -g 40 -l 100000 task-clock time ranges collected using: # perf stat rmdir /sys/fs/resctrl/test Baseline: 1.54 - 1.60 ms smp_mb() every matching task: 1.57 - 1.67 ms [ bp: Massage commit message. ] Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR") Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount") Signed-off-by: Peter Newman <peternewman@google.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Babu Moger <babu.moger@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20221220161123.432120-1-peternewman@google.com
2023-01-10x86/pat: Fix pat_x_mtrr_type() for MTRR disabled caseJuergen Gross1-1/+2
Since 72cbc8f04fe2 ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen") PAT can be enabled without MTRR. This has resulted in problems e.g. for a SEV-SNP guest running under Hyper-V, when trying to establish a new mapping via memremap() with WB caching mode, as pat_x_mtrr_type() will call mtrr_type_lookup(), which in turn is returning MTRR_TYPE_INVALID due to MTRR being disabled in this configuration. The result is a mapping with UC- caching, leading to severe performance degradation. Fix that by handling MTRR_TYPE_INVALID the same way as MTRR_TYPE_WRBACK in pat_x_mtrr_type() because MTRR_TYPE_INVALID means MTRRs are disabled. [ bp: Massage commit message. ] Fixes: 72cbc8f04fe2 ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen") Reported-by: Michael Kelley (LINUX) <mikelley@microsoft.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Michael Kelley <mikelley@microsoft.com> Tested-by: Michael Kelley <mikelley@microsoft.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230110065427.20767-1-jgross@suse.com
2023-01-10x86/boot: Avoid using Intel mnemonics in AT&T syntax asmPeter Zijlstra1-2/+2
With 'GNU assembler (GNU Binutils for Debian) 2.39.90.20221231' the build now reports: arch/x86/realmode/rm/../../boot/bioscall.S: Assembler messages: arch/x86/realmode/rm/../../boot/bioscall.S:35: Warning: found `movsd'; assuming `movsl' was meant arch/x86/realmode/rm/../../boot/bioscall.S:70: Warning: found `movsd'; assuming `movsl' was meant arch/x86/boot/bioscall.S: Assembler messages: arch/x86/boot/bioscall.S:35: Warning: found `movsd'; assuming `movsl' was meant arch/x86/boot/bioscall.S:70: Warning: found `movsd'; assuming `movsl' was meant Which is due to: PR gas/29525 Note that with the dropped CMPSD and MOVSD Intel Syntax string insn templates taking operands, mixed IsString/non-IsString template groups (with memory operands) cannot occur anymore. With that maybe_adjust_templates() becomes unnecessary (and is hence being removed). More details: https://sourceware.org/bugzilla/show_bug.cgi?id=29525 Borislav Petkov further explains: " the particular problem here is is that the 'd' suffix is "conflicting" in the sense that you can have SSE mnemonics like movsD %xmm... and the same thing also for string ops (which is the case here) so apparently the agreement in binutils land is to use the always accepted suffixes 'l' or 'q' and phase out 'd' slowly... " Fixes: 7a734e7dd93b ("x86, setup: "glove box" BIOS calls -- infrastructure") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/Y71I3Ex2pvIxMpsP@hirez.programming.kicks-ass.net
2023-01-09perf/x86/intel/uncore: Add Emerald RapidsKan Liang1-0/+1
From the perspective of the uncore PMU, the new Emerald Rapids is the same as the Sapphire Rapids. The only difference is the event list, which will be supported in the perf tool later. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230106160449.3566477-4-kan.liang@linux.intel.com
2023-01-09perf/x86/msr: Add Emerald RapidsKan Liang1-0/+1
The same as Sapphire Rapids, the SMI_COUNT MSR is also supported on Emerald Rapids. Add Emerald Rapids model. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230106160449.3566477-3-kan.liang@linux.intel.com
2023-01-09perf/x86/msr: Add Meteor Lake supportKan Liang1-0/+2
Meteor Lake is Intel's successor to Raptor lake. PPERF and SMI_COUNT MSRs are also supported. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/20230104201349.1451191-7-kan.liang@linux.intel.com
2023-01-09perf/x86/cstate: Add Meteor Lake supportKan Liang1-9/+12
Meteor Lake is Intel's successor to Raptor lake. From the perspective of Intel cstate residency counters, there is nothing changed compared with Raptor lake. Share adl_cstates with Raptor lake. Update the comments for Meteor Lake. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Link: https://lore.kernel.org/r/20230104201349.1451191-6-kan.liang@linux.intel.com
2023-01-09KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUIDPaolo Bonzini1-16/+16
Passing the host topology to the guest is almost certainly wrong and will confuse the scheduler. In addition, several fields of these CPUID leaves vary on each processor; it is simply impossible to return the right values from KVM_GET_SUPPORTED_CPUID in such a way that they can be passed to KVM_SET_CPUID2. The values that will most likely prevent confusion are all zeroes. Userspace will have to override it anyway if it wishes to present a specific topology to the guest. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-09KVM: nSVM: clarify recalc_intercepts() wrt CR8Paolo Bonzini1-7/+5
The mysterious comment "We only want the cr8 intercept bits of L1" dates back to basically the introduction of nested SVM, back when the handling of "less typical" hypervisors was very haphazard. With the development of kvm-unit-tests for interrupt handling, the same code grew another vmcb_clr_intercept for the interrupt window (VINTR) vmexit, this time with a comment that is at least decent. It turns out however that the same comment applies to the CR8 write intercept, which is also a "recheck if an interrupt should be injected" intercept. The CR8 read intercept instead has not been used by KVM for 14 years (commit 649d68643ebf, "KVM: SVM: sync TPR value to V_TPR field in the VMCB"), so do not bother clearing it and let one comment describe both CR8 write and VINTR handling. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-01-09x86/xen: Remove the unused function p2m_index()Jiapeng Chong1-5/+0
The function p2m_index is defined in the p2m.c file, but not called elsewhere, so remove this unused function. arch/x86/xen/p2m.c:137:24: warning: unused function 'p2m_index'. Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=3557 Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20230105090141.36248-1-jiapeng.chong@linux.alibaba.com Signed-off-by: Juergen Gross <jgross@suse.com>
2023-01-06Merge tag 'perf-urgent-2023-01-06' of ↵Linus Torvalds1-0/+5
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fix from Ingo Molnar: "Intel RAPL updates for new model IDs" * tag 'perf-urgent-2023-01-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/rapl: Add support for Intel Emerald Rapids perf/x86/rapl: Add support for Intel Meteor Lake perf/x86/rapl: Treat Tigerlake like Icelake
2023-01-05elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size}Catalin Marinas1-2/+2
A subsequent fix for arm64 will use this parameter to parse the vma information from the snapshot created by dump_vma_snapshot() rather than traversing the vma list without the mmap_lock. Fixes: 6dd8b1a0b6cb ("arm64: mte: Dump the MTE tags in the core file") Cc: <stable@vger.kernel.org> # 5.18.x Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Seth Jenkins <sethjenkins@google.com> Suggested-by: Seth Jenkins <sethjenkins@google.com> Cc: Will Deacon <will@kernel.org> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221222181251.1345752-3-catalin.marinas@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2023-01-04perf/x86/rapl: Add support for Intel Emerald RapidsZhang Rui1-0/+1
Emerald Rapids RAPL support is the same as previous Sapphire Rapids. Add Emerald Rapids model for RAPL. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230104145831.25498-2-rui.zhang@intel.com
2023-01-04perf/x86/rapl: Add support for Intel Meteor LakeZhang Rui1-0/+2
Meteor Lake RAPL support is the same as previous Sky Lake. Add Meteor Lake model for RAPL. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230104145831.25498-1-rui.zhang@intel.com
2023-01-04x86/bugs: Flush IBP in ib_prctl_set()Rodrigo Branco1-0/+2
We missed the window between the TIF flag update and the next reschedule. Signed-off-by: Rodrigo Branco <bsdaemon@google.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org>
2023-01-03perf/x86/rapl: Treat Tigerlake like IcelakeChris Wilson1-0/+2
Since Tigerlake seems to have inherited its cstates and other RAPL power caps from Icelake, assume it also follows Icelake for its RAPL events. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Zhang Rui <rui.zhang@intel.com> Link: https://lore.kernel.org/r/20221228113454.1199118-1-rodrigo.vivi@intel.com
2023-01-03x86/insn: Avoid namespace clash by separating instruction decoder MMIO type ↵Jason A. Donenfeld4-41/+41
from MMIO trace type Both <linux/mmiotrace.h> and <asm/insn-eval.h> define various MMIO_ enum constants, whose namespace overlaps. Rename the <asm/insn-eval.h> ones to have a INSN_ prefix, so that the headers can be used from the same source file. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20230101162910.710293-2-Jason@zx2c4.com
2023-01-03x86/asm: Fix an assembler warning with current binutilsMikulas Patocka1-1/+1
Fix a warning: "found `movsd'; assuming `movsl' was meant" Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: linux-kernel@vger.kernel.org
2023-01-02x86/kexec: Fix double-free of elf header bufferTakashi Iwai1-3/+1
After b3e34a47f989 ("x86/kexec: fix memory leak of elf header buffer"), freeing image->elf_headers in the error path of crash_load_segments() is not needed because kimage_file_post_load_cleanup() will take care of that later. And not clearing it could result in a double-free. Drop the superfluous vfree() call at the error path of crash_load_segments(). Fixes: b3e34a47f989 ("x86/kexec: fix memory leak of elf header buffer") Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20221122115122.13937-1-tiwai@suse.de
2023-01-01Merge tag 'perf_urgent_for_v6.2_rc2' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Borislav Petkov: - Pass only an initialized perf event attribute to the LSM hook - Fix a use-after-free on the perf syscall's error path - A potential integer overflow fix in amd_core_pmu_init() - Fix the cgroup events tracking after the context handling rewrite - Return the proper value from the inherit_event() function on error * tag 'perf_urgent_for_v6.2_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Call LSM hook after copying perf_event_attr perf: Fix use-after-free in error path perf/x86/amd: fix potential integer overflow on shift of a int perf/core: Fix cgroup events tracking perf core: Return error pointer if inherit_event() fails to find pmu_ctx
2023-01-01Merge tag 'x86_urgent_for_v6.2_rc2' of ↵Linus Torvalds3-25/+17
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Two fixes to correct how kprobes handles INT3 now that they're added by other functionality like the rethunks and not only kgdb - Remove __init section markings of two functions which are referenced by a function in the .text section * tag 'x86_urgent_for_v6.2_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kprobes: Fix optprobe optimization check with CONFIG_RETHUNK x86/kprobes: Fix kprobes instruction boudary check with CONFIG_RETHUNK x86/calldepth: Fix incorrect init section references
2022-12-28Merge branch 'kvm-late-6.1-fixes' into HEADPaolo Bonzini11-115/+164
x86: * several fixes to nested VMX execution controls * fixes and clarification to the documentation for Xen emulation * do not unnecessarily release a pmu event with zero period * MMU fixes * fix Coverity warning in kvm_hv_flush_tlb() selftests: * fixes for the ucall mechanism in selftests * other fixes mostly related to compilation with clang
2022-12-28KVM: x86: fix deadlock for KVM_XEN_EVTCHN_RESETPaolo Bonzini1-3/+27
While KVM_XEN_EVTCHN_RESET is usually called with no vCPUs running, if that happened it could cause a deadlock. This is due to kvm_xen_eventfd_reset() doing a synchronize_srcu() inside a kvm->lock critical section. To avoid this, first collect all the evtchnfd objects in an array and free all of them once the kvm->lock critical section is over and th SRCU grace period has expired. Reported-by: Michal Luczaj <mhal@rbox.co> Cc: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-27x86/kprobes: Fix optprobe optimization check with CONFIG_RETHUNKMasami Hiramatsu (Google)1-20/+8
Since the CONFIG_RETHUNK and CONFIG_SLS will use INT3 for stopping speculative execution after function return, kprobe jump optimization always fails on the functions with such INT3 inside the function body. (It already checks the INT3 padding between functions, but not inside the function) To avoid this issue, as same as kprobes, check whether the INT3 comes from kgdb or not, and if so, stop decoding and make it fail. The other INT3 will come from CONFIG_RETHUNK/CONFIG_SLS and those can be treated as a one-byte instruction. Fixes: e463a09af2f0 ("x86: Add straight-line-speculation mitigation") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/167146051929.1374301.7419382929328081706.stgit@devnote3
2022-12-27x86/kprobes: Fix kprobes instruction boudary check with CONFIG_RETHUNKMasami Hiramatsu (Google)1-3/+7
Since the CONFIG_RETHUNK and CONFIG_SLS will use INT3 for stopping speculative execution after RET instruction, kprobes always failes to check the probed instruction boundary by decoding the function body if the probed address is after such sequence. (Note that some conditional code blocks will be placed after function return, if compiler decides it is not on the hot path.) This is because kprobes expects kgdb puts the INT3 as a software breakpoint and it will replace the original instruction. But these INT3 are not such purpose, it doesn't need to recover the original instruction. To avoid this issue, kprobes checks whether the INT3 is owned by kgdb or not, and if so, stop decoding and make it fail. The other INT3 will come from CONFIG_RETHUNK/CONFIG_SLS and those can be treated as a one-byte instruction. Fixes: e463a09af2f0 ("x86: Add straight-line-speculation mitigation") Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/167146051026.1374301.392728975473572291.stgit@devnote3
2022-12-27x86/calldepth: Fix incorrect init section referencesArnd Bergmann1-2/+2
The addition of callthunks_translate_call_dest means that skip_addr() and patch_dest() can no longer be discarded as part of the __init section freeing: WARNING: modpost: vmlinux.o: section mismatch in reference: callthunks_translate_call_dest.cold (section: .text.unlikely) -> skip_addr (section: .init.text) WARNING: modpost: vmlinux.o: section mismatch in reference: callthunks_translate_call_dest.cold (section: .text.unlikely) -> patch_dest (section: .init.text) WARNING: modpost: vmlinux.o: section mismatch in reference: is_callthunk.cold (section: .text.unlikely) -> skip_addr (section: .init.text) ERROR: modpost: Section mismatches detected. Set CONFIG_SECTION_MISMATCH_WARN_ONLY=y to allow them. Fixes: b2e9dfe54be4 ("x86/bpf: Emit call depth accounting if required") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221215164334.968863-1-arnd@kernel.org
2022-12-27perf/x86/amd: fix potential integer overflow on shift of a intColin Ian King1-1/+1
The left shift of int 32 bit integer constant 1 is evaluated using 32 bit arithmetic and then passed as a 64 bit function argument. In the case where i is 32 or more this can lead to an overflow. Avoid this by shifting using the BIT_ULL macro instead. Fixes: 471af006a747 ("perf/x86/amd: Constrain Large Increment per Cycle events") Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Ian Rogers <irogers@google.com> Acked-by: Kim Phillips <kim.phillips@amd.com> Link: https://lore.kernel.org/r/20221202135149.1797974-1-colin.i.king@gmail.com
2022-12-27KVM: x86/xen: Add KVM_XEN_INVALID_GPA and KVM_XEN_INVALID_GFN to uapiDavid Woodhouse1-7/+7
These are (uint64_t)-1 magic values are a userspace ABI, allowing the shared info pages and other enlightenments to be disabled. This isn't a Xen ABI because Xen doesn't let the guest turn these off except with the full SHUTDOWN_soft_reset mechanism. Under KVM, the userspace VMM is expected to handle soft reset, and tear down the kernel parts of the enlightenments accordingly. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-5-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-27KVM: x86/xen: Simplify eventfd IOCTLsMichal Luczaj1-7/+1
Port number is validated in kvm_xen_setattr_evtchn(). Remove superfluous checks in kvm_xen_eventfd_assign() and kvm_xen_eventfd_update(). Signed-off-by: Michal Luczaj <mhal@rbox.co> Message-Id: <20221222203021.1944101-3-mhal@rbox.co> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-4-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-27KVM: x86/xen: Fix SRCU/RCU usage in readers of evtchn_portsPaolo Bonzini1-11/+18
The evtchnfd structure itself must be protected by either kvm->lock or SRCU. Use the former in kvm_xen_eventfd_update(), since the lock is being taken anyway; kvm_xen_hcall_evtchn_send() instead is a reader and does not need kvm->lock, and is called in SRCU critical section from the kvm_x86_handle_exit function. It is also important to use rcu_read_{lock,unlock}() in kvm_xen_hcall_evtchn_send(), because idr_remove() will *not* use synchronize_srcu() to wait for readers to complete. Remove a superfluous if (kvm) check before calling synchronize_srcu() in kvm_xen_eventfd_deassign() where kvm has been dereferenced already. Co-developed-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-3-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-27KVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badlyDavid Woodhouse1-38/+18
In particular, we shouldn't assume that being contiguous in guest virtual address space means being contiguous in guest *physical* address space. In dropping the manual calls to kvm_mmu_gva_to_gpa_system(), also drop the srcu_read_lock() that was around them. All call sites are reached from kvm_xen_hypercall() which is called from the handle_exit function with the read lock already held. 536395260 ("KVM: x86/xen: handle PV timers oneshot mode") 1a65105a5 ("KVM: x86/xen: handle PV spinlocks slowpath") Fixes: 2fd6df2f2 ("KVM: x86/xen: intercept EVTCHNOP_send from guests") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-27KVM: x86/xen: Fix memory leak in kvm_xen_write_hypercall_page()Michal Luczaj1-3/+4
Release page irrespectively of kvm_vcpu_write_guest() return value. Suggested-by: Paul Durrant <paul@xen.org> Fixes: 23200b7a30de ("KVM: x86/xen: intercept xen hypercalls if enabled") Signed-off-by: Michal Luczaj <mhal@rbox.co> Message-Id: <20221220151454.712165-1-mhal@rbox.co> Reviewed-by: Paul Durrant <paul@xen.org> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20221226120320.1125390-1-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-27kvm: x86/mmu: Remove duplicated "be split" in spte.hLai Jiangshan1-1/+1
"be split be split" -> "be split" Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20221207120505.9175-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86/mmu: Don't install TDP MMU SPTE if SP has unexpected levelSean Christopherson1-1/+3
Don't install a leaf TDP MMU SPTE if the parent page's level doesn't match the target level of the fault, and instead have the vCPU retry the faulting instruction after warning. Continuing on is completely unnecessary as the absolute worst case scenario of retrying is DoSing the vCPU, whereas continuing on all but guarantees bigger explosions, e.g. ------------[ cut here ]------------ kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559! invalid opcode: 0000 [#1] SMP CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:__handle_changed_spte.cold+0x95/0x9c RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246 RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027 RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8 RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0 R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3 R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01 FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0 Call Trace: <TASK> kvm_tdp_mmu_map+0x3b0/0x510 kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> Modules linked in: kvm_intel ---[ end trace 0000000000000000 ]--- Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221213033030.83345-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage is disallowedSean Christopherson1-1/+2
Re-check sp->nx_huge_page_disallowed under the tdp_mmu_pages_lock spinlock when adding a new shadow page in the TDP MMU. To ensure the NX reclaim kthread can't see a not-yet-linked shadow page, the page fault path links the new page table prior to adding the page to possible_nx_huge_pages. If the page is zapped by different task, e.g. because dirty logging is disabled, between linking the page and adding it to the list, KVM can end up triggering use-after-free by adding the zapped SP to the aforementioned list, as the zapped SP's memory is scheduled for removal via RCU callback. The bug is detected by the sanity checks guarded by CONFIG_DEBUG_LIST=y, i.e. the below splat is just one possible signature. ------------[ cut here ]------------ list_add corruption. prev->next should be next (ffffc9000071fa70), but was ffff88811125ee38. (prev=ffff88811125ee38). WARNING: CPU: 1 PID: 953 at lib/list_debug.c:30 __list_add_valid+0x79/0xa0 Modules linked in: kvm_intel CPU: 1 PID: 953 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #71 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:__list_add_valid+0x79/0xa0 RSP: 0018:ffffc900006efb68 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff888116cae8a0 RCX: 0000000000000027 RDX: 0000000000000027 RSI: 0000000100001872 RDI: ffff888277c5b4c8 RBP: ffffc90000717000 R08: ffff888277c5b4c0 R09: ffffc900006efa08 R10: 0000000000199998 R11: 0000000000199a20 R12: ffff888116cae930 R13: ffff88811125ee38 R14: ffffc9000071fa70 R15: ffff88810b794f90 FS: 00007fc0415d2740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000115201006 CR4: 0000000000172ea0 Call Trace: <TASK> track_possible_nx_huge_page+0x53/0x80 kvm_tdp_mmu_map+0x242/0x2c0 kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> ---[ end trace 0000000000000000 ]--- Fixes: 61f94478547b ("KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE") Reported-by: Greg Thelen <gthelen@google.com> Analyzed-by: David Matlack <dmatlack@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Ben Gardon <bgardon@google.com> Cc: Mingwei Zhang <mizhang@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221213033030.83345-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86/mmu: Map TDP MMU leaf SPTE iff target level is reachedSean Christopherson1-3/+11
Map the leaf SPTE when handling a TDP MMU page fault if and only if the target level is reached. A recent commit reworked the retry logic and incorrectly assumed that walking SPTEs would never "fail", as the loop either bails (retries) or installs parent SPs. However, the iterator itself will bail early if it detects a frozen (REMOVED) SPTE when stepping down. The TDP iterator also rereads the current SPTE before stepping down specifically to avoid walking into a part of the tree that is being removed, which means it's possible to terminate the loop without the guts of the loop observing the frozen SPTE, e.g. if a different task zaps a parent SPTE between the initial read and try_step_down()'s refresh. Mapping a leaf SPTE at the wrong level results in all kinds of badness as page table walkers interpret the SPTE as a page table, not a leaf, and walk into the weeds. ------------[ cut here ]------------ WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510 Modules linked in: kvm_intel CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_tdp_mmu_map+0x481/0x510 RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027 RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8 RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48 R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0 R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90 FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0 Call Trace: <TASK> kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> ---[ end trace 0000000000000000 ]--- Invalid SPTE change: cannot replace a present leaf SPTE with another present leaf SPTE mapping a different PFN! as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2 ------------[ cut here ]------------ kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559! invalid opcode: 0000 [#1] SMP CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:__handle_changed_spte.cold+0x95/0x9c RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246 RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027 RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8 RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0 R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3 R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01 FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0 Call Trace: <TASK> kvm_tdp_mmu_map+0x3b0/0x510 kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> Modules linked in: kvm_intel ---[ end trace 0000000000000000 ]--- Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry") Cc: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221213033030.83345-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86/mmu: Don't attempt to map leaf if target TDP MMU SPTE is frozenSean Christopherson1-3/+3
Hoist the is_removed_spte() check above the "level == goal_level" check when walking SPTEs during a TDP MMU page fault to avoid attempting to map a leaf entry if said entry is frozen by a different task/vCPU. ------------[ cut here ]------------ WARNING: CPU: 3 PID: 939 at arch/x86/kvm/mmu/tdp_mmu.c:653 kvm_tdp_mmu_map+0x269/0x4b0 Modules linked in: kvm_intel CPU: 3 PID: 939 Comm: nx_huge_pages_t Not tainted 6.1.0-rc4+ #67 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_tdp_mmu_map+0x269/0x4b0 RSP: 0018:ffffc9000068fba8 EFLAGS: 00010246 RAX: 00000000000005a0 RBX: ffffc9000068fcc0 RCX: 0000000000000005 RDX: ffff88810741f000 RSI: ffff888107f04600 RDI: ffffc900006a3000 RBP: 060000010b000bf3 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 000ffffffffff000 R12: 0000000000000005 R13: ffff888113670000 R14: ffff888107464958 R15: 0000000000000000 FS: 00007f01c942c740(0000) GS:ffff888277cc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000117013006 CR4: 0000000000172ea0 Call Trace: <TASK> kvm_tdp_page_fault+0x10c/0x130 kvm_mmu_page_fault+0x103/0x680 vmx_handle_exit+0x132/0x5a0 [kvm_intel] vcpu_enter_guest+0x60c/0x16f0 kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> ---[ end trace 0000000000000000 ]--- Fixes: 63d28a25e04c ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry") Cc: Robert Hoo <robert.hu@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Robert Hoo <robert.hu@linux.intel.com> Message-Id: <20221213033030.83345-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: nVMX: Don't stuff secondary execution control if it's not supportedSean Christopherson1-0/+7
When stuffing the allowed secondary execution controls for nested VMX in response to CPUID updates, don't set the allowed-1 bit for a feature that isn't supported by KVM, i.e. isn't allowed by the canonical vmcs_config. WARN if KVM attempts to manipulate a feature that isn't supported. All features that are currently stuffed are always advertised to L1 for nested VMX if they are supported in KVM's base configuration, and no additional features should ever be added to the CPUID-induced stuffing (updating VMX MSRs in response to CPUID updates is a long-standing KVM flaw that is slowly being fixed). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221213062306.667649-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: nVMX: Properly expose ENABLE_USR_WAIT_PAUSE control to L1Sean Christopherson1-1/+2
Set ENABLE_USR_WAIT_PAUSE in KVM's supported VMX MSR configuration if the feature is supported in hardware and enabled in KVM's base, non-nested configuration, i.e. expose ENABLE_USR_WAIT_PAUSE to L1 if it's supported. This fixes a bug where saving/restoring, i.e. migrating, a vCPU will fail if WAITPKG (the associated CPUID feature) is enabled for the vCPU, and obviously allows L1 to enable the feature for L2. KVM already effectively exposes ENABLE_USR_WAIT_PAUSE to L1 by stuffing the allowed-1 control ina vCPU's virtual MSR_IA32_VMX_PROCBASED_CTLS2 when updating secondary controls in response to KVM_SET_CPUID(2), but (a) that depends on flawed code (KVM shouldn't touch VMX MSRs in response to CPUID updates) and (b) runs afoul of vmx_restore_control_msr()'s restriction that the guest value must be a strict subset of the supported host value. Although no past commit explicitly enabled nested support for WAITPKG, doing so is safe and functionally correct from an architectural perspective as no additional KVM support is needed to virtualize TPAUSE, UMONITOR, and UMWAIT for L2 relative to L1, and KVM already forwards VM-Exits to L1 as necessary (commit bf653b78f960, "KVM: vmx: Introduce handle_unexpected_vmexit and handle WAITPKG vmexit"). Note, KVM always keeps the hosts MSR_IA32_UMWAIT_CONTROL resident in hardware, i.e. always runs both L1 and L2 with the host's power management settings for TPAUSE and UMWAIT. See commit bf09fb6cba4f ("KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL") for more details. Fixes: e69e72faa3a0 ("KVM: x86: Add support for user wait instructions") Cc: stable@vger.kernel.org Reported-by: Aaron Lewis <aaronlewis@google.com> Reported-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20221213062306.667649-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: nVMX: Document that ignoring memory failures for VMCLEAR is deliberateSean Christopherson1-4/+13
Explicitly drop the result of kvm_vcpu_write_guest() when writing the "launch state" as part of VMCLEAR emulation, and add a comment to call out that KVM's behavior is architecturally valid. Intel's pseudocode effectively says that VMCLEAR is a nop if the target VMCS address isn't in memory, e.g. if the address points at MMIO. Add a FIXME to call out that suppressing failures on __copy_to_user() is wrong, as memory (a memslot) does exist in that case. Punt the issue to the future as open coding kvm_vcpu_write_guest() just to make sure the guest dies with -EFAULT isn't worth the extra complexity. The flaw will need to be addressed if KVM ever does something intelligent on uaccess failures, e.g. to support post-copy demand paging, but in that case KVM will need a more thorough overhaul, i.e. VMCLEAR shouldn't need to open code a core KVM helper. No functional change intended. Reported-by: coverity-bot <keescook+coverity-bot@chromium.org> Addresses-Coverity-ID: 1527765 ("Error handling issues") Fixes: 587d7e72aedc ("kvm: nVMX: VMCLEAR should not cause the vCPU to shut down") Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221220154224.526568-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86: Sanity check inputs to kvm_handle_memory_failure()Sean Christopherson1-0/+3
Add a sanity check in kvm_handle_memory_failure() to assert that a valid x86_exception structure is provided if the memory "failure" wants to propagate a fault into the guest. If a memory failure happens during a direct guest physical memory access, e.g. for nested VMX, KVM hardcodes the failure to X86EMUL_IO_NEEDED and doesn't provide an exception pointer (because the exception struct would just be filled with garbage). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221220153427.514032-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86: Simplify kvm_apic_hw_enabledPeng Hao1-2/+2
kvm_apic_hw_enabled() only needs to return bool, there is no place to use the return value of MSR_IA32_APICBASE_ENABLE. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Message-Id: <CAPm50aJ=BLXNWT11+j36Dd6d7nz2JmOBk4u7o_NPQ0N61ODu1g@mail.gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86: hyper-v: Fix 'using uninitialized value' Coverity warningVitaly Kuznetsov1-27/+36
In kvm_hv_flush_tlb(), 'data_offset' and 'consumed_xmm_halves' variables are used in a mutually exclusive way: in 'hc->fast' we count in 'XMM halves' and increase 'data_offset' otherwise. Coverity discovered, that in one case both variables are incremented unconditionally. This doesn't seem to cause any issues as the only user of 'data_offset'/'consumed_xmm_halves' data is kvm_hv_get_tlb_flush_entries() -> kvm_hv_get_hc_data() which also takes into account 'hc->fast' but is still worth fixing. To make things explicit, put 'data_offset' and 'consumed_xmm_halves' to 'struct kvm_hv_hcall' as a union and use at call sites. This allows to remove explicit 'data_offset'/'consumed_xmm_halves' parameters from kvm_hv_get_hc_data()/kvm_get_sparse_vp_set()/kvm_hv_get_tlb_flush_entries() helpers. Note: 'struct kvm_hv_hcall' is allocated on stack in kvm_hv_hypercall() and is not zeroed, consumers are supposed to initialize the appropriate field if needed. Reported-by: coverity-bot <keescook+coverity-bot@chromium.org> Addresses-Coverity-ID: 1527764 ("Uninitialized variables") Fixes: 260970862c88 ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221208102700.959630-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-23KVM: x86: ioapic: Fix level-triggered EOI and userspace I/OAPIC reconfigure raceAdamos Ttofari1-2/+3
When scanning userspace I/OAPIC entries, intercept EOI for level-triggered IRQs if the current vCPU has a pending and/or in-service IRQ for the vector in its local API, even if the vCPU doesn't match the new entry's destination. This fixes a race between userspace I/OAPIC reconfiguration and IRQ delivery that results in the vector's bit being left set in the remote IRR due to the eventual EOI not being forwarded to the userspace I/OAPIC. Commit 0fc5a36dd6b3 ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race") fixed the in-kernel IOAPIC, but not the userspace IOAPIC configuration, which has a similar race. Fixes: 0fc5a36dd6b3 ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race") Signed-off-by: Adamos Ttofari <attofari@amazon.de> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221208094415.12723-1-attofari@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>