summaryrefslogtreecommitdiff
path: root/arch
AgeCommit message (Collapse)AuthorFilesLines
2021-08-03arm64: stacktrace: fix commentMark Rutland1-1/+1
Due to a copy-paste error, we describe struct stackframe::pc as a snapshot of the `fp` field rather than the `lr` field. Fix the comment. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Madhavan T. Venkataraman <madvenka@linux.microsoft.com> Cc: Mark Brown <broonie@kernel.org> Cc: Will Deacon <will@kernel.org> Reviewed-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20210802164845.45506-2-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-08-03arm64: fix the doc of RANDOMIZE_MODULE_REGION_FULLBarry Song2-4/+9
Obviously kaslr is setting the module region to 2GB rather than 4GB since commit b2eed9b588112 ("arm64/kernel: kaslr: reduce module randomization range to 2 GB"). So fix the size of region in Kconfig. On the other hand, even though RANDOMIZE_MODULE_REGION_FULL is not set, module_alloc() can fall back to a 2GB window if ARM64_MODULE_PLTS is set. In this case, veneers are still needed. !RANDOMIZE_MODULE_REGION_FULL doesn't necessarily mean veneers are not needed. So fix the doc to be more precise to avoid any confusion to the readers of the code. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@arm.com> Cc: Qi Liu <liuqi115@huawei.com> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20210730125131.13724-1-song.bao.hua@hisilicon.com Signed-off-by: Will Deacon <will@kernel.org>
2021-08-03arm64: move warning about toolchains to archprepareMasahiro Yamada1-9/+12
Commit 987fdfec2410 ("arm64: move --fix-cortex-a53-843419 linker test to Kconfig") fixed the false-positive warning in the installation step. Yet, there are some cases where this false-positive is shown. For example, you can see it when you cross 987fdfec2410 during git-bisect. $ git checkout 987fdfec2410^ [ snip ] $ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig all [ snip ] $ git checkout v5.13 [ snip] $ make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- defconfig all [ snip ] arch/arm64/Makefile:25: ld does not support --fix-cortex-a53-843419; kernel may be susceptible to erratum In the stale include/config/auto.config, CONFIG_ARM64_ERRATUM_843419=y is set without CONFIG_ARM64_LD_HAS_FIX_ERRATUM_843419, so the warning is displayed while parsing the Makefiles. Make will restart with the updated include/config/auto.config, hence CONFIG_ARM64_LD_HAS_FIX_ERRATUM_843419 will be set eventually, but this warning is a surprise for users. Commit 25896d073d8a ("x86/build: Fix compiler support check for CONFIG_RETPOLINE") addressed a similar issue. Move $(warning ...) out of the parse stage of Makefiles. The same applies to CONFIG_ARM64_USE_LSE_ATOMICS. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Link: https://lore.kernel.org/r/20210801053525.105235-1-masahiroy@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
2021-08-03arm64: fix compat syscall return truncationMark Rutland5-18/+27
Due to inconsistencies in the way we manipulate compat GPRs, we have a few issues today: * For audit and tracing, where error codes are handled as a (native) long, negative error codes are expected to be sign-extended to the native 64-bits, or they may fail to be matched correctly. Thus a syscall which fails with an error may erroneously be identified as failing. * For ptrace, *all* compat return values should be sign-extended for consistency with 32-bit arm, but we currently only do this for negative return codes. * As we may transiently set the upper 32 bits of some compat GPRs while in the kernel, these can be sampled by perf, which is somewhat confusing. This means that where a syscall returns a pointer above 2G, this will be sign-extended, but will not be mistaken for an error as error codes are constrained to the inclusive range [-4096, -1] where no user pointer can exist. To fix all of these, we must consistently use helpers to get/set the compat GPRs, ensuring that we never write the upper 32 bits of the return code, and always sign-extend when reading the return code. This patch does so, with the following changes: * We re-organise syscall_get_return_value() to always sign-extend for compat tasks, and reimplement syscall_get_error() atop. We update syscall_trace_exit() to use syscall_get_return_value(). * We consistently use syscall_set_return_value() to set the return value, ensureing the upper 32 bits are never set unexpectedly. * As the core audit code currently uses regs_return_value() rather than syscall_get_return_value(), we special-case this for compat_user_mode(regs) such that this will do the right thing. Going forward, we should try to move the core audit code over to syscall_get_return_value(). Cc: <stable@vger.kernel.org> Reported-by: He Zhe <zhe.he@windriver.com> Reported-by: weiyuchen <weiyuchen3@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/20210802104200.21390-1-mark.rutland@arm.com Signed-off-by: Will Deacon <will@kernel.org>
2021-08-03ARM: dts: aspeed-g5: Remove ngpios from sgpio node.Steven Lee1-1/+0
Remove ngpios property from sgpio node as it should be defined in the platform dts. Signed-off-by: Steven Lee <steven_lee@aspeedtech.com> Reviewed-by: Andrew Jeffery <andrew@aj.id.au> Link: https://lore.kernel.org/r/20210712100317.23298-5-steven_lee@aspeedtech.com Signed-off-by: Joel Stanley <joel@jms.id.au>
2021-08-03ARM: dts: aspeed-g6: Add SGPIO node.Steven Lee1-0/+28
AST2600 supports 2 SGPIO master interfaces one with 128 pins another one with 80 pins. Signed-off-by: Steven Lee <steven_lee@aspeedtech.com> Link: https://lore.kernel.org/r/20210712100317.23298-4-steven_lee@aspeedtech.com Signed-off-by: Joel Stanley <joel@jms.id.au>
2021-08-02block: remove cmdline-parser.cChristoph Hellwig1-1/+0
cmdline-parser.c is only used by the cmdline faux partition format, so merge the code into that and avoid an indirect call. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210728053756.409654-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02MIPS: don't include <linux/genhd.h> in <asm/mach-rc32434/rb.h>Christoph Hellwig1-2/+0
There is no need to include genhd.h from a random arch header, and not doing so prevents the possibility for nasty include loops. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20210727055646.118787-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02arm64: kasan: mte: remove redundant mte_report_once logicMark Rutland4-39/+1
We have special logic to suppress MTE tag check fault reporting, based on a global `mte_report_once` and `reported` variables. These can be used to suppress calling kasan_report() when taking a tag check fault, but do not prevent taking the fault in the first place, nor does they affect the way we disable tag checks upon taking a fault. The core KASAN code already defaults to reporting a single fault, and has a `multi_shot` control to permit reporting multiple faults. The only place we transiently alter `mte_report_once` is in lib/test_kasan.c, where we also the `multi_shot` state as the same time. Thus `mte_report_once` and `reported` are redundant, and can be removed. When a tag check fault is taken, tag checking will be disabled by `do_tag_recovery` and must be explicitly re-enabled if desired. The test code does this by calling kasan_enable_tagging_sync(). This patch removes the redundant mte_report_once() logic and associated variables. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Will Deacon <will@kernel.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20210714143843.56537-4-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-08-02arm64: kasan: mte: use a constant kernel GCR_EL1 valueMark Rutland8-49/+19
When KASAN_HW_TAGS is selected, KASAN is enabled at boot time, and the hardware supports MTE, we'll initialize `kernel_gcr_excl` with a value dependent on KASAN_TAG_MAX. While the resulting value is a constant which depends on KASAN_TAG_MAX, we have to perform some runtime work to generate the value, and have to read the value from memory during the exception entry path. It would be better if we could generate this as a constant at compile-time, and use it as such directly. Early in boot within __cpu_setup(), we initialize GCR_EL1 to a safe value, and later override this with the value required by KASAN. If CONFIG_KASAN_HW_TAGS is not selected, or if KASAN is disabeld at boot time, the kernel will not use IRG instructions, and so the initial value of GCR_EL1 is does not matter to the kernel. Thus, we can instead have __cpu_setup() initialize GCR_EL1 to a value consistent with KASAN_TAG_MAX, and avoid the need to re-initialize it during hotplug and resume form suspend. This patch makes arem64 use a compile-time constant KERNEL_GCR_EL1 value, which is compatible with KASAN_HW_TAGS when this is selected. This removes the need to re-initialize GCR_EL1 dynamically, and acts as an optimization to the entry assembly, which no longer needs to load this value from memory. The redundant initialization hooks are removed. In order to do this, KASAN_TAG_MAX needs to be visible outside of the core KASAN code. To do this, I've moved the KASAN_TAG_* values into <linux/kasan-tags.h>. There should be no functional change as a result of this patch. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Tested-by: Andrey Konovalov <andreyknvl@gmail.com> Link: https://lore.kernel.org/r/20210714143843.56537-3-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-08-02arm64/sve: Make fpsimd_bind_task_to_cpu() staticMark Brown2-2/+3
This function is not referenced outside fpsimd.c so can be static, making it that little bit easier to follow what is called from where. Signed-off-by: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/r/20210730165846.18558-1-broonie@kernel.org Reviewed-by: Dave Martin <Dave.Martin@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2021-08-02KVM: nSVM: remove useless kvm_clear_*_queuePaolo Bonzini1-5/+0
For an event to be in injected state when nested_svm_vmrun executes, it must have come from exitintinfo when svm_complete_interrupts ran: vcpu_enter_guest static_call(kvm_x86_run) -> svm_vcpu_run svm_complete_interrupts // now the event went from "exitintinfo" to "injected" static_call(kvm_x86_handle_exit) -> handle_exit svm_invoke_exit_handler vmrun_interception nested_svm_vmrun However, no event could have been in exitintinfo before a VMRUN vmexit. The code in svm.c is a bit more permissive than the one in vmx.c: if (is_external_interrupt(svm->vmcb->control.exit_int_info) && exit_code != SVM_EXIT_EXCP_BASE + PF_VECTOR && exit_code != SVM_EXIT_NPF && exit_code != SVM_EXIT_TASK_SWITCH && exit_code != SVM_EXIT_INTR && exit_code != SVM_EXIT_NMI) but in any case, a VMRUN instruction would not even start to execute during an attempted event delivery. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Preserve guest's CR0.CD/NW on INITSean Christopherson1-1/+13
Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as defined by both Intel's SDM and AMD's APM. Note, current versions of Intel's SDM are very poorly written with respect to INIT behavior. Table 9-1. "IA-32 and Intel 64 Processor States Following Power-up, Reset, or INIT" quite clearly lists power-up, RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1. But the SDM then attempts to qualify CD/NW behavior in a footnote: 2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits are cleared. Presumably that footnote is only meant for INIT, as the RESET case and especially the power-up case are rather non-sensical. Another footnote all but confirms that: 6. Internal caches are invalid after power-up and RESET, but left unchanged with an INIT. Bare metal testing shows that CD/NW are indeed preserved on INIT (someone else can hack their BIOS to check RESET and power-up :-D). Reported-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-47-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: SVM: Drop redundant clearing of vcpu->arch.hflags at INIT/RESETSean Christopherson1-3/+0
Drop redundant clears of vcpu->arch.hflags in init_vmcb() since kvm_vcpu_reset() always clears hflags, and it is also always zero at vCPU creation time. And of course, the second clearing in init_vmcb() was always redundant. Suggested-by: Reiji Watanabe <reijiw@google.com> Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-46-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: SVM: Emulate #INIT in response to triple fault shutdownSean Christopherson2-3/+8
Emulate a full #INIT instead of simply initializing the VMCB if the guest hits a shutdown. Initializing the VMCB but not other vCPU state, much of which is mirrored by the VMCB, results in incoherent and broken vCPU state. Ideally, KVM would not automatically init anything on shutdown, and instead put the vCPU into e.g. KVM_MP_STATE_UNINITIALIZED and force userspace to explicitly INIT or RESET the vCPU. Even better would be to add KVM_MP_STATE_SHUTDOWN, since technically NMI can break shutdown (and SMI on Intel CPUs). But, that ship has sailed, and emulating #INIT is the next best thing as that has at least some connection with reality since there exist bare metal platforms that automatically INIT the CPU if it hits shutdown. Fixes: 46fe4ddd9dbb ("[PATCH] KVM: SVM: Propagate cpu shutdown events to userspace") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-45-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Move RESET-only VMWRITE sequences to init_vmcs()Sean Christopherson1-15/+13
Move VMWRITE sequences in vmx_vcpu_reset() guarded by !init_event into init_vmcs() to make it more obvious that they're, uh, initializing the VMCS. No meaningful functional change intended (though the order of VMWRITEs and whatnot is different). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-44-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Remove redundant write to set vCPU as active at RESET/INITSean Christopherson1-2/+0
Drop a call to vmx_clear_hlt() during vCPU INIT, the guest's activity state is unconditionally set to "active" a few lines earlier in vmx_vcpu_reset(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-43-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Smush x2APIC MSR bitmap adjustments into single functionSean Christopherson2-35/+22
Consolidate all of the dynamic MSR bitmap adjustments into vmx_update_msr_bitmap_x2apic(), and rename the mode tracker to reflect that it is x2APIC specific. If KVM gains more cases of dynamic MSR pass-through, odds are very good that those new cases will be better off with their own logic, e.g. see Intel PT MSRs and MSR_IA32_SPEC_CTRL. Attempting to handle all updates in a common helper did more harm than good, as KVM ended up collecting a large number of useless "updates". Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-42-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Remove unnecessary initialization of msr_bitmap_modeSean Christopherson1-1/+0
Don't bother initializing msr_bitmap_mode to 0, all of struct vcpu_vmx is zero initialized. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-41-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Don't redo x2APIC MSR bitmaps when userspace filter is changedSean Christopherson1-1/+0
Drop an explicit call to update the x2APIC MSRs when the userspace MSR filter is modified. The x2APIC MSRs are deliberately exempt from userspace filtering. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-40-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: nVMX: Remove obsolete MSR bitmap refresh at nested transitionsSean Christopherson3-8/+1
Drop unnecessary MSR bitmap updates during nested transitions, as L1's APIC_BASE MSR is not modified by the standard VM-Enter/VM-Exit flows, and L2's MSR bitmap is managed separately. In the unlikely event that L1 is pathological and loads APIC_BASE via the VM-Exit load list, KVM will handle updating the bitmap in its normal WRMSR flows. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-39-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Remove obsolete MSR bitmap refresh at vCPU RESET/INITSean Christopherson1-3/+0
Remove an unnecessary MSR bitmap refresh during vCPU RESET/INIT. In both cases, the MSR bitmap already has the desired values and state. At RESET, the vCPU is guaranteed to be running with x2APIC disabled, the x2APIC MSRs are guaranteed to be intercepted due to the MSR bitmap being initialized to all ones by alloc_loaded_vmcs(), and vmx->msr_bitmap_mode is guaranteed to be zero, i.e. reflecting x2APIC disabled. At INIT, the APIC_BASE MSR is not modified, thus there can't be any change in x2APIC state. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-38-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86Sean Christopherson3-15/+8
Move the setting of CR0, CR4, EFER, RFLAGS, and RIP from vendor code to common x86. VMX and SVM now have near-identical sequences, the only difference being that VMX updates the exception bitmap. Updating the bitmap on SVM is unnecessary, but benign. Unfortunately it can't be left behind in VMX due to the need to update exception intercepts after the control registers are set. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-37-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Don't _explicitly_ reconfigure user return MSRs on vCPU INITSean Christopherson1-2/+2
When emulating vCPU INIT, do not unconditionally refresh the list of user return MSRs that need to be loaded into hardware when running the guest. Unconditionally refreshing the list is confusing, as the vast majority of MSRs are not modified on INIT. The real motivation is to handle the case where an INIT during long mode obviates the need to load the SYSCALL MSRs, and that is handled as needed by vmx_set_efer(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-36-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Refresh list of user return MSRs after setting guest CPUIDSean Christopherson1-0/+2
After a CPUID update, refresh the list of user return MSRs that are loaded into hardware when running the vCPU. This is necessary to handle the oddball case where userspace exposes X86_FEATURE_RDTSCP to the guest after the vCPU is running. Fixes: 0023ef39dc35 ("kvm: vmx: Set IA32_TSC_AUX for legacy mode guests") Fixes: 4e47c7a6d714 ("KVM: VMX: Add instruction rdtscp support for guest") Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-35-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Skip pointless MSR bitmap update when setting EFERSean Christopherson1-9/+10
Split setup_msrs() into vmx_setup_uret_msrs() and an open coded refresh of the MSR bitmap, and skip the latter when refreshing the user return MSRs during an EFER load. Only the x2APIC MSRs are dynamically exposed and hidden, and those are not affected by a change in EFER. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-34-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: SVM: Stuff save->dr6 at during VMSA sync, not at RESET/INITSean Christopherson2-1/+1
Move code to stuff vmcb->save.dr6 to its architectural init value from svm_vcpu_reset() into sev_es_sync_vmsa(). Except for protected guests, a.k.a. SEV-ES guests, vmcb->save.dr6 is set during VM-Enter, i.e. the extra write is unnecessary. For SEV-ES, stuffing save->dr6 handles a theoretical case where the VMSA could be encrypted before the first KVM_RUN. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-33-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: SVM: Drop redundant writes to vmcb->save.cr4 at RESET/INITSean Christopherson1-3/+0
Drop direct writes to vmcb->save.cr4 during vCPU RESET/INIT, as the values being written are fully redundant with respect to svm_set_cr4(vcpu, 0) a few lines earlier. Note, svm_set_cr4() also correctly forces X86_CR4_PAE when NPT is disabled. No functional change intended. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-32-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: SVM: Tweak order of cr0/cr4/efer writes at RESET/INITSean Christopherson1-6/+1
Hoist svm_set_cr0() up in the sequence of register initialization during vCPU RESET/INIT, purely to match VMX so that a future patch can move the sequences to common x86. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-31-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: nVMX: Don't evaluate "emulation required" on nested VM-ExitSean Christopherson3-13/+11
Use the "internal" variants of setting segment registers when stuffing state on nested VM-Exit in order to skip the "emulation required" updates. VM-Exit must always go to protected mode, and all segments are mostly hardcoded (to valid values) on VM-Exit. The bits of the segments that aren't hardcoded are explicitly checked during VM-Enter, e.g. the selector RPLs must all be zero. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-30-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Skip emulation required checks during pmode/rmode transitionsSean Christopherson1-6/+12
Don't refresh "emulation required" when stuffing segments during transitions to/from real mode when running without unrestricted guest. The checks are unnecessary as vmx_set_cr0() unconditionally rechecks "emulation required". They also happen to be broken, as enter_pmode() and enter_rmode() run with a stale vcpu->arch.cr0. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-29-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Process CR0.PG side effects after setting CR0 assetsSean Christopherson1-11/+12
Move the long mode and EPT w/o unrestricted guest side effect processing down in vmx_set_cr0() so that the EPT && !URG case doesn't have to stuff vcpu->arch.cr0 early. This also fixes an oddity where CR0 might not be marked available, i.e. the early vcpu->arch.cr0 write would appear to be in danger of being overwritten, though that can't actually happen in the current code since CR0.TS is the only guest-owned bit, and CR0.TS is not read by vmx_set_cr4(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-28-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86/mmu: Skip the permission_fault() check on MMIO if CR0.PG=0Sean Christopherson1-3/+3
Skip the MMU permission_fault() check if paging is disabled when verifying the cached MMIO GVA is usable. The check is unnecessary and can theoretically get a false positive since the MMU doesn't zero out "permissions" or "pkru_mask" when guest paging is disabled. The obvious alternative is to zero out all the bitmasks when configuring nonpaging MMUs, but that's unnecessary work and doesn't align with the MMU's general approach of doing as little as possible for flows that are supposed to be unreachable. This is nearly a nop as the false positive is nothing more than an insignificant performance blip, and more or less limited to string MMIO when L1 is running with paging disabled. KVM doesn't cache MMIO if L2 is active with nested TDP since the "GVA" is really an L2 GPA. If L2 is active without nested TDP, then paging can't be disabled as neither VMX nor SVM allows entering the guest without paging of some form. Jumping back to L1 with paging disabled, in that case direct_map is true and so KVM will use CR2 as a GPA; the only time it doesn't is if the fault from the emulator doesn't match or emulator_can_use_gpa(), and that fails only on string MMIO and other instructions with multiple memory operands. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-27-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Pull GUEST_CR3 from the VMCS iff CR3 load exiting is disabledSean Christopherson1-2/+5
Tweak the logic for grabbing vmcs.GUEST_CR3 in vmx_cache_reg() to look directly at the execution controls, as opposed to effectively inferring the controls based on vCPUs. Inferring the controls isn't wrong, but it creates a very subtle dependency between the caching logic, the state of vcpu->arch.cr0 (via is_paging()), and the behavior of vmx_set_cr0(). Using the execution controls doesn't completely eliminate the dependency in vmx_set_cr0(), e.g. neglecting to cache CR3 before enabling interception would still break the guest, but it does reduce the code dependency and mostly eliminate the logical dependency (that CR3 loads are intercepted in certain scenarios). Eliminating the subtle read of vcpu->arch.cr0 will also allow for additional cleanup in vmx_set_cr0(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-26-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: nVMX: Do not clear CR3 load/store exiting bits if L1 wants 'emSean Christopherson1-9/+37
Keep CR3 load/store exiting enable as needed when running L2 in order to honor L1's desires. This fixes a largely theoretical bug where L1 could intercept CR3 but not CR0.PG and end up not getting the desired CR3 exits when L2 enables paging. In other words, the existing !is_paging() check inadvertantly handles the normal case for L2 where vmx_set_cr0() is called during VM-Enter, which is guaranteed to run with paging enabled, and thus will never clear the bits. Removing the !is_paging() check will also allow future consolidation and cleanup of the related code. From a performance perspective, this is all a nop, as the VMCS controls shadow will optimize away the VMWRITE when the controls are in the desired state. Add a comment explaining why CR3 is intercepted, with a big disclaimer about not querying the old CR3. Because vmx_set_cr0() is used for flows that are not directly tied to MOV CR3, e.g. vCPU RESET/INIT and nested VM-Enter, it's possible that is_paging() is not synchronized with CR3 load/store exiting. This is actually guaranteed in the current code, as KVM starts with CR3 interception disabled. Obviously that can be fixed, but there's no good reason to play whack-a-mole, and it tends to end poorly, e.g. descriptor table exiting for UMIP emulation attempted to be precise in the past and ended up botching the interception toggling. Fixes: fe3ef05c7572 ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-25-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Fold ept_update_paging_mode_cr0() back into vmx_set_cr0()Sean Christopherson1-23/+17
Move the CR0/CR3/CR4 shenanigans for EPT without unrestricted guest back into vmx_set_cr0(). This will allow a future patch to eliminate the rather gross stuffing of vcpu->arch.cr0 in the paging transition cases by snapshotting the old CR0. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-24-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Remove direct write to vcpu->arch.cr0 during vCPU RESET/INITSean Christopherson1-4/+1
Remove a bogus write to vcpu->arch.cr0 that immediately precedes vmx_set_cr0() during vCPU RESET/INIT. For RESET, this is a nop since the "old" CR0 value is meaningless. But for INIT, if the vCPU is coming from paging enabled mode, crushing vcpu->arch.cr0 will cause the various is_paging() checks in vmx_set_cr0() to get false negatives. For the exit_lmode() case, the false negative is benign as vmx_set_efer() is called immediately after vmx_set_cr0(). For EPT without unrestricted guest, the false negative will cause KVM to unnecessarily run with CR3 load/store exiting. But again, this is benign, albeit sub-optimal. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Invert handling of CR0.WP for EPT without unrestricted guestSean Christopherson1-9/+5
Opt-in to forcing CR0.WP=1 for shadow paging, and stop lying about WP being "always on" for unrestricted guest. In addition to making KVM a wee bit more honest, this paves the way for additional cleanup. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: SVM: Don't bother writing vmcb->save.rip at vCPU RESET/INITSean Christopherson1-2/+1
Drop unnecessary initialization of vmcb->save.rip during vCPU RESET/INIT, as svm_vcpu_run() unconditionally propagates VCPU_REGS_RIP to save.rip. No true functional change intended. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Move EDX initialization at vCPU RESET to common codeSean Christopherson4-24/+13
Move the EDX initialization at vCPU RESET, which is now identical between VMX and SVM, into common code. No functional change intended. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Consolidate APIC base RESET initialization codeSean Christopherson3-18/+7
Consolidate the APIC base RESET logic, which is currently spread out across both x86 and vendor code. For an in-kernel APIC, the vendor code is redundant. But for a userspace APIC, KVM relies on the vendor code to initialize vcpu->arch.apic_base. Hoist the vcpu->arch.apic_base initialization above the !apic check so that it applies to both flavors of APIC emulation, and delete the vendor code. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Open code necessary bits of kvm_lapic_set_base() at vCPU RESETSean Christopherson1-9/+6
Stuff vcpu->arch.apic_base and apic->base_address directly during APIC reset, as opposed to bouncing through kvm_set_apic_base() while fudging the ENABLE bit during creation to avoid the other, unwanted side effects. This is a step towards consolidating the APIC RESET logic across x86, VMX, and SVM. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Stuff vcpu->arch.apic_base directly at vCPU RESETSean Christopherson1-6/+3
Write vcpu->arch.apic_base directly instead of bouncing through kvm_set_apic_base(). This is a glorified nop, and is a step towards cleaning up the mess that is local APIC creation. When using an in-kernel APIC, kvm_create_lapic() explicitly sets vcpu->arch.apic_base to MSR_IA32_APICBASE_ENABLE to avoid its own kvm_lapic_set_base() call in kvm_lapic_reset() from triggering state changes. That call during RESET exists purely to set apic->base_address to the default base value. As a result, by the time VMX gets control, the only missing piece is the BSP bit being set for the reset BSP. For a userspace APIC, there are no side effects to process (for the APIC). In both cases, the call to kvm_update_cpuid_runtime() is a nop because the vCPU hasn't yet been exposed to userspace, i.e. there can't be any CPUID entries. No functional change intended. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Set BSP bit in reset BSP vCPU's APIC base by defaultSean Christopherson1-2/+5
Set the BSP bit appropriately during local APIC "reset" instead of relying on vendor code to clean up at a later point. This is a step towards consolidating the local APIC, VMX, and SVM xAPIC initialization code. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Don't force set BSP bit when local APIC is managed by userspaceSean Christopherson1-3/+0
Don't set the BSP bit in vcpu->arch.apic_base when the local APIC is managed by userspace. Forcing all vCPUs to be BSPs is non-sensical, and was dead code when it was added by commit 97222cc83163 ("KVM: Emulate local APIC in kernel"). At the time, kvm_lapic_set_base() was invoked if and only if the local APIC was in-kernel (and it couldn't be called before the vCPU created its APIC). kvm_lapic_set_base() eventually gained generic usage, but the latent bug escaped notice because the only true consumer would be the guest itself in the form of an explicit RDMSRs on APs. Out of Linux, SeaBIOS, and EDK2/OVMF, only OVMF consumes the BSP bit from the APIC_BASE MSR. For the vast majority of usage in OVMF, BSP confusion would be benign. OVMF's BSP election upon SMI rendezvous might be broken, but practically no one runs KVM with an out-of-kernel local APIC, let alone does so while utilizing SMIs with OVMF. Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel") Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Migrate the PIT only if vcpu0 is migrated, not any BSPSean Christopherson1-1/+2
Make vcpu0 the arbitrary owner of the PIT, as was intended when PIT migration was added by commit 2f5997140f22 ("KVM: migrate PIT timer"). The PIT was unintentionally turned into being owned by the BSP by commit c5af89b68abb ("KVM: Introduce kvm_vcpu_is_bsp() function."), and was then unintentionally converted to a shared ownership model when kvm_vcpu_is_bsp() was modified to check the APIC base MSR instead of hardcoding vcpu0 as the BSP. Functionally, this just means the PIT's hrtimer is migrated less often. The real motivation is to remove the usage of kvm_vcpu_is_bsp(), so that more legacy/broken crud can be removed in a future patch. Fixes: 58d269d8cccc ("KVM: x86: BSP in MSR_IA32_APICBASE is writable") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: Remove defunct BSP "update" in local APIC resetSean Christopherson1-3/+1
Remove a BSP APIC update in kvm_lapic_reset() that is a glorified and confusing nop. When the code was originally added, kvm_vcpu_is_bsp() queried kvm->arch.bsp_vcpu, i.e. the intent was to set the BSP bit in the BSP vCPU's APIC. But, stuffing the BSP bit at INIT was wrong since the guest can change its BSP(s); this was fixed by commit 58d269d8cccc ("KVM: x86: BSP in MSR_IA32_APICBASE is writable"). In other words, kvm_vcpu_is_bsp() is now purely a reflection of vcpu->arch.apic_base.MSR_IA32_APICBASE_BSP, thus the update will always set the current value and kvm_lapic_set_base() is effectively a nop if the new and old values match. The RESET case, which does need to stuff the BSP for the reset vCPU, is handled by vendor code (though this will soon be moved to common code). No functional change intended. Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: x86: WARN if the APIC map is dirty without an in-kernel local APICSean Christopherson1-0/+3
WARN if KVM ends up in a state where it thinks its APIC map needs to be recalculated, but KVM is not emulating the local APIC. This is mostly to document KVM's "rules" in order to provide clarity in future cleanups. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: SVM: Drop explicit MMU reset at RESET/INITSean Christopherson1-1/+0
Drop an explicit MMU reset in SVM's vCPU RESET/INIT flow now that the common x86 path correctly handles conditional MMU resets, e.g. if INIT arrives while the vCPU is in 64-bit mode. This reverts commit ebae871a509d ("kvm: svm: reset mmu on VCPU reset"). Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-08-02KVM: VMX: Remove explicit MMU reset in enter_rmode()Sean Christopherson1-2/+0
Drop an explicit MMU reset when entering emulated real mode now that the vCPU INIT/RESET path correctly handles conditional MMU resets, e.g. if INIT arrives while the vCPU is in 64-bit mode. Note, while there are multiple other direct calls to vmx_set_cr0(), i.e. paths that change CR0 without invoking kvm_post_set_cr0(), only the INIT emulation can reach enter_rmode(). CLTS emulation only toggles CR.TS, VM-Exit (and late VM-Fail) emulation cannot architecturally transition to Real Mode, and VM-Enter to Real Mode is possible if and only if Unrestricted Guest is enabled (exposed to L1). This effectively reverts commit 8668a3c468ed ("KVM: VMX: Reset mmu context when entering real mode") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210713163324.627647-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>