path: root/arch/x86
Age  Commit message  (Author; files changed, lines -/+)
2021-02-06  entry: Ensure trap after single-step on system call return  (Gabriel Krisman Bertazi; 2 files, -4/+8)
Commit 299155244770 ("entry: Drop usage of TIF flags in the generic syscall code") introduced a bug on architectures using the generic syscall entry code, in which processes stopped by PTRACE_SYSCALL do not trap on syscall return after receiving TIF_SINGLESTEP. The reason is that the meaning of the TIF_SINGLESTEP flag is overloaded to cause the trap after a system call is executed, but since the above commit the syscall exit path only checks the SYSCALL_WORK flags when handling exit work.

Split the meaning of TIF_SINGLESTEP such that it only means single-step mode, and create a new type of SYSCALL_WORK to request a trap immediately after a syscall in single-step mode. In the current implementation, the SYSCALL_WORK flag shadows the TIF_SINGLESTEP flag for simplicity. Update x86 to flip this bit when a tracer enables single stepping.

Fixes: 299155244770 ("entry: Drop usage of TIF flags in the generic syscall code")
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kyle Huey <me@kylehuey.com>
Link: https://lore.kernel.org/r/87h7mtc9pr.fsf_-_@collabora.com
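A minimal sketch of the resulting exit-work check, assuming a SYSCALL_WORK_SYSCALL_EXIT_TRAP flag as described above (helper names are illustrative, not guaranteed to match the final code):

    static inline bool report_single_step(unsigned long work)
    {
        /* Set by the tracer (via TIF_SINGLESTEP on x86) to request a
         * trap right after the syscall completes. */
        return work & SYSCALL_WORK_SYSCALL_EXIT_TRAP;
    }

    static void syscall_exit_work(struct pt_regs *regs, unsigned long work)
    {
        bool step = report_single_step(work);

        /* Report syscall exit to the tracer even when only
         * single-stepping was requested. */
        if (step || (work & SYSCALL_WORK_SYSCALL_TRACE))
            arch_syscall_exit_tracehook(regs, step);
    }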
2021-02-05  x86/debug: Prevent data breakpoints on cpu_dr7  (Lai Jiangshan; 1 file, -0/+8)
local_db_save() is called at the start of exc_debug_kernel(), reads DR7 and disables breakpoints to prevent recursion. When running in a guest (X86_FEATURE_HYPERVISOR), local_db_save() reads the per-CPU variable cpu_dr7 to check whether a breakpoint is active or not before it accesses DR7. A data breakpoint on cpu_dr7 therefore results in infinite #DB recursion.

Disallow data breakpoints on cpu_dr7 to prevent that.

Fixes: 84b6a3491567a ("x86/entry: Optimize local_db_save() for virt")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210204152708.21308-2-jiangshanlai@gmail.com
2021-02-05  x86/debug: Prevent data breakpoints on __per_cpu_offset  (Lai Jiangshan; 1 file, -0/+14)
When FSGSBASE is enabled, paranoid_entry() fetches the per-CPU GSBASE value via __per_cpu_offset or pcpu_unit_offsets. When a data breakpoint is set on __per_cpu_offset[cpu] (a read-write operation), that CPU will be stuck in an infinite #DB loop.

RCU will try to send an NMI to the CPU, but that does not work either, since NMI delivery also relies on paranoid_entry(), which means the condition is undebuggable.

Fixes: eaad981291ee3 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210204152708.21308-1-jiangshanlai@gmail.com
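Both breakpoint fixes above follow the same pattern. A hedged sketch of the kind of check involved, assuming a within_area()/within_cpu_entry() pair along the lines of arch/x86/kernel/hw_breakpoint.c (names and call sites illustrative):

    /* True if the breakpoint range [addr, end] overlaps [base, base + size). */
    static inline bool within_area(unsigned long addr, unsigned long end,
                                   unsigned long base, unsigned long size)
    {
        return end >= base && addr < (base + size);
    }

    /* Reject breakpoints covering per-CPU data that the #DB and NMI entry
     * paths themselves consume, e.g. cpu_dr7 and __per_cpu_offset. */
    static bool within_cpu_entry(unsigned long addr, unsigned long end)
    {
        int cpu;

        for_each_possible_cpu(cpu) {
            if (within_area(addr, end,
                            (unsigned long)&per_cpu(cpu_dr7, cpu),
                            sizeof(unsigned long)))
                return true;
            if (within_area(addr, end,
                            (unsigned long)&__per_cpu_offset[cpu],
                            sizeof(__per_cpu_offset[cpu])))
                return true;
        }
        return false;
    }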
2021-02-05  x86/Kconfig: Remove HPET_EMULATE_RTC depends on RTC  (Anand K Mistry; 1 file, -1/+1)
The RTC config option was removed in commit f52ef24be21a ("rtc/alpha: remove legacy rtc driver") Signed-off-by: Anand K Mistry <amistry@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20210204183205.1.If5c6ded53a00ecad6a02a1e974316291cc0239d1@changeid
2021-02-05  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds; 11 files, -38/+61)
Pull KVM fixes from Paolo Bonzini:
 "x86 has lots of small bugfixes, mostly one liners. It's quite late in 5.11-rc but none of them are related to this merge window; it's just bugs coming in at the wrong time.

  Of note among the others is "KVM: x86: Allow guests to see MSR_IA32_TSX_CTRL even if tsx=off" that fixes a live migration failure seen on distros that hadn't switched to tsx=off right away.

  ARM:
   - Avoid clobbering extra registers on initialisation"

[ Sean Christopherson notes that commit 943dea8af21b ("KVM: x86: Update emulator context mode if SYSENTER xfers to 64-bit mode") should have had authorship credited to Jonny Barker, not to him. - Linus ]

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: Set so called 'reserved CR3 bits in LM mask' at vCPU reset
  KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs
  KVM: x86: cleanup CR3 reserved bits checks
  KVM: SVM: Treat SVM as unsupported when running as an SEV guest
  KVM: x86: Update emulator context mode if SYSENTER xfers to 64-bit mode
  KVM: x86: Supplement __cr4_reserved_bits() with X86_FEATURE_PCID check
  KVM/x86: assign hva with the right value to vm_munmap the pages
  KVM: x86: Allow guests to see MSR_IA32_TSX_CTRL even if tsx=off
  Fix unsynchronized access to sev members through svm_register_enc_region
  KVM: Documentation: Fix documentation for nested.
  KVM: x86: fix CPUID entries returned by KVM_GET_CPUID2 ioctl
  KVM: arm64: Don't clobber x4 in __do_hyp_init
2021-02-05  x86/sgx: Drop racy follow_pfn() check  (Daniel Vetter; 1 file, -8/+0)
PTE insertion is fundamentally racy, and this check doesn't do anything useful. Quoting Sean:

  "Yeah, it can be whacked. The original, never-upstreamed code asserted that the resolved PFN matched the PFN being installed by the fault handler as a sanity check on the SGX driver's EPC management. The WARN assertion got dropped for whatever reason, leaving that useless chunk."

Jason stumbled over this as a new user of follow_pfn(), and I'm trying to get rid of unsafe callers of that function so it can be locked down further. This is independent prep work for the referenced patch series:

  https://lore.kernel.org/dri-devel/20201127164131.2244124-1-daniel.vetter@ffwll.ch/

Fixes: 947c6e11fa43 ("x86/sgx: Add ptrace() support for the SGX driver")
Reported-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/20210204184519.2809313-1-daniel.vetter@ffwll.ch
2021-02-05  x86/asm: Fixup TASK_SIZE_MAX comment  (Alexey Dobriyan; 1 file, -1/+1)
The comment says "by preventing anything executable", which is not true: even a PROT_NONE mapping can't be installed at (1<<47 - 4096).

  mmap(0x7ffffffff000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM

[ bp: Fixup to the moved location in page_64_types.h. ]

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200305181719.GA5490@avx2
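A small userspace demonstration of the claim above (a sketch; the fixed address assumes the default 47-bit user address space, i.e. no LA57):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Last page below the canonical user/kernel boundary: (1<<47) - 4096. */
        void *p = mmap((void *)0x7ffffffff000UL, 4096, PROT_NONE,
                       MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            perror("mmap");  /* expected: Cannot allocate memory (ENOMEM) */
        return 0;
    }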
2021-02-04  x86/apic: Add extra serialization for non-serializing MSRs  (Dave Hansen; 5 files, -15/+32)
Jan Kiszka reported that the x2apic_wrmsr_fence() function uses a plain MFENCE while the Intel SDM (10.12.3 MSR Access in x2APIC Mode) calls for MFENCE; LFENCE.

Short summary: we have special MSRs that have weaker ordering than all the rest. Add fencing consistent with current SDM recommendations. This is not known to cause any issues in practice, only in theory.

Longer story below:

The reason the kernel uses a different semantic is that the SDM changed (roughly in late 2017). The SDM changed because folks at Intel were auditing all of the recommended fences in the SDM and realized that the x2apic fences were insufficient.

Why was the plain MFENCE judged insufficient? WRMSR itself is normally a serializing instruction. No fences are needed because the instruction itself serializes everything. But, there are explicit exceptions for this serializing behavior written into the WRMSR instruction documentation for two classes of MSRs: IA32_TSC_DEADLINE and the X2APIC MSRs.

Back to x2apic: WRMSR is *not* serializing in this specific case. But why is MFENCE insufficient? MFENCE makes writes visible, but only affects load/store instructions. WRMSR is unfortunately not a load/store instruction and is unaffected by MFENCE. This means that a non-serializing WRMSR could be reordered by the CPU to execute before the writes made visible by the MFENCE have even occurred in the first place. This means that an x2apic IPI could theoretically be triggered before there is any (visible) data to process.

Does this affect anything in practice? I honestly don't know. It seems quite possible that by the time an interrupt gets to consume the (not yet) MFENCE'd data, it has become visible, mostly by accident. To be safe, add the SDM-recommended fences for all x2apic WRMSRs.

This also leaves open the question of the _other_ weakly-ordered WRMSR: MSR_IA32_TSC_DEADLINE. While it has the same ordering architecture as the x2APIC MSRs, it seems substantially less likely to be a problem in practice. While writes to the in-memory Local Vector Table (LVT) might theoretically be reordered with respect to a weakly-ordered WRMSR like TSC_DEADLINE, the SDM has this to say:

  In x2APIC mode, the WRMSR instruction is used to write to the LVT entry. The processor ensures the ordering of this write and any subsequent WRMSR to the deadline; no fencing is required.

But, that might still leave xAPIC exposed. The safest thing to do for now is to add the extra, recommended LFENCE.

[ bp: Massage commit message, fix typos, drop accidentally added newline to tools/arch/x86/include/asm/barrier.h. ]

Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200305174708.F77040DD@viggo.jf.intel.com
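A minimal sketch of the fence the SDM calls for (the helper name is illustrative of the barrier described above, not a guaranteed signature):

    /* MFENCE drains pending stores; the trailing LFENCE keeps the subsequent,
     * non-serializing x2APIC WRMSR from executing before those stores have
     * become visible. */
    static inline void weak_wrmsr_fence(void)
    {
        asm volatile("mfence; lfence" : : : "memory");
    }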
2021-02-04  Revert "x86/setup: don't remove E820_TYPE_RAM for pfn 0"  (Mike Rapoport; 1 file, -9/+11)
This reverts commit bde9cfa3afe4324ec251e4af80ebf9b7afaf7afe.

Changing the first memory page type from E820_TYPE_RESERVED to E820_TYPE_RAM makes it part of the "System RAM" resource rather than a reserved resource, and this in turn causes devmem_is_allowed() to treat it as an area that can be accessed, but it is filled with zeroes instead of the actual data as previously. The change in /dev/mem output causes lilo to fail, as was reported on the Slackware users forum, and other legacy applications will probably experience similar problems.

Link: https://www.linuxquestions.org/questions/slackware-14/slackware-current-lilo-vesa-warnings-after-recent-updates-4175689617/#post6214439
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-04  KVM: x86: Set so called 'reserved CR3 bits in LM mask' at vCPU reset  (Sean Christopherson; 1 file, -0/+1)
Set cr3_lm_rsvd_bits, which is effectively an invalid GPA mask, at vCPU reset. The reserved bits check needs to be done even if userspace never configures the guest's CPUID model. Cc: stable@vger.kernel.org Fixes: 0107973a80ad ("KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  bpf: Emit explicit NULL pointer checks for PROBE_LDX instructions.  (Alexei Starovoitov; 1 file, -0/+19)
PTR_TO_BTF_ID registers contain either a kernel pointer or NULL. Emit the NULL check explicitly in the JIT instead of going into do_user_addr_fault() on a NULL dereference.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20210202053837.95909-1-alexei.starovoitov@gmail.com
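The JIT emits x86 instructions directly; as a rough, hedged illustration, the emitted sequence behaves like the following C (illustrative only, not the JIT code itself):

    /* For a PROBE_LDX (BPF_LDX | BPF_PROBE_MEM) load: if the source pointer
     * is NULL, skip the load and leave 0 in the destination register instead
     * of faulting into do_user_addr_fault(). */
    static u64 probe_ldx_u64(const u64 *src)
    {
        if (!src)
            return 0;
        return *src;
    }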
2021-02-04  KVM: x86: Add helper to consolidate "raw" reserved GPA mask calculations  (Sean Christopherson; 4 files, -9/+20)
Add a helper to generate the mask of reserved GPA bits _without_ any adjustments for repurposed bits, and use it to replace a variety of open coded variants in the MTRR and APIC_BASE flows. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86/mmu: Add helper to generate mask of reserved HPA bits  (Sean Christopherson; 1 file, -5/+9)
Add a helper to generate the mask of reserved PA bits in the host. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86: Use reserved_gpa_bits to calculate reserved PxE bits  (Sean Christopherson; 3 files, -59/+58)
Use reserved_gpa_bits, which accounts for exceptions to the maxphyaddr rule, e.g. SEV's C-bit, for the page {table,directory,etc...} entry (PxE) reserved bits checks. For SEV, the C-bit is ignored by hardware when walking page tables, e.g. the APM states:

  Note that while the guest may choose to set the C-bit explicitly on instruction pages and page table addresses, the value of this bit is a don't-care in such situations as hardware always performs these as private accesses.

Such behavior is expected to hold true for other features that repurpose GPA bits, e.g. KVM could theoretically emulate SME or MKTME, which both allow non-zero repurposed bits in the page tables. Conceptually, KVM should apply reserved GPA checks universally, and any features that do not adhere to the basic rule should be explicitly handled, i.e. if a GPA bit is repurposed but not allowed in page tables for whatever reason.

Refactor __reset_rsvds_bits_mask() to take the pre-generated reserved bits mask, and opportunistically clean up its code, e.g. to align lines and comments.

Practically speaking, this change is likely a glorified nop given the current KVM code base. SEV's C-bit is the only repurposed GPA bit, and KVM doesn't support shadowing encrypted page tables (which is theoretically possible via SEV debug APIs).

Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86: SEV: Treat C-bit as legal GPA bit regardless of vCPU mode  (Sean Christopherson; 6 files, -9/+8)
Rename cr3_lm_rsvd_bits to reserved_gpa_bits, and use it for all GPA legality checks. AMD's APM states:

  If the C-bit is an address bit, this bit is masked from the guest physical address when it is translated through the nested page tables.

Thus, any access that can conceivably be run through NPT should ignore the C-bit when checking for validity.

For features that KVM emulates in software, e.g. MTRRs, there is no clear direction in the APM for how the C-bit should be handled. For such cases, follow the SME behavior inasmuch as possible, since SEV is essentially a VM-specific variant of SME. For SME, the APM states:

  In this case the upper physical address bits are treated as reserved when the feature is enabled except where otherwise indicated.

Collecting the various relevant SME snippets in the APM and cross-referencing the omissions with Linux kernel code, this leaves MTRRs and APIC_BASE as the only flows that KVM emulates that should _not_ ignore the C-bit.

Note, this means the reserved bit checks in the page tables are technically broken. This will be remedied in a future patch.

Although the page table checks are technically broken, in practice, it's all but guaranteed to be irrelevant. NPT is required for SEV, i.e. shadowing page tables isn't needed in the common case. Theoretically, the checks could be in play for nested NPT, but it's extremely unlikely that anyone is running nested VMs on SEV, as doing so would require L1 to expose sensitive data to L0, e.g. the entire VMCB. And if anyone is running nested VMs, L0 can't read the guest's encrypted memory, i.e. L1 would need to put its NPT in shared memory, in which case the C-bit will never be set. Or, L1 could use shadow paging, but again, if L0 needs to read page tables, e.g. to load PDPTRs, the memory can't be encrypted if L1 has any expectation of L0 doing the right thing.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: nSVM: Use common GPA helper to check for illegal CR3  (Sean Christopherson; 1 file, -1/+1)
Replace an open coded check for an invalid CR3 with its equivalent helper. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: VMX: Use GPA legality helpers to replace open coded equivalents  (Sean Christopherson; 2 files, -20/+8)
Replace a variety of open coded GPA checks with the recently introduced common helpers. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86: Add a helper to handle legal GPA with an alignment requirement  (Sean Christopherson; 1 file, -1/+7)
Add a helper to genericize checking for a legal GPA that also must conform to an arbitrary alignment, and use it in the existing page_address_valid(). Future patches will replace open coded variants in VMX and SVM. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86: Add a helper to check for a legal GPA  (Sean Christopherson; 1 file, -6/+11)
Add a helper to check for a legal GPA, and use it to consolidate code in existing, related helpers. Future patches will extend usage to VMX and SVM code, properly handle exceptions to the maxphyaddr rule, and add more helpers. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
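A hedged sketch of the kind of helpers this and the preceding entry describe, as standalone C with illustrative names and an explicit maxphyaddr parameter (not the exact KVM signatures):

    #include <stdbool.h>
    #include <stdint.h>

    /* A GPA is legal if no bits above the guest's MAXPHYADDR are set. */
    static bool is_legal_gpa(uint64_t gpa, unsigned int maxphyaddr)
    {
        return !(gpa >> maxphyaddr);
    }

    /* Variant with an alignment requirement (alignment must be a power of two),
     * in the spirit of checks like page_address_valid(). */
    static bool is_legal_aligned_gpa(uint64_t gpa, uint64_t alignment,
                                     unsigned int maxphyaddr)
    {
        return !(gpa & (alignment - 1)) && is_legal_gpa(gpa, maxphyaddr);
    }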
2021-02-04  KVM: nSVM: Don't strip host's C-bit from guest's CR3 when reading PDPTRs  (Sean Christopherson; 1 file, -1/+1)
Don't clear the SME C-bit when reading a guest PDPTR, as the GPA (CR3) is in the guest domain. Barring a bizarre paravirtual use case, this is likely a benign bug. SME is not emulated by KVM, loading SEV guest PDPTRs is doomed as KVM can't use the correct key to read guest memory, and setting guest MAXPHYADDR higher than the host, i.e. overlapping the C-bit, would cause faults in the guest. Note, for SEV guests, stripping the C-bit is technically aligned with CPU behavior, but for KVM it's the greater of two evils. Because KVM doesn't have access to the guest's encryption key, ignoring the C-bit would at best result in KVM reading garbage. By keeping the C-bit, KVM will fail its read (unless userspace creates a memslot with the C-bit set). The guest will still undoubtedly die, as KVM will use '0' for the PDPTR value, but that's preferable to interpreting encrypted data as a PDPTR. Fixes: d0ec49d4de90 ("kvm/x86/svm: Support Secure Memory Encryption within KVM") Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86: Set so called 'reserved CR3 bits in LM mask' at vCPU reset  (Sean Christopherson; 1 file, -0/+1)
Set cr3_lm_rsvd_bits, which is effectively an invalid GPA mask, at vCPU reset. The reserved bits check needs to be done even if userspace never configures the guest's CPUID model. Cc: stable@vger.kernel.org Fixes: 0107973a80ad ("KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86: declare Xen HVM shared info capability and add test case  (David Woodhouse; 1 file, -1/+2)
Instead of adding a plethora of new KVM_CAP_XEN_FOO capabilities, just add bits to the return value of KVM_CAP_XEN_HVM. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: Add event channel interrupt vector upcall  (David Woodhouse; 5 files, -1/+72)
It turns out that we can't handle event channels *entirely* in userspace by delivering them as ExtINT, because KVM is a bit picky about when it accepts ExtINT interrupts from a legacy PIC. The in-kernel local APIC has to have LVT0 configured in APIC_MODE_EXTINT and unmasked, which isn't necessarily the case for Xen guests, especially on secondary CPUs.

To cope with this, add kvm_xen_get_interrupt() which checks the evtchn_pending_upcall field in the Xen vcpu_info, and delivers the Xen upcall vector (configured by KVM_XEN_ATTR_TYPE_UPCALL_VECTOR) if it's set, regardless of LAPIC LVT0 configuration. This gives us the minimum support we need for a completely userspace-based implementation of event channels.

This does mean that vcpu_enter_guest() needs to check for the evtchn_pending_upcall flag being set, because it can't rely on someone having set KVM_REQ_EVENT unless we were to add some way for userspace to do so manually.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
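A hedged sketch of the check described above; the field paths are illustrative of the layout in the commit text (in practice the vcpu_info lives in guest-shared memory and is read through KVM's caching accessors):

    /* Return the upcall vector to inject, or 0 if no Xen upcall is pending. */
    static int kvm_xen_get_interrupt(struct kvm_vcpu *vcpu)
    {
        u8 upcall_vector = vcpu->kvm->arch.xen.upcall_vector;
        struct vcpu_info *vi = vcpu->arch.xen.vcpu_info;   /* illustrative */

        if (!upcall_vector || !vi)
            return 0;

        /* Deliver the configured vector whenever the guest-visible pending
         * flag is set, independent of LVT0 programming. */
        return READ_ONCE(vi->evtchn_upcall_pending) ? upcall_vector : 0;
    }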
2021-02-04  KVM: x86/xen: register vcpu time info region  (Joao Martins; 3 files, -0/+22)
Allow the Xen emulated guest the ability to register secondary vcpu time information. On Xen guests this is used in order to be mapped to userspace and hence allow vdso gettimeofday to work. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: setup pvclock updates  (Joao Martins; 2 files, -15/+21)
Parameterise kvm_setup_pvclock_page() a little bit so that it can be invoked for different gfn_to_hva_cache structures, and with different offsets. Then we can invoke it for the normal KVM pvclock and also for the Xen one in the vcpu_info. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: register vcpu info  (Joao Martins; 2 files, -1/+26)
The vcpu info supersedes the per vcpu area of the shared info page and the guest vcpus will use this instead. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: Add KVM_XEN_VCPU_SET_ATTR/KVM_XEN_VCPU_GET_ATTR  (David Woodhouse; 3 files, -0/+52)
This will be used for per-vCPU setup such as runstate and vcpu_info. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: update wallclock region  (Joao Martins; 3 files, -8/+43)
Wallclock on Xen is written in the shared_info page. To that purpose, export kvm_write_wall_clock() and pass on the GPA of its location to populate the shared_info wall clock data. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  xen: add wc_sec_hi to struct shared_info  (David Woodhouse; 1 file, -0/+3)
Xen added this in 2015 (Xen 4.6). On x86_64 and Arm it fills what was previously a 32-bit hole in the generic shared_info structure; on i386 it had to go at the end of struct arch_shared_info. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: register shared_info page  (Joao Martins; 2 files, -4/+38)
Add KVM_XEN_ATTR_TYPE_SHARED_INFO to allow hypervisor to know where the guest's shared info page is. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: add definitions of compat_shared_info, compat_vcpu_info  (David Woodhouse; 1 file, -0/+36)
There aren't a lot of differences for the things that the kernel needs to care about, but there are a few. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: latch long_mode when hypercall page is set up  (David Woodhouse; 2 files, -1/+21)
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: add KVM_XEN_HVM_SET_ATTR/KVM_XEN_HVM_GET_ATTR  (Joao Martins; 3 files, -0/+52)
This will be used to set up shared info pages etc. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: Add kvm_xen_enabled static key  (David Woodhouse; 3 files, -2/+27)
The code paths for Xen support are all fairly lightweight but if we hide them behind this, they're even *more* lightweight for any system which isn't actually hosting Xen guests. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: Move KVM_XEN_HVM_CONFIG handling to xen.c  (David Woodhouse; 3 files, -13/+20)
This is already more complex than the simple memcpy it originally had. Move it to xen.c with the rest of the Xen support. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: Fix coexistence of Xen and Hyper-V hypercalls  (Joao Martins; 2 files, -11/+35)
Disambiguate Xen vs. Hyper-V calls by adding 'orl $0x80000000, %eax' at the start of the Hyper-V hypercall page when Xen hypercalls are also enabled. That bit is reserved in the Hyper-V ABI, and those hypercall numbers will never be used by Xen (because it does precisely the same trick). Switch to using kvm_vcpu_write_guest() while we're at it, instead of open-coding it. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
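As a hedged sketch of the trick described above (byte values shown for illustration; the actual patching of the Hyper-V hypercall page may differ in detail):

    /* "or $0x80000000, %eax" -- sets a bit that is reserved in the Hyper-V
     * hypercall ABI and never used by Xen, so the hypercall dispatcher can
     * tell the two families apart by inspecting %eax. */
    static const u8 hyperv_page_prefix[] = {
        0x0d, 0x00, 0x00, 0x00, 0x80,   /* or $0x80000000, %eax */
    };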
2021-02-04  KVM: x86/xen: intercept xen hypercalls if enabled  (Joao Martins; 6 files, -32/+224)
Add a new exit reason for emulator to handle Xen hypercalls. Since this means KVM owns the ABI, dispense with the facility for the VMM to provide its own copy of the hypercall pages; just fill them in directly using VMCALL/VMMCALL as we do for the Hyper-V hypercall page. This behaviour is enabled by a new INTERCEPT_HCALL flag in the KVM_XEN_HVM_CONFIG ioctl structure, and advertised by the same flag being returned from the KVM_CAP_XEN_HVM check. Rename xen_hvm_config() to kvm_xen_write_hypercall_page() and move it to the nascent xen.c while we're at it, and add a test case. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: Fix __user pointer handling for hypercall page installation  (David Woodhouse; 1 file, -3/+5)
The address we give to memdup_user() isn't correctly tagged as __user. This is harmless enough as it's a one-off use and we're doing exactly the right thing, but fix it anyway to shut the checker up. Otherwise it'll whine when the (now legacy) code gets moved around in a later patch. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  KVM: x86/xen: fix Xen hypercall page msr handling  (Joao Martins; 1 file, -2/+3)
Xen usually places its MSR at 0x40000000 or 0x40000200 depending on whether it is running in viridian mode or not. Note that this is not ABI guaranteed, so it is possible for Xen to advertise the MSR some place else. Given the way xen_hvm_config() is handled, if the former address is selected, this will conflict with Hyper-V's MSR (HV_X64_MSR_GUEST_OS_ID) which unconditionally uses the same address. Given that the MSR location is arbitrary, move the xen_hvm_config() handling to the top of kvm_set_msr_common() before falling through. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04  x86/ptrace: Clean up PTRACE_GETREGS/PTRACE_PUTREGS regset selection  (Andy Lutomirski; 1 file, -8/+38)
task_user_regset_view() has nonsensical semantics, but those semantics appear to be relied on by existing users of PTRACE_GETREGSET and PTRACE_SETREGSET. (See added comments below for details.) It shouldn't be used for PTRACE_GETREGS or PTRACE_SETREGS, though.

A native 64-bit ptrace() call and an x32 ptrace() call using GETREGS or SETREGS want the 64-bit regset views, and a 32-bit ptrace() call (native or compat) should use the 32-bit regset. task_user_regset_view() almost does this, except that it will malfunction if a ptracer is itself ptraced and the outer ptracer modifies CS on entry to a ptrace() syscall. Hopefully that has never happened. (The compat ptrace() code already hardcoded the 32-bit regset, so this change has no effect on that path.)

Improve the situation and deobfuscate the code by hardcoding the 64-bit view in the x32 ptrace() and selecting the view based on the kernel config in the native ptrace().

I tried to figure out the history behind this API. I naïvely assumed that PTRACE_GETREGSET and PTRACE_SETREGSET were ancient APIs that predated compat, but no. They were introduced by 2225a122ae26 ("ptrace: Add support for generic PTRACE_GETREGSET/PTRACE_SETREGSET") in 2010, and they are simply a poor design.

ELF core dumps have the ELF e_machine field and a bunch of register sets in ELF notes, and the pair (e_machine, NT_XXX) indicates the format of the regset blob. But the new PTRACE_GET/SETREGSET API co-opted the NT_XXX numbering without any way to specify which e_machine was in effect. This is especially bad on x86, where a process can freely switch between 32-bit and 64-bit mode, and, in fact, the PTRACE_SETREGSET call itself can cause this switch to happen. Oops.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/9daa791d0c7eaebd59c5bc2b2af1b0e7bebe707d.1612375698.git.luto@kernel.org
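A hedged sketch of the selection described above (the view names follow arch/x86/kernel/ptrace.c conventions but are illustrative, not guaranteed to match the final code exactly):

    /* For GETREGS/SETREGS the regset view depends only on how the kernel was
     * built, never on the tracer's current CS. */
    static const struct user_regset_view *ptrace_regs_view(void)
    {
    #ifdef CONFIG_X86_64
        return &user_x86_64_view;   /* also used by the x32 ptrace() path */
    #else
        return &user_x86_32_view;
    #endif
    }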
2021-02-04  KVM: x86/mmu: Allow parallel page faults for the TDP MMU  (Ben Gardon; 1 file, -2/+10)
Make the last few changes necessary to enable the TDP MMU to handle page faults in parallel while holding the mmu_lock in read mode. Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-24-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86/mmu: Mark SPTEs in disconnected pages as removed  (Ben Gardon; 1 file, -6/+30)
When clearing TDP MMU pages that have been disconnected from the paging structure root, set the SPTEs to a special non-present value which will not be overwritten by other threads. This is needed to prevent races in which a thread is clearing a disconnected page table, but another thread has already acquired a pointer to that memory and installs a mapping in an already cleared entry. This can lead to memory leaks and accounting errors.

Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210202185734.1680553-23-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler  (Ben Gardon; 2 files, -10/+74)
When the TDP MMU is allowed to handle page faults in parallel, there is the possibility of a race where an SPTE is cleared and then immediately replaced with a present SPTE pointing to a different PFN, before the TLBs can be flushed. This race would violate architectural specs. Ensure that the TLBs are flushed properly before other threads are allowed to install any present value for the SPTE.

Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210202185734.1680553-22-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map  (Ben Gardon; 2 files, -33/+122)
To prepare for handling page faults in parallel, change the TDP MMU page fault handler to use atomic operations to set SPTEs so that changes are not lost if multiple threads attempt to modify the same SPTE. Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-21-bgardon@google.com> [Document new locking rules. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
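A minimal sketch of the atomic-update idea (simplified signature; the real TDP MMU code works on an iterator and handles retry and bookkeeping):

    /* Install new_spte only if the SPTE still holds the value the walker read
     * earlier; a false return means another thread won the race and the fault
     * handler must re-read the SPTE and retry. */
    static bool tdp_mmu_set_spte_atomic(u64 *sptep, u64 old_spte, u64 new_spte)
    {
        return cmpxchg64(sptep, old_spte, new_spte) == old_spte;
    }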
2021-02-04  KVM: x86/mmu: Factor out functions to add/remove TDP MMU pages  (Ben Gardon; 1 file, -8/+39)
Move the work of adding and removing TDP MMU pages to/from "secondary" data structures to helper functions. These functions will be built on in future commits to enable MMU operations to proceed (mostly) in parallel. No functional change expected. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-20-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86/mmu: Use an rwlock for the x86 MMU  (Ben Gardon; 6 files, -65/+67)
Add a read / write lock to be used in place of the MMU spinlock on x86. The rwlock will enable the TDP MMU to handle page faults, and other operations in parallel in future commits. Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-19-bgardon@google.com> [Introduce virt/kvm/mmu_lock.h - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86/mmu: Protect TDP MMU page table memory with RCU  (Ben Gardon; 4 files, -21/+103)
In order to enable concurrent modifications to the paging structures in the TDP MMU, threads must be able to safely remove pages of page table memory while other threads are traversing the same memory. To ensure threads do not access PT memory after it is freed, protect PT memory with RCU.

Protecting concurrent accesses to page table memory from use-after-free bugs could also have been accomplished using walk_shadow_page_lockless_begin/end() and READING_SHADOW_PAGE_TABLES, coupled with the barriers in a TLB flush. The use of RCU for this case has several distinct advantages over that approach:

1. Disabling interrupts for long running operations is not desirable. Future commits will allow operations besides page faults to operate without the exclusive protection of the MMU lock and those operations are too long to disable interrupts for their duration.

2. The use of RCU here avoids long blocking / spinning operations in performance critical paths. By freeing memory with an asynchronous RCU API we avoid the longer wait times TLB flushes experience when overlapping with a thread in walk_shadow_page_lockless_begin/end().

3. RCU provides a separation of concerns when removing memory from the paging structure. Because the RCU callback to free memory can be scheduled immediately after a TLB flush, there's no need for the thread to manually free a queue of pages later, as commit_zap_pages does.

Fixes: 95fb5b0258b7 ("kvm: x86/mmu: Support MMIO in the TDP MMU")
Reviewed-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210202185734.1680553-18-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86/mmu: Clear dirtied pages mask bit before early break  (Ben Gardon; 1 file, -2/+2)
In clear_dirty_pt_masked, the loop is intended to exit early after processing each of the GFNs with corresponding bits set in mask. This does not work as intended if another thread has already cleared the dirty bit or writable bit on the SPTE. In that case, the loop would proceed to the next iteration early and the bit in mask would not be cleared. As a result the loop could not exit early and would proceed uselessly. Move the unsetting of the mask bit before the check for a no-op SPTE change. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-17-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86/mmu: Skip no-op changes in TDP MMU functions  (Ben Gardon; 1 file, -2/+4)
Skip setting SPTEs if no change is expected. Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-16-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04  KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed  (Ben Gardon; 1 file, -10/+22)
Given certain conditions, some TDP MMU functions may not yield reliably / frequently enough. For example, if a paging structure was very large but had few, if any, writable entries, wrprot_gfn_range could traverse many entries before finding a writable entry and yielding, because the check for yielding only happens after an SPTE is modified. Fix this issue by moving the yield to the beginning of the loop.

Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210202185734.1680553-15-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
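A hedged sketch of the reordering in a write-protection loop (the iterator macro and helper names follow the TDP MMU code, but the body is illustrative rather than the exact patch):

    tdp_root_for_each_pte(iter, root, start, end) {
        /* Yield first, so long stretches of entries that need no change
         * cannot starve other mmu_lock waiters. */
        if (can_yield && tdp_mmu_iter_cond_resched(kvm, &iter))
            continue;

        if (!is_shadow_present_pte(iter.old_spte) ||
            !is_writable_pte(iter.old_spte))
            continue;

        tdp_mmu_set_spte_no_dirty_log(kvm, &iter,
                                      iter.old_spte & ~PT_WRITABLE_MASK);
        spte_set = true;
    }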