summaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2021-05-07Merge tag 'io_uring-5.13-2021-05-07' of git://git.kernel.dk/linux-blockLinus Torvalds3-19/+70
Pull io_uring fixes from Jens Axboe: "Mostly fixes for merge window merged code. In detail: - Error case memory leak fixes (Colin, Zqiang) - Add the tools/io_uring/ to the list of maintained files (Lukas) - Set of fixes for the modified buffer registration API (Pavel) - Sanitize io thread setup on x86 (Stefan) - Ensure we truncate transfer count for registered buffers (Thadeu)" * tag 'io_uring-5.13-2021-05-07' of git://git.kernel.dk/linux-block: x86/process: setup io_threads more like normal user space threads MAINTAINERS: add io_uring tool to IO_URING io_uring: truncate lengths larger than MAX_RW_COUNT on provide buffers io_uring: Fix memory leak in io_sqe_buffers_register() io_uring: Fix premature return from loop and memory leak io_uring: fix unchecked error in switch_start() io_uring: allow empty slots for reg buffers io_uring: add more build check for uapi io_uring: dont overlap internal and user req flags io_uring: fix drain with rsrc CQEs
2021-05-07Merge tag 'nfs-for-5.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds48-715/+1112
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - Add validation of the UDP retrans parameter to prevent shift out-of-bounds - Don't discard pNFS layout segments that are marked for return Bugfixes: - Fix a NULL dereference crash in xprt_complete_bc_request() when the NFSv4.1 server misbehaves. - Fix the handling of NFS READDIR cookie verifiers - Sundry fixes to ensure attribute revalidation works correctly when the server does not return post-op attributes. - nfs4_bitmask_adjust() must not change the server global bitmasks - Fix major timeout handling in the RPC code. - NFSv4.2 fallocate() fixes. - Fix the NFSv4.2 SEEK_HOLE/SEEK_DATA end-of-file handling - Copy offload attribute revalidation fixes - Fix an incorrect filehandle size check in the pNFS flexfiles driver - Fix several RDMA transport setup/teardown races - Fix several RDMA queue wrapping issues - Fix a misplaced memory read barrier in sunrpc's call_decode() Features: - Micro optimisation of the TCP transmission queue using TCP_CORK - statx() performance improvements by further splitting up the tracking of invalid cached file metadata. - Support the NFSv4.2 'change_attr_type' attribute and use it to optimise handling of change attribute updates" * tag 'nfs-for-5.13-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (85 commits) xprtrdma: Fix a NULL dereference in frwr_unmap_sync() sunrpc: Fix misplaced barrier in call_decode NFSv4.2: Remove ifdef CONFIG_NFSD from NFSv4.2 client SSC code. xprtrdma: Move fr_mr field to struct rpcrdma_mr xprtrdma: Move the Work Request union to struct rpcrdma_mr xprtrdma: Move fr_linv_done field to struct rpcrdma_mr xprtrdma: Move cqe to struct rpcrdma_mr xprtrdma: Move fr_cid to struct rpcrdma_mr xprtrdma: Remove the RPC/RDMA QP event handler xprtrdma: Don't display r_xprt memory addresses in tracepoints xprtrdma: Add an rpcrdma_mr_completion_class xprtrdma: Add tracepoints showing FastReg WRs and remote invalidation xprtrdma: Avoid Send Queue wrapping xprtrdma: Do not wake RPC consumer on a failed LocalInv xprtrdma: Do not recycle MR after FastReg/LocalInv flushes xprtrdma: Clarify use of barrier in frwr_wc_localinv_done() xprtrdma: Rename frwr_release_mr() xprtrdma: rpcrdma_mr_pop() already does list_del_init() xprtrdma: Delete rpcrdma_recv_buffer_put() xprtrdma: Fix cwnd update ordering ...
2021-05-07Merge tag '9p-for-5.13-rc1' of git://github.com/martinetd/linuxLinus Torvalds2-3/+3
Pull 9p updates from Dominique Martinet: "An error handling fix and constification" * tag '9p-for-5.13-rc1' of git://github.com/martinetd/linux: fs: 9p: fix v9fs_file_open writeback fid error check 9p: Constify static struct v9fs_attr_group
2021-05-07i40e: Remove LLDP frame filtersArkadiusz Kubalewski3-44/+0
Remove filters from being setup in case of software DCB and allow the LLDP frames to be properly transmitted to the wire. It is not possible to transmit the LLDP frame out of the port, if they are filtered by control VSI. This prohibits software LLDP agent properly communicate its DCB capabilities to the neighbors. Fixes: 4b208eaa8078 ("i40e: Add init and default config of software based DCB") Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Tested-by: Imam Hassan Reza Biswas <imam.hassan.reza.biswas@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-05-07i40e: Fix PHY type identifiers for 2.5G and 5G adaptersMateusz Palczewski4-11/+10
Unlike other supported adapters, 2.5G and 5G use different PHY type identifiers for reading/writing PHY settings and for reading link status. This commit introduces separate PHY identifiers for these two operation types. Fixes: 2e45d3f4677a ("i40e: Add support for X710 B/P & SFP+ cards") Signed-off-by: Dawid Lukwinski <dawid.lukwinski@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Tested-by: Dave Switzer <david.switzer@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-05-07i40e: fix the restart auto-negotiation after FEC modifiedJaroslaw Gawin1-1/+2
When FEC mode was changed the link didn't know it because the link was not reset and new parameters were not negotiated. Set a flag 'I40E_AQ_PHY_ENABLE_ATOMIC_LINK' in 'abilities' to restart the link and make it run with the new settings. Fixes: 1d96340196f1 ("i40e: Add support FEC configuration for Fortville 25G") Signed-off-by: Jaroslaw Gawin <jaroslawx.gawin@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Dave Switzer <david.switzer@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-05-07i40e: Fix use-after-free in i40e_client_subtask()Yunjian Wang1-0/+1
Currently the call to i40e_client_del_instance frees the object pf->cinst, however pf->cinst->lan_info is being accessed after the free. Fix this by adding the missing return. Addresses-Coverity: ("Read from pointer after free") Fixes: 7b0b1a6d0ac9 ("i40e: Disable iWARP VSI PETCP_ENA flag on netdev down events") Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-05-07i40e: fix broken XDP supportMagnus Karlsson1-6/+2
Commit 12738ac4754e ("i40e: Fix sparse errors in i40e_txrx.c") broke XDP support in the i40e driver. That commit was fixing a sparse error in the code by introducing a new variable xdp_res instead of overloading this into the skb pointer. The problem is that the code later uses the skb pointer in if statements and these where not extended to also test for the new xdp_res variable. Fix this by adding the correct tests for xdp_res in these places. The skb pointer was used to store the result of the XDP program by overloading the results in the error pointer ERR_PTR(-result). Therefore, the allocation failure test that used to only test for !skb now need to be extended to also consider !xdp_res. i40e_cleanup_headers() had a check that based on the skb value being an error pointer, i.e. a result from the XDP program != XDP_PASS, and if so start to process a new packet immediately, instead of populating skb fields and sending the skb to the stack. This check is not needed anymore, since we have added an explicit test for xdp_res being set and if so just do continue to pick the next packet from the NIC. Fixes: 12738ac4754e ("i40e: Fix sparse errors in i40e_txrx.c") Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com> Reported-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2021-05-07KVM: SVM: Move GHCB unmapping to fix RCU warningTom Lendacky3-4/+5
When an SEV-ES guest is running, the GHCB is unmapped as part of the vCPU run support. However, kvm_vcpu_unmap() triggers an RCU dereference warning with CONFIG_PROVE_LOCKING=y because the SRCU lock is released before invoking the vCPU run support. Move the GHCB unmapping into the prepare_guest_switch callback, which is invoked while still holding the SRCU lock, eliminating the RCU dereference warning. Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Reported-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Message-Id: <b2f9b79d15166f2c3e4375c0d9bc3268b7696455.1620332081.git.thomas.lendacky@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: SVM: Invert user pointer casting in SEV {en,de}crypt helpersSean Christopherson1-13/+11
Invert the user pointer params for SEV's helpers for encrypting and decrypting guest memory so that they take a pointer and cast to an unsigned long as necessary, as opposed to doing the opposite. Tagging a non-pointer as __user is confusing and weird since a cast of some form needs to occur to actually access the user data. This also fixes Sparse warnings triggered by directly consuming the unsigned longs, which are "noderef" due to the __user tag. Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210506231542.2331138-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07kvm: Cap halt polling at kvm->max_halt_poll_nsDavid Matlack1-2/+2
When growing halt-polling, there is no check that the poll time exceeds the per-VM limit. It's possible for vcpu->halt_poll_ns to grow past kvm->max_halt_poll_ns and stay there until a halt which takes longer than kvm->halt_poll_ns. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Venkatesh Srinivas <venkateshs@chromium.org> Message-Id: <20210506152442.4010298-1-venkateshs@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07tools/kvm_stat: Fix documentation typoStefan Raspl1-1/+1
Makes the dash in front of option '-z' disappear in the generated man-page. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Message-Id: <20210506140352.4178789-1-raspl@linux.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: x86: Prevent deadlock against tk_core.seqThomas Gleixner1-4/+18
syzbot reported a possible deadlock in pvclock_gtod_notify(): CPU 0 CPU 1 write_seqcount_begin(&tk_core.seq); pvclock_gtod_notify() spin_lock(&pool->lock); queue_work(..., &pvclock_gtod_work) ktime_get() spin_lock(&pool->lock); do { seq = read_seqcount_begin(tk_core.seq) ... } while (read_seqcount_retry(&tk_core.seq, seq); While this is unlikely to happen, it's possible. Delegate queue_work() to irq_work() which postpones it until the tk_core.seq write held region is left and interrupts are reenabled. Fixes: 16e8d74d2da9 ("KVM: x86: notifier for clocksource changes") Reported-by: syzbot+6beae4000559d41d80f8@syzkaller.appspotmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Message-Id: <87h7jgm1zy.ffs@nanos.tec.linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: x86: Cancel pvclock_gtod_work on module removalThomas Gleixner1-0/+1
Nothing prevents the following: pvclock_gtod_notify() queue_work(system_long_wq, &pvclock_gtod_work); ... remove_module(kvm); ... work_queue_run() pvclock_gtod_work() <- UAF Ditto for any other operation on that workqueue list head which touches pvclock_gtod_work after module removal. Cancel the work in kvm_arch_exit() to prevent that. Fixes: 16e8d74d2da9 ("KVM: x86: notifier for clocksource changes") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Message-Id: <87czu4onry.ffs@nanos.tec.linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: x86: Prevent KVM SVM from loading on kernels with 5-level pagingSean Christopherson3-11/+16
Disallow loading KVM SVM if 5-level paging is supported. In theory, NPT for L1 should simply work, but there unknowns with respect to how the guest's MAXPHYADDR will be handled by hardware. Nested NPT is more problematic, as running an L1 VMM that is using 2-level page tables requires stacking single-entry PDP and PML4 tables in KVM's NPT for L2, as there are no equivalent entries in L1's NPT to shadow. Barring hardware magic, for 5-level paging, KVM would need stack another layer to handle PML5. Opportunistically rename the lm_root pointer, which is used for the aforementioned stacking when shadowing 2-level L1 NPT, to pml4_root to call out that it's specifically for PML4. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210505204221.1934471-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: X86: Expose bus lock debug exception to guestPaolo Bonzini3-1/+7
Bus lock debug exception is an ability to notify the kernel by an #DB trap after the instruction acquires a bus lock and is executed when CPL>0. This allows the kernel to enforce user application throttling or mitigations. Existence of bus lock debug exception is enumerated via CPUID.(EAX=7,ECX=0).ECX[24]. Software can enable these exceptions by setting bit 2 of the MSR_IA32_DEBUGCTL. Expose the CPUID to guest and emulate the MSR handling when guest enables it. Support for this feature was originally developed by Xiaoyao Li and Chenyi Qiang, but code has since changed enough that this patch has nothing in common with theirs, except for this commit message. Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Message-Id: <20210202090433.13441-4-chenyi.qiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: X86: Add support for the emulation of DR6_BUS_LOCK bitChenyi Qiang2-1/+5
Bus lock debug exception introduces a new bit DR6_BUS_LOCK (bit 11 of DR6) to indicate that bus lock #DB exception is generated. The set/clear of DR6_BUS_LOCK is similar to the DR6_RTM. The processor clears DR6_BUS_LOCK when the exception is generated. For all other #DB, the processor sets this bit to 1. Software #DB handler should set this bit before returning to the interrupted task. In VMM, to avoid breaking the CPUs without bus lock #DB exception support, activate the DR6_BUS_LOCK conditionally in DR6_FIXED_1 bits. When intercepting the #DB exception caused by bus locks, bit 11 of the exit qualification is set to identify it. The VMM should emulate the exception by clearing the bit 11 of the guest DR6. Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Message-Id: <20210202090433.13441-3-chenyi.qiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: PPC: Book3S HV: Fix conversion to gfn-based MMU notifier callbacksNicholas Piggin3-17/+36
Commit b1c5356e873c ("KVM: PPC: Convert to the gfn-based MMU notifier callbacks") causes unmap_gfn_range and age_gfn callbacks to only work on the first gfn in the range. It also makes the aging callbacks call into both radix and hash aging functions for radix guests. Fix this. Add warnings for the single-gfn calls that have been converted to range callbacks, in case they ever receieve ranges greater than 1. Fixes: b1c5356e873c ("KVM: PPC: Convert to the gfn-based MMU notifier callbacks") Reported-by: Bharata B Rao <bharata@linux.ibm.com> Tested-by: Bharata B Rao <bharata@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20210505121509.1470207-1-npiggin@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: x86: Hide RDTSCP and RDPID if MSR_TSC_AUX probing failedSean Christopherson1-0/+15
If probing MSR_TSC_AUX failed, hide RDTSCP and RDPID, and WARN if either feature was reported as supported. In theory, such a scenario should never happen as both Intel and AMD state that MSR_TSC_AUX is available if RDTSCP or RDPID is supported. But, KVM injects #GP on MSR_TSC_AUX accesses if probing failed, faults on WRMSR(MSR_TSC_AUX) may be fatal to the guest (because they happen during early CPU bringup), and KVM itself has effectively misreported RDPID support in the past. Note, this also has the happy side effect of omitting MSR_TSC_AUX from the list of MSRs that are exposed to userspace if probing the MSR fails. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: x86: Tie Intel and AMD behavior for MSR_TSC_AUX to guest CPU modelSean Christopherson4-39/+41
Squish the Intel and AMD emulation of MSR_TSC_AUX together and tie it to the guest CPU model instead of the host CPU behavior. While not strictly necessary to avoid guest breakage, emulating cross-vendor "architecture" will provide consistent behavior for the guest, e.g. WRMSR fault behavior won't change if the vCPU is migrated to a host with divergent behavior. Note, the "new" kvm_is_supported_user_return_msr() checks do not add new functionality on either SVM or VMX. On SVM, the equivalent was "tsc_aux_uret_slot < 0", and on VMX the check was buried in the vmx_find_uret_msr() call at the find_uret_msr label. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: x86: Move uret MSR slot management to common x86Sean Christopherson4-29/+17
Now that SVM and VMX both probe MSRs before "defining" user return slots for them, consolidate the code for probe+define into common x86 and eliminate the odd behavior of having the vendor code define the slot for a given MSR. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: x86: Export the number of uret MSRs to vendor modulesSean Christopherson2-16/+14
Split out and export the number of configured user return MSRs so that VMX can iterate over the set of MSRs without having to do its own tracking. Keep the list itself internal to x86 so that vendor code still has to go through the "official" APIs to add/modify entries. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: VMX: Disable loading of TSX_CTRL MSR the more conventional waySean Christopherson1-12/+10
Tag TSX_CTRL as not needing to be loaded when RTM isn't supported in the host. Crushing the write mask to '0' has the same effect, but requires more mental gymnastics to understand. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: VMX: Use common x86's uret MSR list as the one true listSean Christopherson3-57/+53
Drop VMX's global list of user return MSRs now that VMX doesn't resort said list to isolate "active" MSRs, i.e. now that VMX's list and x86's list have the same MSRs in the same order. In addition to eliminating the redundant list, this will also allow moving more of the list management into common x86. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: VMX: Use flag to indicate "active" uret MSRs instead of sorting listSean Christopherson2-40/+42
Explicitly flag a uret MSR as needing to be loaded into hardware instead of resorting the list of "active" MSRs and tracking how many MSRs in total need to be loaded. The only benefit to sorting the list is that the loop to load MSRs during vmx_prepare_switch_to_guest() doesn't need to iterate over all supported uret MRS, only those that are active. But that is a pointless optimization, as the most common case, running a 64-bit guest, will load the vast majority of MSRs. Not to mention that a single WRMSR is far more expensive than iterating over the list. Providing a stable list order obviates the need to track a given MSR's "slot" in the per-CPU list of user return MSRs; all lists simply use the same ordering. Future patches will take advantage of the stable order to further simplify the related code. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: VMX: Configure list of user return MSRs at module initSean Christopherson2-21/+50
Configure the list of user return MSRs that are actually supported at module init instead of reprobing the list of possible MSRs every time a vCPU is created. Curating the list on a per-vCPU basis is pointless; KVM is completely hosed if the set of supported MSRs changes after module init, or if the set of MSRs differs per physical PCU. The per-vCPU lists also increase complexity (see __vmx_find_uret_msr()) and creates corner cases that _should_ be impossible, but theoretically exist in KVM, e.g. advertising RDTSCP to userspace without actually being able to virtualize RDTSCP if probing MSR_TSC_AUX fails. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: x86: Add support for RDPID without RDTSCPSean Christopherson3-7/+29
Allow userspace to enable RDPID for a guest without also enabling RDTSCP. Aside from checking for RDPID support in the obvious flows, VMX also needs to set ENABLE_RDTSCP=1 when RDPID is exposed. For the record, there is no known scenario where enabling RDPID without RDTSCP is desirable. But, both AMD and Intel architectures allow for the condition, i.e. this is purely to make KVM more architecturally accurate. Fixes: 41cd02c6f7f6 ("kvm: x86: Expose RDPID in KVM_GET_SUPPORTED_CPUID") Cc: stable@vger.kernel.org Reported-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: SVM: Probe and load MSR_TSC_AUX regardless of RDTSCP support in hostSean Christopherson1-8/+10
Probe MSR_TSC_AUX whether or not RDTSCP is supported in the host, and if probing succeeds, load the guest's MSR_TSC_AUX into hardware prior to VMRUN. Because SVM doesn't support interception of RDPID, RDPID cannot be disallowed in the guest (without resorting to binary translation). Leaving the host's MSR_TSC_AUX in hardware would leak the host's value to the guest if RDTSCP is not supported. Note, there is also a kernel bug that prevents leaking the host's value. The host kernel initializes MSR_TSC_AUX if and only if RDTSCP is supported, even though the vDSO usage consumes MSR_TSC_AUX via RDPID. I.e. if RDTSCP is not supported, there is no host value to leak. But, if/when the host kernel bug is fixed, KVM would start leaking MSR_TSC_AUX in the case where hardware supports RDPID but RDTSCP is unavailable for whatever reason. Probing MSR_TSC_AUX will also allow consolidating the probe and define logic in common x86, and will make it simpler to condition the existence of MSR_TSX_AUX (from the guest's perspective) on RDTSCP *or* RDPID. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: VMX: Disable preemption when probing user return MSRsSean Christopherson3-4/+18
Disable preemption when probing a user return MSR via RDSMR/WRMSR. If the MSR holds a different value per logical CPU, the WRMSR could corrupt the host's value if KVM is preempted between the RDMSR and WRMSR, and then rescheduled on a different CPU. Opportunistically land the helper in common x86, SVM will use the helper in a future commit. Fixes: 4be534102624 ("KVM: VMX: Initialize vmx->guest_msrs[] right after allocation") Cc: stable@vger.kernel.org Cc: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-6-seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: x86: Move RDPID emulation intercept to its own enumSean Christopherson3-2/+4
Add a dedicated intercept enum for RDPID instead of piggybacking RDTSCP. Unlike VMX's ENABLE_RDTSCP, RDPID is not bound to SVM's RDTSCP intercept. Fixes: fb6d4d340e05 ("KVM: x86: emulate RDPID") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-5-seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: SVM: Inject #UD on RDTSCP when it should be disabled in the guestSean Christopherson1-4/+13
Intercept RDTSCP to inject #UD if RDTSC is disabled in the guest. Note, SVM does not support intercepting RDPID. Unlike VMX's ENABLE_RDTSCP control, RDTSCP interception does not apply to RDPID. This is a benign virtualization hole as the host kernel (incorrectly) sets MSR_TSC_AUX if RDTSCP is supported, and KVM loads the guest's MSR_TSC_AUX into hardware if RDTSCP is supported in the host, i.e. KVM will not leak the host's MSR_TSC_AUX to the guest. But, when the kernel bug is fixed, KVM will start leaking the host's MSR_TSC_AUX if RDPID is supported in hardware, but RDTSCP isn't available for whatever reason. This leak will be remedied in a future commit. Fixes: 46896c73c1a4 ("KVM: svm: add support for RDTSCP") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-4-seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: x86: Emulate RDPID only if RDTSCP is supportedSean Christopherson1-1/+2
Do not advertise emulation support for RDPID if RDTSCP is unsupported. RDPID emulation subtly relies on MSR_TSC_AUX to exist in hardware, as both vmx_get_msr() and svm_get_msr() will return an error if the MSR is unsupported, i.e. ctxt->ops->get_msr() will fail and the emulator will inject a #UD. Note, RDPID emulation also relies on RDTSCP being enabled in the guest, but this is a KVM bug and will eventually be fixed. Fixes: fb6d4d340e05 ("KVM: x86: emulate RDPID") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-3-seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: VMX: Do not advertise RDPID if ENABLE_RDTSCP control is unsupportedSean Christopherson1-2/+4
Clear KVM's RDPID capability if the ENABLE_RDTSCP secondary exec control is unsupported. Despite being enumerated in a separate CPUID flag, RDPID is bundled under the same VMCS control as RDTSCP and will #UD in VMX non-root if ENABLE_RDTSCP is not enabled. Fixes: 41cd02c6f7f6 ("kvm: x86: Expose RDPID in KVM_GET_SUPPORTED_CPUID") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210504171734.1434054-2-seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Reiji Watanabe <reijiw@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: nSVM: remove a warning about vmcb01 VM exit reasonMaxim Levitsky1-1/+0
While in most cases, when returning to use the VMCB01, the exit reason stored in it will be SVM_EXIT_VMRUN, on first VM exit after a nested migration this field can contain anything since the VM entry did happen before the migration. Remove this warning to avoid the false positive. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210504143936.1644378-3-mlevitsk@redhat.com> Fixes: 9a7de6ecc3ed ("KVM: nSVM: If VMRUN is single-stepped, queue the #DB intercept in nested_svm_vmexit()") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: nSVM: always restore the L1's GIF on migrationMaxim Levitsky1-0/+2
While usually the L1's GIF is set while L2 runs, and usually migration nested state is loaded after a vCPU reset which also sets L1's GIF to true, this is not guaranteed. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210504143936.1644378-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: x86: Hoist input checks in kvm_add_msr_filter()Siddharth Chandrasekaran1-19/+7
In ioctl KVM_X86_SET_MSR_FILTER, input from user space is validated after a memdup_user(). For invalid inputs we'd memdup and then call kfree unnecessarily. Hoist input validation to avoid kfree altogether. Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de> Message-Id: <20210503122111.13775-1-sidcha@amazon.de> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07selftests: kvm: remove reassignment of non-absolute variablesBill Wendling1-2/+2
Clang's integrated assembler does not allow symbols with non-absolute values to be reassigned. Modify the interrupt entry loop macro to be compatible with IAS by using a label and an offset. Cc: Jian Cai <caij2003@gmail.com> Signed-off-by: Bill Wendling <morbo@google.com> References: https://lore.kernel.org/lkml/20200714233024.1789985-1-caij2003@gmail.com/ Message-Id: <20201211012317.3722214-1-morbo@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: nVMX: Properly pad 'struct kvm_vmx_nested_state_hdr'Vitaly Kuznetsov1-0/+2
Eliminate the probably unwanted hole in 'struct kvm_vmx_nested_state_hdr': Pre-patch: struct kvm_vmx_nested_state_hdr { __u64 vmxon_pa; /* 0 8 */ __u64 vmcs12_pa; /* 8 8 */ struct { __u16 flags; /* 16 2 */ } smm; /* 16 2 */ /* XXX 2 bytes hole, try to pack */ __u32 flags; /* 20 4 */ __u64 preemption_timer_deadline; /* 24 8 */ }; Post-patch: struct kvm_vmx_nested_state_hdr { __u64 vmxon_pa; /* 0 8 */ __u64 vmcs12_pa; /* 8 8 */ struct { __u16 flags; /* 16 2 */ } smm; /* 16 2 */ __u16 pad; /* 18 2 */ __u32 flags; /* 20 4 */ __u64 preemption_timer_deadline; /* 24 8 */ }; Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210503150854.1144255-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: selftests: evmcs_test: Check that VMCS12 is alway properly synced to ↵Vitaly Kuznetsov1-11/+55
eVMCS after restore Add a test for the regression, introduced by commit f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit"). When L2->L1 exit is forced immediately after restoring nested state, KVM_REQ_GET_NESTED_STATE_PAGES request is cleared and VMCS12 changes (e.g. fresh RIP) are not reflected to eVMCS. The consequent nested vCPU run gets broken. Utilize NMI injection to do the job. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210505151823.1341678-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: selftests: evmcs_test: Check that VMLAUNCH with bogus EVMPTR is causing #UDVitaly Kuznetsov1-8/+16
'run->exit_reason == KVM_EXIT_SHUTDOWN' check is not ideal as we may be getting some unexpected exception. Directly check for #UD instead. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210505151823.1341678-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07KVM: nVMX: Always make an attempt to map eVMCS after migrationVitaly Kuznetsov1-10/+19
When enlightened VMCS is in use and nested state is migrated with vmx_get_nested_state()/vmx_set_nested_state() KVM can't map evmcs page right away: evmcs gpa is not 'struct kvm_vmx_nested_state_hdr' and we can't read it from VP assist page because userspace may decide to restore HV_X64_MSR_VP_ASSIST_PAGE after restoring nested state (and QEMU, for example, does exactly that). To make sure eVMCS is mapped /vmx_set_nested_state() raises KVM_REQ_GET_NESTED_STATE_PAGES request. Commit f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit") added KVM_REQ_GET_NESTED_STATE_PAGES clearing to nested_vmx_vmexit() to make sure MSR permission bitmap is not switched when an immediate exit from L2 to L1 happens right after migration (caused by a pending event, for example). Unfortunately, in the exact same situation we still need to have eVMCS mapped so nested_sync_vmcs12_to_shadow() reflects changes in VMCS12 to eVMCS. As a band-aid, restore nested_get_evmcs_page() when clearing KVM_REQ_GET_NESTED_STATE_PAGES in nested_vmx_vmexit(). The 'fix' is far from being ideal as we can't easily propagate possible failures and even if we could, this is most likely already too late to do so. The whole 'KVM_REQ_GET_NESTED_STATE_PAGES' idea for mapping eVMCS after migration seems to be fragile as we diverge too much from the 'native' path when vmptr loading happens on vmx_set_nested_state(). Fixes: f2c7ef3ba955 ("KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210503150854.1144255-2-vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07doc/kvm: Fix wrong entry for KVM_CAP_X86_MSR_FILTERSiddharth Chandrasekaran1-2/+2
The capability that exposes new ioctl KVM_X86_SET_MSR_FILTER to userspace is specified incorrectly as the ioctl itself (instead of KVM_CAP_X86_MSR_FILTER). This patch fixes it. Fixes: 1a155254ff93 ("KVM: x86: Introduce MSR filtering") Reviewed-by: Alexander Graf <graf@amazon.de> Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de> Message-Id: <20210503120059.9283-1-sidcha@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07x86/kvm: Unify kvm_pv_guest_cpu_reboot() with kvm_guest_cpu_offline()Vitaly Kuznetsov1-25/+17
Simplify the code by making PV features shutdown happen in one place. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07x86/kvm: Disable all PV features on crashVitaly Kuznetsov3-39/+32
Crash shutdown handler only disables kvmclock and steal time, other PV features remain active so we risk corrupting memory or getting some side-effects in kdump kernel. Move crash handler to kvm.c and unify with CPU offline. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07x86/kvm: Disable kvmclock on all CPUs on shutdownVitaly Kuznetsov3-6/+4
Currenly, we disable kvmclock from machine_shutdown() hook and this only happens for boot CPU. We need to disable it for all CPUs to guard against memory corruption e.g. on restore from hibernate. Note, writing '0' to kvmclock MSR doesn't clear memory location, it just prevents hypervisor from updating the location so for the short while after write and while CPU is still alive, the clock remains usable and correct so we don't need to switch to some other clocksource. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07x86/kvm: Teardown PV features on boot CPU as wellVitaly Kuznetsov1-16/+40
Various PV features (Async PF, PV EOI, steal time) work through memory shared with hypervisor and when we restore from hibernation we must properly teardown all these features to make sure hypervisor doesn't write to stale locations after we jump to the previously hibernated kernel (which can try to place anything there). For secondary CPUs the job is already done by kvm_cpu_down_prepare(), register syscore ops to do the same for boot CPU. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210414123544.1060604-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-05-07netfilter: nftables: avoid potential overflows on 32bit archesEric Dumazet2-7/+10
User space could ask for very large hash tables, we need to make sure our size computations wont overflow. nf_tables_newset() needs to double check the u64 size will fit into size_t field. Fixes: 0ed6389c483d ("netfilter: nf_tables: rename set implementations") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-07netfilter: nftables: avoid overflows in nft_hash_buckets()Eric Dumazet1-1/+9
Number of buckets being stored in 32bit variables, we have to ensure that no overflows occur in nft_hash_buckets() syzbot injected a size == 0x40000000 and reported: UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 shift exponent 64 is too large for 64-bit type 'long unsigned int' CPU: 1 PID: 29539 Comm: syz-executor.4 Not tainted 5.12.0-rc7-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x141/0x1d7 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:327 __roundup_pow_of_two include/linux/log2.h:57 [inline] nft_hash_buckets net/netfilter/nft_set_hash.c:411 [inline] nft_hash_estimate.cold+0x19/0x1e net/netfilter/nft_set_hash.c:652 nft_select_set_ops net/netfilter/nf_tables_api.c:3586 [inline] nf_tables_newset+0xe62/0x3110 net/netfilter/nf_tables_api.c:4322 nfnetlink_rcv_batch+0xa09/0x24b0 net/netfilter/nfnetlink.c:488 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:612 [inline] nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:630 netlink_unicast_kernel net/netlink/af_netlink.c:1312 [inline] netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1338 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1927 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:674 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2350 ___sys_sendmsg+0xf3/0x170 net/socket.c:2404 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2433 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 Fixes: 0ed6389c483d ("netfilter: nf_tables: rename set implementations") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2021-05-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds340-2093/+1323
Merge yet more updates from Andrew Morton: "This is everything else from -mm for this merge window. 90 patches. Subsystems affected by this patch series: mm (cleanups and slub), alpha, procfs, sysctl, misc, core-kernel, bitmap, lib, compat, checkpatch, epoll, isofs, nilfs2, hpfs, exit, fork, kexec, gcov, panic, delayacct, gdb, resource, selftests, async, initramfs, ipc, drivers/char, and spelling" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (90 commits) mm: fix typos in comments mm: fix typos in comments treewide: remove editor modelines and cruft ipc/sem.c: spelling fix fs: fat: fix spelling typo of values kernel/sys.c: fix typo kernel/up.c: fix typo kernel/user_namespace.c: fix typos kernel/umh.c: fix some spelling mistakes include/linux/pgtable.h: few spelling fixes mm/slab.c: fix spelling mistake "disired" -> "desired" scripts/spelling.txt: add "overflw" scripts/spelling.txt: Add "diabled" typo scripts/spelling.txt: add "overlfow" arm: print alloc free paths for address in registers mm/vmalloc: remove vwrite() mm: remove xlate_dev_kmem_ptr() drivers/char: remove /dev/kmem for good mm: fix some typos and code style problems ipc/sem.c: mundane typo fixes ...
2021-05-07mm: fix typos in commentsLu Jialin3-3/+3
succed -> succeed in mm/hugetlb.c wil -> will in mm/mempolicy.c wit -> with in mm/page_alloc.c Retruns -> Returns in mm/page_vma_mapped.c confict -> conflict in mm/secretmem.c No functionality changed. Link: https://lkml.kernel.org/r/20210408140027.60623-1-lujialin4@huawei.com Signed-off-by: Lu Jialin <lujialin4@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>