summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2020-02-28KVM: apic: avoid calculating pending eoi from an uninitialized valMiaohe Lin1-1/+3
commit 23520b2def95205f132e167cf5b25c609975e959 upstream. When pv_eoi_get_user() fails, 'val' may remain uninitialized and the return value of pv_eoi_get_pending() becomes random. Fix the issue by initializing the variable. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28KVM: nVMX: handle nested posted interrupts when apicv is disabled for L1Vitaly Kuznetsov4-10/+17
commit 91a5f413af596ad01097e59bf487eb07cb3f1331 upstream. Even when APICv is disabled for L1 it can (and, actually, is) still available for L2, this means we need to always call vmx_deliver_nested_posted_interrupt() when attempting an interrupt delivery. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28KVM: nVMX: Check IO instruction VM-exit conditionsOliver Upton1-7/+52
commit 35a571346a94fb93b5b3b6a599675ef3384bc75c upstream. Consult the 'unconditional IO exiting' and 'use IO bitmaps' VM-execution controls when checking instruction interception. If the 'use IO bitmaps' VM-execution control is 1, check the instruction access against the IO bitmaps to determine if the instruction causes a VM-exit. Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28KVM: nVMX: Refactor IO bitmap checks into helper functionOliver Upton1-13/+27
commit e71237d3ff1abf9f3388337cfebf53b96df2020d upstream. Checks against the IO bitmap are useful for both instruction emulation and VM-exit reflection. Refactor the IO bitmap checks into a helper function. Signed-off-by: Oliver Upton <oupton@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28KVM: nVMX: Don't emulate instructions in guest modePaolo Bonzini1-1/+1
[ Upstream commit 07721feee46b4b248402133228235318199b05ec ] vmx_check_intercept is not yet fully implemented. To avoid emulating instructions disallowed by the L1 hypervisor, refuse to emulate instructions by default. Cc: stable@vger.kernel.org [Made commit, added commit msg - Oliver] Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28KVM: x86: don't notify userspace IOAPIC on edge-triggered interrupt EOIMiaohe Lin1-1/+1
commit 7455a8327674e1a7c9a1f5dd1b0743ab6713f6d1 upstream. Commit 13db77347db1 ("KVM: x86: don't notify userspace IOAPIC on edge EOI") said, edge-triggered interrupts don't set a bit in TMR, which means that IOAPIC isn't notified on EOI. And var level indicates level-triggered interrupt. But commit 3159d36ad799 ("KVM: x86: use generic function for MSI parsing") replace var level with irq.level by mistake. Fix it by changing irq.level to irq.trig_mode. Cc: stable@vger.kernel.org Fixes: 3159d36ad799 ("KVM: x86: use generic function for MSI parsing") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28x86/mce/amd: Fix kobject lifetimeThomas Gleixner1-6/+11
commit 51dede9c05df2b78acd6dcf6a17d21f0877d2d7b upstream. Accessing the MCA thresholding controls in sysfs concurrently with CPU hotplug can lead to a couple of KASAN-reported issues: BUG: KASAN: use-after-free in sysfs_file_ops+0x155/0x180 Read of size 8 at addr ffff888367578940 by task grep/4019 and BUG: KASAN: use-after-free in show_error_count+0x15c/0x180 Read of size 2 at addr ffff888368a05514 by task grep/4454 for example. Both result from the fact that the threshold block creation/teardown code frees the descriptor memory itself instead of defining proper ->release function and leaving it to the driver core to take care of that, after all sysfs accesses have completed. Do that and get rid of the custom freeing code, fixing the above UAFs in the process. [ bp: write commit message. ] Fixes: 95268664390b ("[PATCH] x86_64: mce_amd support for family 0x10 processors") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200214082801.13836-1-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28x86/mce/amd: Publish the bank pointer only after setup has succeededBorislav Petkov1-17/+16
commit 6e5cf31fbe651bed7ba1df768f2e123531132417 upstream. threshold_create_bank() creates a bank descriptor per MCA error thresholding counter which can be controlled over sysfs. It publishes the pointer to that bank in a per-CPU variable and then goes on to create additional thresholding blocks if the bank has such. However, that creation of additional blocks in allocate_threshold_blocks() can fail, leading to a use-after-free through the per-CPU pointer. Therefore, publish that pointer only after all blocks have been setup successfully. Fixes: 019f34fccfd5 ("x86, MCE, AMD: Move shared bank to node descriptor") Reported-by: Saar Amar <Saar.Amar@microsoft.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200128140846.phctkvx5btiexvbx@kili.mountain Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28x86/decoder: Add TEST opcode to Group3-2Masami Hiramatsu1-1/+1
[ Upstream commit 8b7e20a7ba54836076ff35a28349dabea4cec48f ] Add TEST opcode to Group3-2 reg=001b as same as Group3-1 does. Commit 12a78d43de76 ("x86/decoder: Add new TEST instruction pattern") added a TEST opcode assignment to f6 XX/001/XXX (Group 3-1), but did not add f7 XX/001/XXX (Group 3-2). Actually, this TEST opcode variant (ModRM.reg /1) is not described in the Intel SDM Vol2 but in AMD64 Architecture Programmer's Manual Vol.3, Appendix A.2 Table A-6. ModRM.reg Extensions for the Primary Opcode Map. Without this fix, Randy found a warning by insn_decoder_test related to this issue as below. HOSTCC arch/x86/tools/insn_decoder_test HOSTCC arch/x86/tools/insn_sanity TEST posttest arch/x86/tools/insn_decoder_test: warning: Found an x86 instruction decoder bug, please report this. arch/x86/tools/insn_decoder_test: warning: ffffffff81000bf1: f7 0b 00 01 08 00 testl $0x80100,(%rbx) arch/x86/tools/insn_decoder_test: warning: objdump says 6 bytes, but insn_get_length() says 2 arch/x86/tools/insn_decoder_test: warning: Decoded and checked 11913894 instructions with 1 failures TEST posttest arch/x86/tools/insn_sanity: Success: decoded and checked 1000000 random instructions with 0 errors (seed:0x871ce29c) To fix this error, add the TEST opcode according to AMD64 APM Vol.3. [ bp: Massage commit message. ] Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Link: https://lkml.kernel.org/r/157966631413.9580.10311036595431878351.stgit@devnote2 Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgdArd Biesheuvel1-7/+1
[ Upstream commit 75fbef0a8b6b4bb19b9a91b5214f846c2dc5139e ] The following commit: 15f003d20782 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd()") modified kernel_map_pages_in_pgd() to manage writable permissions of memory mappings in the EFI page table in a different way, but in the process, it removed the ability to clear NX attributes from read-only mappings, by clobbering the clear mask if _PAGE_RW is not being requested. Failure to remove the NX attribute from read-only mappings is unlikely to be a security issue, but it does prevent us from tightening the permissions in the EFI page tables going forward, so let's fix it now. Fixes: 15f003d20782 ("x86/mm/pat: Don't implicitly allow _PAGE_RW in kernel_map_pages_in_pgd() Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200113172245.27925-5-ardb@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28efi/x86: Don't panic or BUG() on non-critical error conditionsArd Biesheuvel2-18/+19
[ Upstream commit e2d68a955e49d61fd0384f23e92058dc9b79be5e ] The logic in __efi_enter_virtual_mode() does a number of steps in sequence, all of which may fail in one way or the other. In most cases, we simply print an error and disable EFI runtime services support, but in some cases, we BUG() or panic() and bring down the system when encountering conditions that we could easily handle in the same way. While at it, replace a pointless page-to-virt-phys conversion with one that goes straight from struct page to physical. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arvind Sankar <nivedita@alum.mit.edu> Cc: Matthew Garrett <mjg59@google.com> Cc: linux-efi@vger.kernel.org Link: https://lkml.kernel.org/r/20200103113953.9571-14-ardb@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28x86/vdso: Provide missing include fileValdis Klētnieks1-0/+1
[ Upstream commit bff47c2302cc249bcd550b17067f8dddbd4b6f77 ] When building with C=1, sparse issues a warning: CHECK arch/x86/entry/vdso/vdso32-setup.c arch/x86/entry/vdso/vdso32-setup.c:28:28: warning: symbol 'vdso32_enabled' was not declared. Should it be static? Provide the missing header file. Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/36224.1575599767@turing-police Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28x86/sysfb: Fix check for bad VRAM sizeArvind Sankar1-1/+1
[ Upstream commit dacc9092336be20b01642afe1a51720b31f60369 ] When checking whether the reported lfb_size makes sense, the height * stride result is page-aligned before seeing whether it exceeds the reported size. This doesn't work if height * stride is not an exact number of pages. For example, as reported in the kernel bugzilla below, an 800x600x32 EFI framebuffer gets skipped because of this. Move the PAGE_ALIGN to after the check vs size. Reported-by: Christopher Head <chead@chead.ca> Tested-by: Christopher Head <chead@chead.ca> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=206051 Link: https://lkml.kernel.org/r/20200107230410.2291947-1-nivedita@alum.mit.edu Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28efi/x86: Map the entire EFI vendor string before copying itArd Biesheuvel1-6/+7
[ Upstream commit ffc2760bcf2dba0dbef74013ed73eea8310cc52c ] Fix a couple of issues with the way we map and copy the vendor string: - we map only 2 bytes, which usually works since you get at least a page, but if the vendor string happens to cross a page boundary, a crash will result - only call early_memunmap() if early_memremap() succeeded, or we will call it with a NULL address which it doesn't like, - while at it, switch to early_memremap_ro(), and array indexing rather than pointer dereferencing to read the CHAR16 characters. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arvind Sankar <nivedita@alum.mit.edu> Cc: Matthew Garrett <mjg59@google.com> Cc: linux-efi@vger.kernel.org Fixes: 5b83683f32b1 ("x86: EFI runtime service support") Link: https://lkml.kernel.org/r/20200103113953.9571-5-ardb@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28KVM: nVMX: Use correct root level for nested EPT shadow page tablesSean Christopherson1-0/+3
[ Upstream commit 148d735eb55d32848c3379e460ce365f2c1cbe4b ] Hardcode the EPT page-walk level for L2 to be 4 levels, as KVM's MMU currently also hardcodes the page walk level for nested EPT to be 4 levels. The L2 guest is all but guaranteed to soft hang on its first instruction when L1 is using EPT, as KVM will construct 4-level page tables and then tell hardware to use 5-level page tables. Fixes: 855feb673640 ("KVM: MMU: Add 5 level EPT & Shadow page table support.") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28Revert "KVM: VMX: Add non-canonical check on writes to RTIT address MSRs"Sasha Levin1-8033/+0
This reverts commit 57211b7366cc2abf784c35e537b256e7fcddc91e. This patch isn't needed on 4.19 and older. Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28Revert "KVM: nVMX: Use correct root level for nested EPT shadow page tables"Sasha Levin1-3/+0
This reverts commit 740d876bd9565857a695ce7c05efda4eba5bc585. Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-28perf/x86/intel: Fix inaccurate period in context switch for auto-reloadKan Liang1-0/+2
commit f861854e1b435b27197417f6f90d87188003cb24 upstream. Perf doesn't take the left period into account when auto-reload is enabled with fixed period sampling mode in context switch. Here is the MSR trace of the perf command as below. (The MSR trace is simplified from a ftrace log.) #perf record -e cycles:p -c 2000000 -- ./triad_loop //The MSR trace of task schedule out //perf disable all counters, disable PEBS, disable GP counter 0, //read GP counter 0, and re-enable all counters. //The counter 0 stops at 0xfffffff82840 write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0 write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 0 write_msr: MSR_P6_EVNTSEL0(186), value 40003003c rdpmc: 0, value fffffff82840 write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff //The MSR trace of the same task schedule in again //perf disable all counters, enable and set GP counter 0, //enable PEBS, and re-enable all counters. //0xffffffe17b80 (-2000000) is written to GP counter 0. write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0 write_msr: MSR_IA32_PMC0(4c1), value ffffffe17b80 write_msr: MSR_P6_EVNTSEL0(186), value 40043003c write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 1 write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff When the same task schedule in again, the counter should starts from previous left. However, it starts from the fixed period -2000000 again. A special variant of intel_pmu_save_and_restart() is used for auto-reload, which doesn't update the hwc->period_left. When the monitored task schedules in again, perf doesn't know the left period. The fixed period is used, which is inaccurate. With auto-reload, the counter always has a negative counter value. So the left period is -value. Update the period_left in intel_pmu_save_and_restart_reload(). With the patch: //The MSR trace of task schedule out write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0 write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 0 write_msr: MSR_P6_EVNTSEL0(186), value 40003003c rdpmc: 0, value ffffffe25cbc write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff //The MSR trace of the same task schedule in again write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0 write_msr: MSR_IA32_PMC0(4c1), value ffffffe25cbc write_msr: MSR_P6_EVNTSEL0(186), value 40043003c write_msr: MSR_IA32_PEBS_ENABLE(3f1), value 1 write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value f000000ff Fixes: d31fc13fdcb2 ("perf/x86/intel: Fix event update for auto-reload") Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200121190125.3389-1-kan.liang@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28perf/x86/amd: Add missing L2 misses event spec to AMD Family 17h's event mapKim Phillips1-0/+1
commit 25d387287cf0330abf2aad761ce6eee67326a355 upstream. Commit 3fe3331bb285 ("perf/x86/amd: Add event map for AMD Family 17h"), claimed L2 misses were unsupported, due to them not being found in its referenced documentation, whose link has now moved [1]. That old documentation listed PMCx064 unit mask bit 3 as: "LsRdBlkC: LS Read Block C S L X Change to X Miss." and bit 0 as: "IcFillMiss: IC Fill Miss" We now have new public documentation [2] with improved descriptions, that clearly indicate what events those unit mask bits represent: Bit 3 now clearly states: "LsRdBlkC: Data Cache Req Miss in L2 (all types)" and bit 0 is: "IcFillMiss: Instruction Cache Req Miss in L2." So we can now add support for L2 misses in perf's genericised events as PMCx064 with both the above unit masks. [1] The commit's original documentation reference, "Processor Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1 Processors", originally available here: https://www.amd.com/system/files/TechDocs/54945_PPR_Family_17h_Models_00h-0Fh.pdf is now available here: https://developer.amd.com/wordpress/media/2017/11/54945_PPR_Family_17h_Models_00h-0Fh.pdf [2] "Processor Programming Reference (PPR) for Family 17h Model 31h, Revision B0 Processors", available here: https://developer.amd.com/wp-content/resources/55803_0.54-PUB.pdf Fixes: 3fe3331bb285 ("perf/x86/amd: Add event map for AMD Family 17h") Reported-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Babu Moger <babu.moger@amd.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20200121171232.28839-1-kim.phillips@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28KVM: nVMX: Use correct root level for nested EPT shadow page tablesSean Christopherson1-0/+3
commit 148d735eb55d32848c3379e460ce365f2c1cbe4b upstream. Hardcode the EPT page-walk level for L2 to be 4 levels, as KVM's MMU currently also hardcodes the page walk level for nested EPT to be 4 levels. The L2 guest is all but guaranteed to soft hang on its first instruction when L1 is using EPT, as KVM will construct 4-level page tables and then tell hardware to use 5-level page tables. Fixes: 855feb673640 ("KVM: MMU: Add 5 level EPT & Shadow page table support.") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28KVM: x86: emulate RDPIDPaolo Bonzini3-2/+42
commit fb6d4d340e0532032c808a9933eaaa7b8de435ab upstream. This is encoded as F3 0F C7 /7 with a register argument. The register argument is the second array in the group9 GroupDual, while F3 is the fourth element of a Prefix. Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15KVM: Use vcpu-specific gva->hva translation when querying host page sizeSean Christopherson1-3/+3
[ Upstream commit f9b84e19221efc5f493156ee0329df3142085f28 ] Use kvm_vcpu_gfn_to_hva() when retrieving the host page size so that the correct set of memslots is used when handling x86 page faults in SMM. Fixes: 54bf36aac520 ("KVM: x86: use vcpu-specific functions to read/write/translate GFNs") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-15KVM: nVMX: vmread should not set rflags to specify success in case of #PFMiaohe Lin1-1/+3
[ Upstream commit a4d956b9390418623ae5d07933e2679c68b6f83c ] In case writing to vmread destination operand result in a #PF, vmread should not call nested_vmx_succeed() to set rflags to specify success. Similar to as done in VMPTRST (See handle_vmptrst()). Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-15KVM: VMX: Add non-canonical check on writes to RTIT address MSRsSean Christopherson1-0/+8033
[ Upstream commit fe6ed369fca98e99df55c932b85782a5687526b5 ] Reject writes to RTIT address MSRs if the data being written is a non-canonical address as the MSRs are subject to canonical checks, e.g. KVM will trigger an unchecked #GP when loading the values to hardware during pt_guest_enter(). Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-15KVM: x86/mmu: Apply max PA check for MMIO sptes to 32-bit KVMSean Christopherson1-1/+1
[ Upstream commit e30a7d623dccdb3f880fbcad980b0cb589a1da45 ] Remove the bogus 64-bit only condition from the check that disables MMIO spte optimization when the system supports the max PA, i.e. doesn't have any reserved PA bits. 32-bit KVM always uses PAE paging for the shadow MMU, and per Intel's SDM: PAE paging translates 32-bit linear addresses to 52-bit physical addresses. The kernel's restrictions on max physical addresses are limits on how much memory the kernel can reasonably use, not what physical addresses are supported by hardware. Fixes: ce88decffd17 ("KVM: MMU: mmio page fault support") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-15KVM: x86: Fix potential put_fpu() w/o load_fpu() on MPX platformSean Christopherson1-0/+5
[ Upstream commit f958bd2314d117f8c29f4821401bc1925bc2e5ef ] Unlike most state managed by XSAVE, MPX is initialized to zero on INIT. Because INITs are usually recognized in the context of a VCPU_RUN call, kvm_vcpu_reset() puts the guest's FPU so that the FPU state is resident in memory, zeros the MPX state, and reloads FPU state to hardware. But, in the unlikely event that an INIT is recognized during kvm_arch_vcpu_ioctl_get_mpstate() via kvm_apic_accept_events(), kvm_vcpu_reset() will call kvm_put_guest_fpu() without a preceding kvm_load_guest_fpu() and corrupt the guest's FPU state (and possibly userspace's FPU state as well). Given that MPX is being removed from the kernel[*], fix the bug with the simple-but-ugly approach of loading the guest's FPU during KVM_GET_MP_STATE. [*] See commit f240652b6032b ("x86/mpx: Remove MPX APIs"). Fixes: f775b13eedee2 ("x86,kvm: move qemu/guest FPU switching out to vcpu_run") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-15KVM: x86: Protect pmu_intel.c from Spectre-v1/L1TF attacksMarios Pomonis1-8/+16
[ Upstream commit 66061740f1a487f4ed54fde75e724709f805da53 ] This fixes Spectre-v1/L1TF vulnerabilities in intel_find_fixed_event() and intel_rdpmc_ecx_to_pmc(). kvm_rdpmc() (ancestor of intel_find_fixed_event()) and reprogram_fixed_counter() (ancestor of intel_rdpmc_ecx_to_pmc()) are exported symbols so KVM should treat them conservatively from a security perspective. Fixes: 25462f7f5295 ("KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch") Signed-off-by: Nick Finco <nifi@google.com> Signed-off-by: Marios Pomonis <pomonis@google.com> Reviewed-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-15KVM: x86: Free wbinvd_dirty_mask if vCPU creation failsSean Christopherson1-1/+1
commit 16be9ddea268ad841457a59109963fff8c9de38d upstream. Free the vCPU's wbinvd_dirty_mask if vCPU creation fails after kvm_arch_vcpu_init(), e.g. when installing the vCPU's file descriptor. Do the freeing by calling kvm_arch_vcpu_free() instead of open coding the freeing. This adds a likely superfluous, but ultimately harmless, call to kvmclock_reset(), which only clears vcpu->arch.pv_time_enabled. Using kvm_arch_vcpu_free() allows for additional cleanup in the future. Fixes: f5f48ee15c2ee ("KVM: VMX: Execute WBINVD to keep data consistency with assigned devices") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15KVM: x86: Protect MSR-based index computations in fixed_msr_to_seg_unit() ↵Marios Pomonis1-2/+6
from Spectre-v1/L1TF attacks commit 25a5edea71b7c154b6a0b8cec14c711cafa31d26 upstream. This fixes a Spectre-v1/L1TF vulnerability in fixed_msr_to_seg_unit(). This function contains index computations based on the (attacker-controlled) MSR number. Fixes: de9aef5e1ad6 ("KVM: MTRR: introduce fixed_mtrr_segment table") Signed-off-by: Nick Finco <nifi@google.com> Signed-off-by: Marios Pomonis <pomonis@google.com> Reviewed-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15KVM: x86: Protect x86_decode_insn from Spectre-v1/L1TF attacksMarios Pomonis1-3/+8
commit 3c9053a2cae7ba2ba73766a34cea41baa70f57f7 upstream. This fixes a Spectre-v1/L1TF vulnerability in x86_decode_insn(). kvm_emulate_instruction() (an ancestor of x86_decode_insn()) is an exported symbol, so KVM should treat it conservatively from a security perspective. Fixes: 045a282ca415 ("KVM: emulator: implement fninit, fnstsw, fnstcw") Signed-off-by: Nick Finco <nifi@google.com> Signed-off-by: Marios Pomonis <pomonis@google.com> Reviewed-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15KVM: x86: Protect MSR-based index computations from Spectre-v1/L1TF attacks ↵Marios Pomonis1-2/+8
in x86.c commit 6ec4c5eee1750d5d17951c4e1960d953376a0dda upstream. This fixes a Spectre-v1/L1TF vulnerability in set_msr_mce() and get_msr_mce(). Both functions contain index computations based on the (attacker-controlled) MSR number. Fixes: 890ca9aefa78 ("KVM: Add MCE support") Signed-off-by: Nick Finco <nifi@google.com> Signed-off-by: Marios Pomonis <pomonis@google.com> Reviewed-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15KVM: x86: Protect ioapic_read_indirect() from Spectre-v1/L1TF attacksMarios Pomonis1-6/+8
commit 8c86405f606ca8508b8d9280680166ca26723695 upstream. This fixes a Spectre-v1/L1TF vulnerability in ioapic_read_indirect(). This function contains index computations based on the (attacker-controlled) IOREGSEL register. Fixes: a2c118bfab8b ("KVM: Fix bounds checking in ioapic indirect register reads (CVE-2013-1798)") Signed-off-by: Nick Finco <nifi@google.com> Signed-off-by: Marios Pomonis <pomonis@google.com> Reviewed-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15KVM: x86: Protect MSR-based index computations in pmu.h from Spectre-v1/L1TF ↵Marios Pomonis1-4/+14
attacks commit 13c5183a4e643cc2b03a22d0e582c8e17bb7457d upstream. This fixes a Spectre-v1/L1TF vulnerability in the get_gp_pmc() and get_fixed_pmc() functions. They both contain index computations based on the (attacker-controlled) MSR number. Fixes: 25462f7f5295 ("KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch") Signed-off-by: Nick Finco <nifi@google.com> Signed-off-by: Marios Pomonis <pomonis@google.com> Reviewed-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15KVM: x86: Protect ioapic_write_indirect() from Spectre-v1/L1TF attacksMarios Pomonis1-0/+1
commit 670564559ca35b439c8d8861fc399451ddf95137 upstream. This fixes a Spectre-v1/L1TF vulnerability in ioapic_write_indirect(). This function contains index computations based on the (attacker-controlled) IOREGSEL register. This patch depends on patch "KVM: x86: Protect ioapic_read_indirect() from Spectre-v1/L1TF attacks". Fixes: 70f93dae32ac ("KVM: Use temporary variable to shorten lines.") Signed-off-by: Nick Finco <nifi@google.com> Signed-off-by: Marios Pomonis <pomonis@google.com> Reviewed-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15KVM: x86: Protect kvm_hv_msr_[get|set]_crash_data() from Spectre-v1/L1TF attacksMarios Pomonis1-4/+6
commit 8618793750071d66028584a83ed0b4fa7eb4f607 upstream. This fixes Spectre-v1/L1TF vulnerabilities in kvm_hv_msr_get_crash_data() and kvm_hv_msr_set_crash_data(). These functions contain index computations that use the (attacker-controlled) MSR number. Fixes: e7d9513b60e8 ("kvm/x86: added hyper-v crash msrs into kvm hyperv context") Signed-off-by: Nick Finco <nifi@google.com> Signed-off-by: Marios Pomonis <pomonis@google.com> Reviewed-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15KVM: x86: Protect kvm_lapic_reg_write() from Spectre-v1/L1TF attacksMarios Pomonis1-4/+9
commit 4bf79cb089f6b1c6c632492c0271054ce52ad766 upstream. This fixes a Spectre-v1/L1TF vulnerability in kvm_lapic_reg_write(). This function contains index computations based on the (attacker-controlled) MSR number. Fixes: 0105d1a52640 ("KVM: x2apic interface to lapic") Signed-off-by: Nick Finco <nifi@google.com> Signed-off-by: Marios Pomonis <pomonis@google.com> Reviewed-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15KVM: x86: Protect DR-based index computations from Spectre-v1/L1TF attacksMarios Pomonis1-2/+6
commit ea740059ecb37807ba47b84b33d1447435a8d868 upstream. This fixes a Spectre-v1/L1TF vulnerability in __kvm_set_dr() and kvm_get_dr(). Both kvm_get_dr() and kvm_set_dr() (a wrapper of __kvm_set_dr()) are exported symbols so KVM should tream them conservatively from a security perspective. Fixes: 020df0794f57 ("KVM: move DR register access handling into generic code") Signed-off-by: Nick Finco <nifi@google.com> Signed-off-by: Marios Pomonis <pomonis@google.com> Reviewed-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15KVM: x86: Refactor prefix decoding to prevent Spectre-v1/L1TF attacksMarios Pomonis1-2/+14
commit 125ffc5e0a56a3eded608dc51e09d5ebf72cf652 upstream. This fixes Spectre-v1/L1TF vulnerabilities in vmx_read_guest_seg_selector(), vmx_read_guest_seg_base(), vmx_read_guest_seg_limit() and vmx_read_guest_seg_ar(). When invoked from emulation, these functions contain index computations based on the (attacker-influenced) segment value. Using constants prevents the attack. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15KVM: x86: Refactor picdev_write() to prevent Spectre-v1/L1TF attacksMarios Pomonis1-1/+5
commit 14e32321f3606e4b0970200b6e5e47ee6f1e6410 upstream. This fixes a Spectre-v1/L1TF vulnerability in picdev_write(). It replaces index computations based on the (attacked-controlled) port number with constants through a minor refactoring. Fixes: 85f455f7ddbe ("KVM: Add support for in-kernel PIC emulation") Signed-off-by: Nick Finco <nifi@google.com> Signed-off-by: Marios Pomonis <pomonis@google.com> Reviewed-by: Andrew Honig <ahonig@google.com> Cc: stable@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-15x86/cpu: Update cached HLE state on write to TSX_CTRL_CPUID_CLEARPawan Gupta1-6/+7
[ Upstream commit 5efc6fa9044c3356d6046c6e1da6d02572dbed6b ] /proc/cpuinfo currently reports Hardware Lock Elision (HLE) feature to be present on boot cpu even if it was disabled during the bootup. This is because cpuinfo_x86->x86_capability HLE bit is not updated after TSX state is changed via the new MSR IA32_TSX_CTRL. Update the cached HLE bit also since it is expected to change after an update to CPUID_CLEAR bit in MSR IA32_TSX_CTRL. Fixes: 95c5824f75f3 ("x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default") Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/2529b99546294c893dfa1c89e2b3e46da3369a59.1578685425.git.pawan.kumar.gupta@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-05x86/resctrl: Fix a deadlock due to inaccurate referenceXiaochen Shen1-8/+8
commit 334b0f4e9b1b4a1d475f803419d202f6c5e4d18e upstream. There is a race condition which results in a deadlock when rmdir and mkdir execute concurrently: $ ls /sys/fs/resctrl/c1/mon_groups/m1/ cpus cpus_list mon_data tasks Thread 1: rmdir /sys/fs/resctrl/c1 Thread 2: mkdir /sys/fs/resctrl/c1/mon_groups/m1 3 locks held by mkdir/48649: #0: (sb_writers#17){.+.+}, at: [<ffffffffb4ca2aa0>] mnt_want_write+0x20/0x50 #1: (&type->i_mutex_dir_key#8/1){+.+.}, at: [<ffffffffb4c8c13b>] filename_create+0x7b/0x170 #2: (rdtgroup_mutex){+.+.}, at: [<ffffffffb4a4389d>] rdtgroup_kn_lock_live+0x3d/0x70 4 locks held by rmdir/48652: #0: (sb_writers#17){.+.+}, at: [<ffffffffb4ca2aa0>] mnt_want_write+0x20/0x50 #1: (&type->i_mutex_dir_key#8/1){+.+.}, at: [<ffffffffb4c8c3cf>] do_rmdir+0x13f/0x1e0 #2: (&type->i_mutex_dir_key#8){++++}, at: [<ffffffffb4c86d5d>] vfs_rmdir+0x4d/0x120 #3: (rdtgroup_mutex){+.+.}, at: [<ffffffffb4a4389d>] rdtgroup_kn_lock_live+0x3d/0x70 Thread 1 is deleting control group "c1". Holding rdtgroup_mutex, kernfs_remove() removes all kernfs nodes under directory "c1" recursively, then waits for sub kernfs node "mon_groups" to drop active reference. Thread 2 is trying to create a subdirectory "m1" in the "mon_groups" directory. The wrapper kernfs_iop_mkdir() takes an active reference to the "mon_groups" directory but the code drops the active reference to the parent directory "c1" instead. As a result, Thread 1 is blocked on waiting for active reference to drop and never release rdtgroup_mutex, while Thread 2 is also blocked on trying to get rdtgroup_mutex. Thread 1 (rdtgroup_rmdir) Thread 2 (rdtgroup_mkdir) (rmdir /sys/fs/resctrl/c1) (mkdir /sys/fs/resctrl/c1/mon_groups/m1) ------------------------- ------------------------- kernfs_iop_mkdir /* * kn: "m1", parent_kn: "mon_groups", * prgrp_kn: parent_kn->parent: "c1", * * "mon_groups", parent_kn->active++: 1 */ kernfs_get_active(parent_kn) kernfs_iop_rmdir /* "c1", kn->active++ */ kernfs_get_active(kn) rdtgroup_kn_lock_live atomic_inc(&rdtgrp->waitcount) /* "c1", kn->active-- */ kernfs_break_active_protection(kn) mutex_lock rdtgroup_rmdir_ctrl free_all_child_rdtgrp sentry->flags = RDT_DELETED rdtgroup_ctrl_remove rdtgrp->flags = RDT_DELETED kernfs_get(kn) kernfs_remove(rdtgrp->kn) __kernfs_remove /* "mon_groups", sub_kn */ atomic_add(KN_DEACTIVATED_BIAS, &sub_kn->active) kernfs_drain(sub_kn) /* * sub_kn->active == KN_DEACTIVATED_BIAS + 1, * waiting on sub_kn->active to drop, but it * never drops in Thread 2 which is blocked * on getting rdtgroup_mutex. */ Thread 1 hangs here ----> wait_event(sub_kn->active == KN_DEACTIVATED_BIAS) ... rdtgroup_mkdir rdtgroup_mkdir_mon(parent_kn, prgrp_kn) mkdir_rdt_prepare(parent_kn, prgrp_kn) rdtgroup_kn_lock_live(prgrp_kn) atomic_inc(&rdtgrp->waitcount) /* * "c1", prgrp_kn->active-- * * The active reference on "c1" is * dropped, but not matching the * actual active reference taken * on "mon_groups", thus causing * Thread 1 to wait forever while * holding rdtgroup_mutex. */ kernfs_break_active_protection( prgrp_kn) /* * Trying to get rdtgroup_mutex * which is held by Thread 1. */ Thread 2 hangs here ----> mutex_lock ... The problem is that the creation of a subdirectory in the "mon_groups" directory incorrectly releases the active protection of its parent directory instead of itself before it starts waiting for rdtgroup_mutex. This is triggered by the rdtgroup_mkdir() flow calling rdtgroup_kn_lock_live()/rdtgroup_kn_unlock() with kernfs node of the parent control group ("c1") as argument. It should be called with kernfs node "mon_groups" instead. What is currently missing is that the kn->priv of "mon_groups" is NULL instead of pointing to the rdtgrp. Fix it by pointing kn->priv to rdtgrp when "mon_groups" is created. Then it could be passed to rdtgroup_kn_lock_live()/rdtgroup_kn_unlock() instead. And then it operates on the same rdtgroup structure but handles the active reference of kernfs node "mon_groups" to prevent deadlock. The same changes are also made to the "mon_data" directories. This results in some unused function parameters that will be cleaned up in follow-up patch as the focus here is on the fix only in support of backporting efforts. Backporting notes: Since upstream commit fa7d949337cc ("x86/resctrl: Rename and move rdt files to a separate directory"), the file arch/x86/kernel/cpu/intel_rdt_rdtgroup.c has been renamed and moved to arch/x86/kernel/cpu/resctrl/rdtgroup.c. Apply the change against file arch/x86/kernel/cpu/intel_rdt_rdtgroup.c for older stable trees. Fixes: c7d9aac61311 ("x86/intel_rdt/cqm: Add mkdir support for RDT monitoring") Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1578500886-21771-4-git-send-email-xiaochen.shen@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-05x86/resctrl: Fix use-after-free due to inaccurate refcount of rdtgroupXiaochen Shen1-5/+5
commit 074fadee59ee7a9d2b216e9854bd4efb5dad679f upstream. There is a race condition in the following scenario which results in an use-after-free issue when reading a monitoring file and deleting the parent ctrl_mon group concurrently: Thread 1 calls atomic_inc() to take refcount of rdtgrp and then calls kernfs_break_active_protection() to drop the active reference of kernfs node in rdtgroup_kn_lock_live(). In Thread 2, kernfs_remove() is a blocking routine. It waits on all sub kernfs nodes to drop the active reference when removing all subtree kernfs nodes recursively. Thread 2 could block on kernfs_remove() until Thread 1 calls kernfs_break_active_protection(). Only after kernfs_remove() completes the refcount of rdtgrp could be trusted. Before Thread 1 calls atomic_inc() and kernfs_break_active_protection(), Thread 2 could call kfree() when the refcount of rdtgrp (sentry) is 0 instead of 1 due to the race. In Thread 1, in rdtgroup_kn_unlock(), referring to earlier rdtgrp memory (rdtgrp->waitcount) which was already freed in Thread 2 results in use-after-free issue. Thread 1 (rdtgroup_mondata_show) Thread 2 (rdtgroup_rmdir) -------------------------------- ------------------------- rdtgroup_kn_lock_live /* * kn active protection until * kernfs_break_active_protection(kn) */ rdtgrp = kernfs_to_rdtgroup(kn) rdtgroup_kn_lock_live atomic_inc(&rdtgrp->waitcount) mutex_lock rdtgroup_rmdir_ctrl free_all_child_rdtgrp /* * sentry->waitcount should be 1 * but is 0 now due to the race. */ kfree(sentry)*[1] /* * Only after kernfs_remove() * completes, the refcount of * rdtgrp could be trusted. */ atomic_inc(&rdtgrp->waitcount) /* kn->active-- */ kernfs_break_active_protection(kn) rdtgroup_ctrl_remove rdtgrp->flags = RDT_DELETED /* * Blocking routine, wait for * all sub kernfs nodes to drop * active reference in * kernfs_break_active_protection. */ kernfs_remove(rdtgrp->kn) rdtgroup_kn_unlock mutex_unlock atomic_dec_and_test( &rdtgrp->waitcount) && (flags & RDT_DELETED) kernfs_unbreak_active_protection(kn) kfree(rdtgrp) mutex_lock mon_event_read rdtgroup_kn_unlock mutex_unlock /* * Use-after-free: refer to earlier rdtgrp * memory which was freed in [1]. */ atomic_dec_and_test(&rdtgrp->waitcount) && (flags & RDT_DELETED) /* kn->active++ */ kernfs_unbreak_active_protection(kn) kfree(rdtgrp) Fix it by moving free_all_child_rdtgrp() to after kernfs_remove() in rdtgroup_rmdir_ctrl() to ensure it has the accurate refcount of rdtgrp. Backporting notes: Since upstream commit fa7d949337cc ("x86/resctrl: Rename and move rdt files to a separate directory"), the file arch/x86/kernel/cpu/intel_rdt_rdtgroup.c has been renamed and moved to arch/x86/kernel/cpu/resctrl/rdtgroup.c. Apply the change against file arch/x86/kernel/cpu/intel_rdt_rdtgroup.c for older stable trees. Upstream commit 17eafd076291 ("x86/intel_rdt: Split resource group removal in two") moved part of resource group removal code from rdtgroup_rmdir_mon() into a separate function rdtgroup_ctrl_remove(). Apply the change against original code base of rdtgroup_rmdir_mon() for older stable trees. Fixes: f3cbeacaa06e ("x86/intel_rdt/cqm: Add rmdir support") Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1578500886-21771-3-git-send-email-xiaochen.shen@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-05x86/resctrl: Fix use-after-free when deleting resource groupsXiaochen Shen1-2/+10
commit b8511ccc75c033f6d54188ea4df7bf1e85778740 upstream. A resource group (rdtgrp) contains a reference count (rdtgrp->waitcount) that indicates how many waiters expect this rdtgrp to exist. Waiters could be waiting on rdtgroup_mutex or some work sitting on a task's workqueue for when the task returns from kernel mode or exits. The deletion of a rdtgrp is intended to have two phases: (1) while holding rdtgroup_mutex the necessary cleanup is done and rdtgrp->flags is set to RDT_DELETED, (2) after releasing the rdtgroup_mutex, the rdtgrp structure is freed only if there are no waiters and its flag is set to RDT_DELETED. Upon gaining access to rdtgroup_mutex or rdtgrp, a waiter is required to check for the RDT_DELETED flag. When unmounting the resctrl file system or deleting ctrl_mon groups, all of the subdirectories are removed and the data structure of rdtgrp is forcibly freed without checking rdtgrp->waitcount. If at this point there was a waiter on rdtgrp then a use-after-free issue occurs when the waiter starts running and accesses the rdtgrp structure it was waiting on. See kfree() calls in [1], [2] and [3] in these two call paths in following scenarios: (1) rdt_kill_sb() -> rmdir_all_sub() -> free_all_child_rdtgrp() (2) rdtgroup_rmdir() -> rdtgroup_rmdir_ctrl() -> free_all_child_rdtgrp() There are several scenarios that result in use-after-free issue in following: Scenario 1: ----------- In Thread 1, rdtgroup_tasks_write() adds a task_work callback move_myself(). If move_myself() is scheduled to execute after Thread 2 rdt_kill_sb() is finished, referring to earlier rdtgrp memory (rdtgrp->waitcount) which was already freed in Thread 2 results in use-after-free issue. Thread 1 (rdtgroup_tasks_write) Thread 2 (rdt_kill_sb) ------------------------------- ---------------------- rdtgroup_kn_lock_live atomic_inc(&rdtgrp->waitcount) mutex_lock rdtgroup_move_task __rdtgroup_move_task /* * Take an extra refcount, so rdtgrp cannot be freed * before the call back move_myself has been invoked */ atomic_inc(&rdtgrp->waitcount) /* Callback move_myself will be scheduled for later */ task_work_add(move_myself) rdtgroup_kn_unlock mutex_unlock atomic_dec_and_test(&rdtgrp->waitcount) && (flags & RDT_DELETED) mutex_lock rmdir_all_sub /* * sentry and rdtgrp are freed * without checking refcount */ free_all_child_rdtgrp kfree(sentry)*[1] kfree(rdtgrp)*[2] mutex_unlock /* * Callback is scheduled to execute * after rdt_kill_sb is finished */ move_myself /* * Use-after-free: refer to earlier rdtgrp * memory which was freed in [1] or [2]. */ atomic_dec_and_test(&rdtgrp->waitcount) && (flags & RDT_DELETED) kfree(rdtgrp) Scenario 2: ----------- In Thread 1, rdtgroup_tasks_write() adds a task_work callback move_myself(). If move_myself() is scheduled to execute after Thread 2 rdtgroup_rmdir() is finished, referring to earlier rdtgrp memory (rdtgrp->waitcount) which was already freed in Thread 2 results in use-after-free issue. Thread 1 (rdtgroup_tasks_write) Thread 2 (rdtgroup_rmdir) ------------------------------- ------------------------- rdtgroup_kn_lock_live atomic_inc(&rdtgrp->waitcount) mutex_lock rdtgroup_move_task __rdtgroup_move_task /* * Take an extra refcount, so rdtgrp cannot be freed * before the call back move_myself has been invoked */ atomic_inc(&rdtgrp->waitcount) /* Callback move_myself will be scheduled for later */ task_work_add(move_myself) rdtgroup_kn_unlock mutex_unlock atomic_dec_and_test(&rdtgrp->waitcount) && (flags & RDT_DELETED) rdtgroup_kn_lock_live atomic_inc(&rdtgrp->waitcount) mutex_lock rdtgroup_rmdir_ctrl free_all_child_rdtgrp /* * sentry is freed without * checking refcount */ kfree(sentry)*[3] rdtgroup_ctrl_remove rdtgrp->flags = RDT_DELETED rdtgroup_kn_unlock mutex_unlock atomic_dec_and_test( &rdtgrp->waitcount) && (flags & RDT_DELETED) kfree(rdtgrp) /* * Callback is scheduled to execute * after rdt_kill_sb is finished */ move_myself /* * Use-after-free: refer to earlier rdtgrp * memory which was freed in [3]. */ atomic_dec_and_test(&rdtgrp->waitcount) && (flags & RDT_DELETED) kfree(rdtgrp) If CONFIG_DEBUG_SLAB=y, Slab corruption on kmalloc-2k can be observed like following. Note that "0x6b" is POISON_FREE after kfree(). The corrupted bits "0x6a", "0x64" at offset 0x424 correspond to waitcount member of struct rdtgroup which was freed: Slab corruption (Not tainted): kmalloc-2k start=ffff9504c5b0d000, len=2048 420: 6b 6b 6b 6b 6a 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkjkkkkkkkkkkk Single bit error detected. Probably bad RAM. Run memtest86+ or a similar memory test tool. Next obj: start=ffff9504c5b0d800, len=2048 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Slab corruption (Not tainted): kmalloc-2k start=ffff9504c58ab800, len=2048 420: 6b 6b 6b 6b 64 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkdkkkkkkkkkkk Prev obj: start=ffff9504c58ab000, len=2048 000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk 010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Fix this by taking reference count (waitcount) of rdtgrp into account in the two call paths that currently do not do so. Instead of always freeing the resource group it will only be freed if there are no waiters on it. If there are waiters, the resource group will have its flags set to RDT_DELETED. It will be left to the waiter to free the resource group when it starts running and finding that it was the last waiter and the resource group has been removed (rdtgrp->flags & RDT_DELETED) since. (1) rdt_kill_sb() -> rmdir_all_sub() -> free_all_child_rdtgrp() (2) rdtgroup_rmdir() -> rdtgroup_rmdir_ctrl() -> free_all_child_rdtgrp() Backporting notes: Since upstream commit fa7d949337cc ("x86/resctrl: Rename and move rdt files to a separate directory"), the file arch/x86/kernel/cpu/intel_rdt_rdtgroup.c has been renamed and moved to arch/x86/kernel/cpu/resctrl/rdtgroup.c. Apply the change against file arch/x86/kernel/cpu/intel_rdt_rdtgroup.c in older stable trees. Fixes: f3cbeacaa06e ("x86/intel_rdt/cqm: Add rmdir support") Fixes: 60cf5e101fd4 ("x86/intel_rdt: Add mkdir to resctrl file system") Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1578500886-21771-2-git-send-email-xiaochen.shen@intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27x86, perf: Fix the dependency of the x86 insn decoder selftestMasami Hiramatsu1-1/+1
[ Upstream commit 7720804a2ae46c90265a32c81c45fb6f8d2f4e8b ] Since x86 instruction decoder is not only for kprobes, it should be tested when the insn.c is compiled. (e.g. perf is enabled but kprobes is disabled) Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: cbe5c34c8c1f ("x86: Compile insn.c and inat.c only for KPROBES") Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27x86/kgbd: Use NMI_VECTOR not APIC_DM_NMIThomas Gleixner1-1/+1
[ Upstream commit 2591bc4e8d70b4e1330d327fb7e3921f4e070a51 ] apic->send_IPI_allbutself() takes a vector number as argument. APIC_DM_NMI is clearly not a vector number. It's defined to 0x400 which is outside the vector space. Use NMI_VECTOR instead as that's what it is intended to be. Fixes: 82da3ff89dc2 ("x86: kgdb support") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190722105218.855189979@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-27x86/mm: Remove unused variable 'cpu'Qian Cai1-3/+0
[ Upstream commit 3609e31bc8dc03b701390f79c74fc7fe92b95039 ] The commit a2055abe9c67 ("x86/mm: Pass flush_tlb_info to flush_tlb_others() etc") removed the unnecessary cpu parameter from uv_flush_tlb_others() but left an unused variable. arch/x86/mm/tlb.c: In function 'native_flush_tlb_others': arch/x86/mm/tlb.c:688:16: warning: variable 'cpu' set but not used [-Wunused-but-set-variable] unsigned int cpu; ^~~ Fixes: a2055abe9c67 ("x86/mm: Pass flush_tlb_info to flush_tlb_others() etc") Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Andyt Lutomirski <luto@kernel.org> Cc: dave.hansen@linux.intel.com Cc: peterz@infradead.org Cc: bp@alien8.de Cc: hpa@zytor.com Link: https://lkml.kernel.org/r/20190228220155.88124-1-cai@lca.pw Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-23x86/efistub: Disable paging at mixed mode entryArd Biesheuvel1-0/+5
commit 4911ee401b7ceff8f38e0ac597cbf503d71e690c upstream. The EFI mixed mode entry code goes through the ordinary startup_32() routine before jumping into the kernel's EFI boot code in 64-bit mode. The 32-bit startup code must be entered with paging disabled, but this is not documented as a requirement for the EFI handover protocol, and so we should disable paging explicitly when entering the kernel from 32-bit EFI firmware. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Cc: <stable@vger.kernel.org> Cc: Arvind Sankar <nivedita@alum.mit.edu> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: https://lkml.kernel.org/r/20191224132909.102540-4-ardb@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-23x86/resctrl: Fix an imbalance in domain_remove_cpu()Qian Cai1-1/+1
commit e278af89f1ba0a9ef20947db6afc2c9afa37e85b upstream. A system that supports resource monitoring may have multiple resources while not all of these resources are capable of monitoring. Monitoring related state is initialized only for resources that are capable of monitoring and correspondingly this state should subsequently only be removed from these resources that are capable of monitoring. domain_add_cpu() calls domain_setup_mon_state() only when r->mon_capable is true where it will initialize d->mbm_over. However, domain_remove_cpu() calls cancel_delayed_work(&d->mbm_over) without checking r->mon_capable resulting in an attempt to cancel d->mbm_over on all resources, even those that never initialized d->mbm_over because they are not capable of monitoring. Hence, it triggers a debugobjects warning when offlining CPUs because those timer debugobjects are never initialized: ODEBUG: assert_init not available (active state 0) object type: timer_list hint: 0x0 WARNING: CPU: 143 PID: 789 at lib/debugobjects.c:484 debug_print_object Hardware name: HP Synergy 680 Gen9/Synergy 680 Gen9 Compute Module, BIOS I40 05/23/2018 RIP: 0010:debug_print_object Call Trace: debug_object_assert_init del_timer try_to_grab_pending cancel_delayed_work resctrl_offline_cpu cpuhp_invoke_callback cpuhp_thread_fun smpboot_thread_fn kthread ret_from_fork Fixes: e33026831bdb ("x86/intel_rdt/mbm: Handle counter overflow") Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: john.stultz@linaro.org Cc: sboyd@kernel.org Cc: <stable@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: tj@kernel.org Cc: Tony Luck <tony.luck@intel.com> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191211033042.2188-1-cai@lca.pw Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-12perf/x86/intel: Fix PT PMI handlingAlexander Shishkin1-2/+7
[ Upstream commit 92ca7da4bdc24d63bb0bcd241c11441ddb63b80a ] Commit: ccbebba4c6bf ("perf/x86/intel/pt: Bypass PT vs. LBR exclusivity if the core supports it") skips the PT/LBR exclusivity check on CPUs where PT and LBRs coexist, but also inadvertently skips the active_events bump for PT in that case, which is a bug. If there aren't any hardware events at the same time as PT, the PMI handler will ignore PT PMIs, as active_events reads zero in that case, resulting in the "Uhhuh" spurious NMI warning and PT data loss. Fix this by always increasing active_events for PT events. Fixes: ccbebba4c6bf ("perf/x86/intel/pt: Bypass PT vs. LBR exclusivity if the core supports it") Reported-by: Vitaly Slobodskoy <vitaly.slobodskoy@intel.com> Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Link: https://lkml.kernel.org/r/20191210105101.77210-1-alexander.shishkin@linux.intel.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-12x86/efi: Update e820 with reserved EFI boot services data to fix kexec breakageDave Young1-4/+2
[ Upstream commit af164898482817a1d487964b68f3c21bae7a1beb ] Michael Weiser reported that he got this error during a kexec rebooting: esrt: Unsupported ESRT version 2904149718861218184. The ESRT memory stays in EFI boot services data, and it was reserved in kernel via efi_mem_reserve(). The initial purpose of the reservation is to reuse the EFI boot services data across kexec reboot. For example the BGRT image data and some ESRT memory like Michael reported. But although the memory is reserved it is not updated in the X86 E820 table, and kexec_file_load() iterates system RAM in the IO resource list to find places for kernel, initramfs and other stuff. In Michael's case the kexec loaded initramfs overwrote the ESRT memory and then the failure happened. Since kexec_file_load() depends on the E820 table being updated, just fix this by updating the reserved EFI boot services memory as reserved type in E820. Originally any memory descriptors with EFI_MEMORY_RUNTIME attribute are bypassed in the reservation code path because they are assumed as reserved. But the reservation is still needed for multiple kexec reboots, and it is the only possible case we come here thus just drop the code chunk, then everything works without side effects. On my machine the ESRT memory sits in an EFI runtime data range, it does not trigger the problem, but I successfully tested with BGRT instead. both kexec_load() and kexec_file_load() work and kdump works as well. [ mingo: Edited the changelog. ] Reported-by: Michael Weiser <michael@weiser.dinsnail.net> Tested-by: Michael Weiser <michael@weiser.dinsnail.net> Signed-off-by: Dave Young <dyoung@redhat.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kexec@lists.infradead.org Cc: linux-efi@vger.kernel.org Link: https://lkml.kernel.org/r/20191204075233.GA10520@dhcp-128-65.nay.redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>