summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)AuthorFilesLines
2020-08-18perf/x86/intel: Support TopDown metrics on Ice LakeKan Liang4-0/+135
Ice Lake supports the hardware TopDown metrics feature, which can free up the scarce GP counters. Update the event constraints for the metrics events. The metric counters do not exist, which are mapped to a dummy offset. The sharing between multiple users of the same metric without multiplexing is not allowed. Implement set_topdown_event_period for Ice Lake. The values in PERF_METRICS MSR are derived from the fixed counter 3. Both registers should start from zero. Implement update_topdown_event for Ice Lake. The metric is reported by multiplying the metric (fraction) with slots. To maintain accurate measurements, both registers are cleared for each update. The fixed counter 3 should always be cleared before the PERF_METRICS. Implement td_attr for the new metrics events and the new slots fixed counter. Make them visible to the perf user tools. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-11-kan.liang@linux.intel.com
2020-08-18perf/x86: Add a macro for RDPMC offset of fixed countersKan Liang2-1/+5
The RDPMC base offset of fixed counters is hard-code. Use a meaningful name to replace the magic number to improve the readability of the code. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-10-kan.liang@linux.intel.com
2020-08-18perf/x86/intel: Generic support for hardware TopDown metricsKan Liang5-15/+257
Intro ===== The TopDown Microarchitecture Analysis (TMA) Method is a structured analysis methodology to identify critical performance bottlenecks in out-of-order processors. Current perf has supported the method. The method works well, but there is one problem. To collect the TopDown events, several GP counters have to be used. If a user wants to collect other events at the same time, the multiplexing probably be triggered, which impacts the accuracy. To free up the scarce GP counters, the hardware TopDown metrics feature is introduced from Ice Lake. The hardware implements an additional "metrics" register and a new Fixed Counter 3 that measures pipeline "slots". The TopDown events can be calculated from them instead. Events ====== The level 1 TopDown has four metrics. There is no event-code assigned to the TopDown metrics. Four metric events are exported as separate perf events, which map to the internal "metrics" counter register. Those events do not exist in hardware, but can be allocated by the scheduler. For the event mapping, a special 0x00 event code is used, which is reserved for fake events. The metric events start from umask 0x10. When setting up the metric events, they point to the Fixed Counter 3. They have to be specially handled. - Add the update_topdown_event() callback to read the additional metrics MSR and generate the metrics. - Add the set_topdown_event_period() callback to initialize metrics MSR and the fixed counter 3. - Add a variable n_metric_event to track the number of the accepted metrics events. The sharing between multiple users of the same metric without multiplexing is not allowed. - Only enable/disable the fixed counter 3 when there are no other active TopDown events, which avoid the unnecessary writing of the fixed control register. - Disable the PMU when reading the metrics event. The metrics MSR and the fixed counter 3 are read separately. The values may be modified by an NMI. All four metric events don't support sampling. Since they will be handled specially for event update, a flag PERF_X86_EVENT_TOPDOWN is introduced to indicate this case. The slots event can support both sampling and counting. For counting, the flag is also applied. For sampling, it will be handled normally as other normal events. Groups ====== The slots event is required in a Topdown group. To avoid reading the METRICS register multiple times, the metrics and slots value can only be updated by slots event in a group. All active slots and metrics events will be updated one time. Therefore, the slots event must be before any metric events in a Topdown group. NMI ====== The METRICS related register may be overflow. The bit 48 of the STATUS register will be set. If so, PERF_METRICS and Fixed counter 3 are required to be reset. The patch also update all active slots and metrics events in the NMI handler. The update_topdown_event() has to read two registers separately. The values may be modified by an NMI. PMU has to be disabled before calling the function. RDPMC ====== RDPMC is temporarily disabled. A later patch will enable it. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-9-kan.liang@linux.intel.com
2020-08-18perf/x86/intel: Use switch in intel_pmu_disable/enable_eventKan Liang1-8/+28
Currently, the if-else is used in the intel_pmu_disable/enable_event to check the type of an event. It works well, but with more and more types added later, e.g., perf metrics, compared to the switch statement, the if-else may impair the readability of the code. There is no harm to use the switch statement to replace the if-else here. Also, some optimizing compilers may compile a switch statement into a jump-table which is more efficient than if-else for a large number of cases. The performance gain may not be observed for now, because the number of cases is only 5, but the benefits may be observed with more and more types added in the future. Use switch to replace the if-else in the intel_pmu_disable/enable_event. If the idx is invalid, print a warning. For the case INTEL_PMC_IDX_FIXED_BTS in intel_pmu_disable_event, don't need to check the event->attr.precise_ip. Use return for the case. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-7-kan.liang@linux.intel.com
2020-08-18perf/x86/intel: Fix the name of perf METRICSKan Liang1-1/+1
Bit 15 of the PERF_CAPABILITIES MSR indicates that the perf METRICS feature is supported. The perf METRICS is not a PEBS feature. Rename pebs_metrics_available perf_metrics. The bit is not used in the current code. It will be used in a later patch. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-6-kan.liang@linux.intel.com
2020-08-18perf/x86/intel: Move BTS index to 47Kan Liang1-2/+2
The bit 48 in the PERF_GLOBAL_STATUS is used to indicate the overflow status of the PERF_METRICS counters. Move the BTS index to the bit 47. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-5-kan.liang@linux.intel.com
2020-08-18perf/x86/intel: Introduce the fourth fixed counterKan Liang1-3/+20
The fourth fixed counter, TOPDOWN.SLOTS, is introduced in Ice Lake to measure the level 1 TopDown events. Add MSR address and macros for the new fixed counter, which will be used in a later patch. Add comments to explain the event encoding rules for the fixed counters. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-4-kan.liang@linux.intel.com
2020-08-18perf/x86/intel: Name the global status bit in NMI handlerKan Liang2-12/+14
Magic numbers are used in the current NMI handler for the global status bit. Use a meaningful name to replace the magic numbers to improve the readability of the code. Remove a Tab for all GLOBAL_STATUS_* and INTEL_PMC_IDX_FIXED_BTS macros to reduce the length of the line. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-3-kan.liang@linux.intel.com
2020-08-18perf/x86: Use event_base_rdpmc for the RDPMC userspace supportKan Liang1-8/+3
The RDPMC index is always re-calculated for the RDPMC userspace support, which is unnecessary. The RDPMC index value is stored in the variable event_base_rdpmc for the kernel usage, which can be used for RDPMC userspace support as well. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200723171117.9918-2-kan.liang@linux.intel.com
2020-08-18x86/cpu: Use XGETBV and XSETBV mnemonics in fpu/internal.hUros Bizjak1-5/+2
Current minimum required version of binutils is 2.23, which supports XGETBV and XSETBV instruction mnemonics. Replace the byte-wise specification of XGETBV and XSETBV with these proper mnemonics. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200707174722.58651-1-ubizjak@gmail.com
2020-08-17kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE modeJim Mattson1-1/+1
See the SDM, volume 3, section 4.4.1: If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the PDPTEs are loaded from the address in CR3. Fixes: b9baba8614890 ("KVM, pkeys: expose CPUID/CR4 to guest") Cc: Huaitong Han <huaitong.han@intel.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200817181655.3716509-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-17kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE modeJim Mattson1-1/+1
See the SDM, volume 3, section 4.4.1: If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the PDPTEs are loaded from the address in CR3. Fixes: 0be0226f07d14 ("KVM: MMU: fix SMAP virtualization") Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200817181655.3716509-2-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-17KVM: x86: fix access code passed to gva_to_gpaPaolo Bonzini1-1/+3
The PK bit of the error code is computed dynamically in permission_fault and therefore need not be passed to gva_to_gpa: only the access bits (fetch, user, write) need to be passed down. Not doing so causes a splat in the pku test: WARNING: CPU: 25 PID: 5465 at arch/x86/kvm/mmu.h:197 paging64_walk_addr_generic+0x594/0x750 [kvm] Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0014.D62.2001092233 01/09/2020 RIP: 0010:paging64_walk_addr_generic+0x594/0x750 [kvm] Code: <0f> 0b e9 db fe ff ff 44 8b 43 04 4c 89 6c 24 30 8b 13 41 39 d0 89 RSP: 0018:ff53778fc623fb60 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ff53778fc623fbf0 RCX: 0000000000000007 RDX: 0000000000000001 RSI: 0000000000000002 RDI: ff4501efba818000 RBP: 0000000000000020 R08: 0000000000000005 R09: 00000000004000e7 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007 R13: ff4501efba818388 R14: 10000000004000e7 R15: 0000000000000000 FS: 00007f2dcf31a700(0000) GS:ff4501f1c8040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000001dea475005 CR4: 0000000000763ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: paging64_gva_to_gpa+0x3f/0xb0 [kvm] kvm_fixup_and_inject_pf_error+0x48/0xa0 [kvm] handle_exception_nmi+0x4fc/0x5b0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x911/0x1c10 [kvm] kvm_vcpu_ioctl+0x23e/0x5d0 [kvm] ksys_ioctl+0x92/0xb0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x3e/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace d17eb998aee991da ]--- Reported-by: Sean Christopherson <sean.j.christopherson@intel.com> Fixes: 897861479c064 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-17x86/cpu: Use SERIALIZE in sync_core() when availableRicardo Neri2-8/+24
The SERIALIZE instruction gives software a way to force the processor to complete all modifications to flags, registers and memory from previous instructions and drain all buffered writes to memory before the next instruction is fetched and executed. Thus, it serves the purpose of sync_core(). Use it when available. Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200807032833.17484-1-ricardo.neri-calderon@linux.intel.com
2020-08-15perf/x86/intel/uncore: Add BW counters for GT, IA and IO breakdownVaibhav Shankar1-3/+49
Linux only has support to read total DDR reads and writes. Here we add support to enable bandwidth breakdown-GT, IA and IO. Breakdown of BW is important to debug and optimize memory access. This can also be used for telemetry and improving the system software.The offsets for GT, IA and IO are added and these free running counters can be accessed via MMIO space. The BW breakdown can be measured using the following cmd: perf stat -e uncore_imc/gt_requests/,uncore_imc/ia_requests/,uncore_imc/io_requests/ 30.57 MiB uncore_imc/gt_requests/ 1346.13 MiB uncore_imc/ia_requests/ 190.97 MiB uncore_imc/io_requests/ 5.984572733 seconds time elapsed BW/s = <gt,ia,io>_requests/time elapsed Signed-off-by: Vaibhav Shankar <vaibhav.shankar@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200814022234.23605-1-vaibhav.shankar@intel.com
2020-08-15Merge tag 'x86-urgent-2020-08-15' of ↵Linus Torvalds7-17/+58
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes and small updates all around the place: - Fix mitigation state sysfs output - Fix an FPU xstate/sxave code assumption bug triggered by Architectural LBR support - Fix Lightning Mountain SoC TSC frequency enumeration bug - Fix kexec debug output - Fix kexec memory range assumption bug - Fix a boundary condition in the crash kernel code - Optimize porgatory.ro generation a bit - Enable ACRN guests to use X2APIC mode - Reduce a __text_poke() IRQs-off critical section for the benefit of PREEMPT_RT" * tag 'x86-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/alternatives: Acquire pte lock with interrupts enabled x86/bugs/multihit: Fix mitigation reporting when VMX is not in use x86/fpu/xstate: Fix an xstate size check warning with architectural LBRs x86/purgatory: Don't generate debug info for purgatory.ro x86/tsr: Fix tsc frequency enumeration bug on Lightning Mountain SoC kexec_file: Correctly output debugging information for the PT_LOAD ELF header kexec: Improve & fix crash_exclude_mem_range() to handle overlapping ranges x86/crash: Correct the address boundary of function parameters x86/acrn: Remove redundant chars from ACRN signature x86/acrn: Allow ACRN guest to use X2APIC mode
2020-08-15Merge tag 'perf-urgent-2020-08-15' of ↵Linus Torvalds1-10/+36
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc fixes, an expansion of perf syscall access to CAP_PERFMON privileged tools, plus a RAPL HW-enablement for Intel SPR platforms" * tag 'perf-urgent-2020-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/rapl: Add support for Intel SPR platform perf/x86/rapl: Support multiple RAPL unit quirks perf/x86/rapl: Fix missing psys sysfs attributes hw_breakpoint: Remove unused __register_perf_hw_breakpoint() declaration kprobes: Remove show_registers() function prototype perf/core: Take over CAP_SYS_PTRACE creds to CAP_PERFMON capability
2020-08-15x86/mm/64: Update comment in preallocate_vmalloc_pages()Joerg Roedel1-5/+10
The comment explaining why 4-level systems only need to allocate on the P4D level caused some confustion. Update it to better explain why on 4-level systems the allocation on PUD level is necessary. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200814151947.26229-3-joro@8bytes.org
2020-08-15x86/mm/64: Do not sync vmalloc/ioremap mappingsJoerg Roedel2-7/+0
Remove the code to sync the vmalloc and ioremap ranges for x86-64. The page-table pages are all pre-allocated so that synchronization is no longer necessary. This is a patch that already went into the kernel as: commit 8bb9bf242d1f ("x86/mm/64: Do not sync vmalloc/ioremap mappings") But it had to be reverted later because it unveiled a bug from: commit 6eb82f994026 ("x86/mm: Pre-allocate P4D/PUD pages for vmalloc area") The bug in that commit causes the P4D/PUD pages not to be correctly allocated, making the synchronization still necessary. That issue got fixed meanwhile upstream: commit 995909a4e22b ("x86/mm/64: Do not dereference non-present PGD entries") With that fix it is safe again to remove the page-table synchronization for vmalloc/ioremap ranges on x86-64. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200814151947.26229-2-joro@8bytes.org
2020-08-15x86/paravirt: Avoid needless paravirt step clearing page table entriesJuergen Gross1-6/+6
pte_clear() et al are based on two paravirt steps today: one step to create a page table entry with all zeroes, and one step to write this entry value. Drop the first step as it is completely useless. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200815100641.26362-7-jgross@suse.com
2020-08-15x86/paravirt: Remove set_pte_at() pv-opJuergen Gross5-22/+4
On x86 set_pte_at() is now always falling back to set_pte(). So instead of having this fallback after the paravirt maze just drop the set_pte_at paravirt operation and let set_pte_at() use the set_pte() function directly. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200815100641.26362-6-jgross@suse.com
2020-08-15x86/entry/32: Simplify CONFIG_XEN_PV build dependencyJuergen Gross1-2/+2
With 32-bit Xen PV support gone, the following commit is not needed anymore: a4c0e91d1d65bc58 ("x86/entry/32: Fix XEN_PV build dependency") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200815100641.26362-5-jgross@suse.com
2020-08-15x86/paravirt: Use CONFIG_PARAVIRT_XXL instead of CONFIG_PARAVIRTJuergen Gross3-4/+4
There are some code parts using CONFIG_PARAVIRT for Xen pvops related issues instead of the more stringent CONFIG_PARAVIRT_XXL. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200815100641.26362-4-jgross@suse.com
2020-08-15x86/paravirt: Clean up paravirt macrosJuergen Gross1-15/+0
Some paravirt macros are no longer used, delete them. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200815100641.26362-3-jgross@suse.com
2020-08-15x86/paravirt: Remove 32-bit support from CONFIG_PARAVIRT_XXLJuergen Gross11-189/+13
The last 32-bit user of stuff under CONFIG_PARAVIRT_XXL is gone. Remove 32-bit specific parts. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200815100641.26362-2-jgross@suse.com
2020-08-15all arch: remove system call sys_sysctlXiaoming Ni2-2/+2
Since commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"), sys_sysctl is actually unavailable: any input can only return an error. We have been warning about people using the sysctl system call for years and believe there are no more users. Even if there are users of this interface if they have not complained or fixed their code by now they probably are not going to, so there is no point in warning them any longer. So completely remove sys_sysctl on all architectures. [nixiaoming@huawei.com: s390: fix build error for sys_call_table_emu] Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.com Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Will Deacon <will@kernel.org> [arm/arm64] Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bin Meng <bin.meng@windriver.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: chenzefeng <chenzefeng2@huawei.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christian Brauner <christian@brauner.io> Cc: Chris Zankel <chris@zankel.net> Cc: David Howells <dhowells@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Diego Elio Pettenò <flameeyes@flameeyes.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kars de Jong <jongk@linux-m68k.org> Cc: Kees Cook <keescook@chromium.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Olof Johansson <olof@lixom.net> Cc: Paul Burton <paulburton@kernel.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Sven Schnelle <svens@stackframe.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zhou Yanjie <zhouyanjie@wanyeetech.com> Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-15Merge tag 'timers-urgent-2020-08-14' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timekeeping updates from Thomas Gleixner: "A set of timekeeping/VDSO updates: - Preparatory work to allow S390 to switch over to the generic VDSO implementation. S390 requires that the VDSO data pointer is handed in to the counter read function when time namespace support is enabled. Adding the pointer is a NOOP for all other architectures because the compiler is supposed to optimize that out when it is unused in the architecture specific inline. The change also solved a similar problem for MIPS which fortunately has time namespaces not yet enabled. S390 needs to update clock related VDSO data independent of the timekeeping updates. This was solved so far with yet another sequence counter in the S390 implementation. A better solution is to utilize the already existing VDSO sequence count for this. The core code now exposes helper functions which allow to serialize against the timekeeper code and against concurrent readers. S390 needs extra data for their clock readout function. The initial common VDSO data structure did not provide a way to add that. It now has an embedded architecture specific struct embedded which defaults to an empty struct. Doing this now avoids tree dependencies and conflicts post rc1 and allows all other architectures which work on generic VDSO support to work from a common upstream base. - A trivial comment fix" * tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: time: Delete repeated words in comments lib/vdso: Allow to add architecture-specific vdso data timekeeping/vsyscall: Provide vdso_update_begin/end() vdso/treewide: Add vdso_data pointer argument to __arch_get_hw_counter()
2020-08-15Merge tag 'timers-core-2020-08-14' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull more timer updates from Thomas Gleixner: "A set of posix CPU timer changes which allows to defer the heavy work of posix CPU timers into task work context. The tick interrupt is reduced to a quick check which queues the work which is doing the heavy lifting before returning to user space or going back to guest mode. Moving this out is deferring the signal delivery slightly but posix CPU timers are inaccurate by nature as they depend on the tick so there is no real damage. The relevant test cases all passed. This lifts the last offender for RT out of the hard interrupt context tick handler, but it also has the general benefit that the actual heavy work is accounted to the task/process and not to the tick interrupt itself. Further optimizations are possible to break long sighand lock hold and interrupt disabled (on !RT kernels) times when a massive amount of posix CPU timers (which are unpriviledged) is armed for a task/process. This is currently only enabled for x86 because the architecture has to ensure that task work is handled in KVM before entering a guest, which was just established for x86 with the new common entry/exit code which got merged post 5.8 and is not the case for other KVM architectures" * tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Select POSIX_CPU_TIMERS_TASK_WORK posix-cpu-timers: Provide mechanisms to defer timer handling to task_work posix-cpu-timers: Split run_posix_cpu_timers()
2020-08-14Merge tag 'for-linus-5.9-rc1b-tag' of ↵Linus Torvalds19-1126/+285
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull more xen updates from Juergen Gross: - Remove support for running as 32-bit Xen PV-guest. 32-bit PV guests are rarely used, are lacking security fixes for Meltdown, and can be easily replaced by PVH mode. Another series for doing more cleanup will follow soon (removal of 32-bit-only pvops functionality). - Fixes and additional features for the Xen display frontend driver. * tag 'for-linus-5.9-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: drm/xen-front: Pass dumb buffer data offset to the backend xen: Sync up with the canonical protocol definition in Xen drm/xen-front: Add YUYV to supported formats drm/xen-front: Fix misused IS_ERR_OR_NULL checks xen/gntdev: Fix dmabuf import with non-zero sgt offset x86/xen: drop tests for highmem in pv code x86/xen: eliminate xen-asm_64.S x86/xen: remove 32-bit Xen PV guest support
2020-08-14Merge tag 'hyperv-fixes-signed' of ↵Linus Torvalds2-7/+12
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyper-v fixes from Wei Liu: - fix oops reporting on Hyper-V - make objtool happy * tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: x86/hyperv: Make hv_setup_sched_clock inline Drivers: hv: vmbus: Only notify Hyper-V for die events that are oops
2020-08-14x86/fsgsbase/64: Fix NULL deref in 86_fsgsbase_read_taskEric Dumazet1-1/+1
syzbot found its way in 86_fsgsbase_read_task() and triggered this oops: KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 6866 Comm: syz-executor262 Not tainted 5.8.0-syzkaller #0 RIP: 0010:x86_fsgsbase_read_task+0x16d/0x310 arch/x86/kernel/process_64.c:393 Call Trace: putreg32+0x3ab/0x530 arch/x86/kernel/ptrace.c:876 genregs32_set arch/x86/kernel/ptrace.c:1026 [inline] genregs32_set+0xa4/0x100 arch/x86/kernel/ptrace.c:1006 copy_regset_from_user include/linux/regset.h:326 [inline] ia32_arch_ptrace arch/x86/kernel/ptrace.c:1061 [inline] compat_arch_ptrace+0x36c/0xd90 arch/x86/kernel/ptrace.c:1198 __do_compat_sys_ptrace kernel/ptrace.c:1420 [inline] __se_compat_sys_ptrace kernel/ptrace.c:1389 [inline] __ia32_compat_sys_ptrace+0x220/0x2f0 kernel/ptrace.c:1389 do_syscall_32_irqs_on arch/x86/entry/common.c:84 [inline] __do_fast_syscall_32+0x57/0x80 arch/x86/entry/common.c:126 do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:149 entry_SYSENTER_compat_after_hwframe+0x4d/0x5c This can happen if ptrace() or sigreturn() pokes an LDT selector into FS or GS for a task with no LDT and something tries to read the base before a return to usermode notices the bad selector and fixes it. The fix is to make sure ldt pointer is not NULL. Fixes: 07e1d88adaae ("x86/fsgsbase/64: Fix ptrace() to read the FS/GS base accurately") Co-developed-by: Jann Horn <jannh@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Andy Lutomirski <luto@kernel.org> Cc: Chang S. Bae <chang.seok.bae@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Markus T Metzger <markus.t.metzger@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ravi Shankar <ravi.v.shankar@intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-14x86/boot: Check that there are no run-time relocationsArvind Sankar2-25/+11
Add a linker script check that there are no run-time relocations, and remove the old one that tries to check via looking for specially-named sections in the object files. Drop the tests for -fPIE compiler option and -pie linker option, as they are available in all supported gcc and binutils versions (as well as clang and lld). Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Fangrui Song <maskray@google.com> Reviewed-by: Sedat Dilek <sedat.dilek@gmail.com> Link: https://lore.kernel.org/r/20200731230820.1742553-8-keescook@chromium.org
2020-08-14x86/boot: Remove run-time relocations from head_{32,64}.SArvind Sankar4-21/+18
The BFD linker generates run-time relocations for z_input_len and z_output_len, even though they are absolute symbols. This is fixed for binutils-2.35 [1]. Work around this for earlier versions by defining two variables input_len and output_len in addition to the symbols, and use them via position-independent references. This eliminates the last two run-time relocations in the head code and allows us to drop the -z noreloc-overflow flag to the linker. Move the -pie and --no-dynamic-linker LDFLAGS to LDFLAGS_vmlinux instead of KBUILD_LDFLAGS. There shouldn't be anything else getting linked, but this is the more logical location for these flags, and modversions might call the linker if an EXPORT_SYMBOL is left over accidentally in one of the decompressors. [1] https://sourceware.org/bugzilla/show_bug.cgi?id=25754 Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Fangrui Song <maskray@google.com> Link: https://lore.kernel.org/r/20200731230820.1742553-7-keescook@chromium.org
2020-08-14x86/boot: Remove run-time relocations from .head.text codeArvind Sankar2-78/+90
The assembly code in head_{32,64}.S, while meant to be position-independent, generates run-time relocations because it uses instructions such as: leal gdt(%edx), %eax which make the assembler and linker think that the code is using %edx as an index into gdt, and hence gdt needs to be relocated to its run-time address. On 32-bit, with lld Dmitry Golovin reports that this results in a link-time error with default options (i.e. unless -z notext is explicitly passed): LD arch/x86/boot/compressed/vmlinux ld.lld: error: can't create dynamic relocation R_386_32 against local symbol in readonly segment; recompile object files with -fPIC or pass '-Wl,-z,notext' to allow text relocations in the output With the BFD linker, this generates a warning during the build, if --warn-shared-textrel is enabled, which at least Gentoo enables by default: LD arch/x86/boot/compressed/vmlinux ld: arch/x86/boot/compressed/head_32.o: warning: relocation in read-only section `.head.text' ld: warning: creating a DT_TEXTREL in object On 64-bit, it is not possible to link the kernel as -pie with lld, and it is only possible with a BFD linker that supports -z noreloc-overflow, i.e. versions >2.26. This is because these instructions cannot really be relocated: the displacement field is only 32-bits wide, and thus cannot be relocated for a 64-bit load address. The -z noreloc-overflow option simply overrides the linker error, and results in R_X86_64_RELATIVE relocations that apply a 64-bit relocation to a 32-bit field anyway. This happens to work because nothing will process these run-time relocations. Start fixing this by removing relocations from .head.text: - On 32-bit, use a base register that holds the address of the GOT and reference symbol addresses using @GOTOFF, i.e. leal gdt@GOTOFF(%edx), %eax - On 64-bit, most of the code can (and already does) use %rip-relative addressing, however the .code32 bits can't, and the 64-bit code also needs to reference symbol addresses as they will be after moving the compressed kernel to the end of the decompression buffer. For these cases, reference the symbols as an offset to startup_32 to avoid creating relocations, i.e.: leal (gdt-startup_32)(%bp), %eax This only works in .head.text as the subtraction cannot be represented as a PC-relative relocation unless startup_32 is in the same section as the code. Move efi32_pe_entry into .head.text so that it can use the same method to avoid relocations. Reported-by: Dmitry Golovin <dima@golovin.in> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Fangrui Song <maskray@google.com> Link: https://lore.kernel.org/r/20200731230820.1742553-6-keescook@chromium.org
2020-08-14x86/boot: Add .text.* to setup.ldArvind Sankar1-1/+1
GCC puts the main function into .text.startup when compiled with -Os (or -O2). This results in arch/x86/boot/main.c having a .text.startup section which is currently not included explicitly in the linker script setup.ld in the same directory. The BFD linker places this orphan section immediately after .text, so this still works. However, LLD git, since [1], is choosing to place it immediately after the .bstext section instead (this is the first code section). This plays havoc with the section layout that setup.elf requires to create the setup header, for eg on 64-bit: LD arch/x86/boot/setup.elf ld.lld: error: section .text.startup file range overlaps with .header >>> .text.startup range is [0x200040, 0x2001FE] >>> .header range is [0x2001EF, 0x20026B] ld.lld: error: section .header file range overlaps with .bsdata >>> .header range is [0x2001EF, 0x20026B] >>> .bsdata range is [0x2001FF, 0x200398] ld.lld: error: section .bsdata file range overlaps with .entrytext >>> .bsdata range is [0x2001FF, 0x200398] >>> .entrytext range is [0x20026C, 0x2002D3] ld.lld: error: section .text.startup virtual address range overlaps with .header >>> .text.startup range is [0x40, 0x1FE] >>> .header range is [0x1EF, 0x26B] ld.lld: error: section .header virtual address range overlaps with .bsdata >>> .header range is [0x1EF, 0x26B] >>> .bsdata range is [0x1FF, 0x398] ld.lld: error: section .bsdata virtual address range overlaps with .entrytext >>> .bsdata range is [0x1FF, 0x398] >>> .entrytext range is [0x26C, 0x2D3] ld.lld: error: section .text.startup load address range overlaps with .header >>> .text.startup range is [0x40, 0x1FE] >>> .header range is [0x1EF, 0x26B] ld.lld: error: section .header load address range overlaps with .bsdata >>> .header range is [0x1EF, 0x26B] >>> .bsdata range is [0x1FF, 0x398] ld.lld: error: section .bsdata load address range overlaps with .entrytext >>> .bsdata range is [0x1FF, 0x398] >>> .entrytext range is [0x26C, 0x2D3] Add .text.* to the .text output section to fix this, and also prevent any future surprises if the compiler decides to create other such sections. [1] https://reviews.llvm.org/D75225 Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Fangrui Song <maskray@google.com> Link: https://lore.kernel.org/r/20200731230820.1742553-5-keescook@chromium.org
2020-08-14x86/boot/compressed: Get rid of GOT fixup codeArd Biesheuvel3-80/+5
In a previous patch, we have eliminated GOT entries from the decompressor binary and added an assertion that the .got section is empty. This means that the GOT fixup routines that exist in both the 32-bit and 64-bit startup routines have become dead code, and can be removed. While at it, drop the KEEP() from the linker script, as it has no effect on the contents of output sections that are created by the linker itself. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Arvind Sankar <nivedita@alum.mit.edu> Link: https://lore.kernel.org/r/20200731230820.1742553-4-keescook@chromium.org
2020-08-14x86/boot/compressed: Force hidden visibility for all symbol referencesArd Biesheuvel2-0/+2
Eliminate all GOT entries in the decompressor binary, by forcing hidden visibility for all symbol references, which informs the compiler that such references will be resolved at link time without the need for allocating GOT entries. To ensure that no GOT entries will creep back in, add an assertion to the decompressor linker script that will fire if the .got section has a non-zero size. [Arvind: move hidden.h to include/linux instead of making a copy] Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Arvind Sankar <nivedita@alum.mit.edu> Link: https://lore.kernel.org/r/20200731230820.1742553-3-keescook@chromium.org
2020-08-14x86/boot/compressed: Move .got.plt entries out of the .got sectionArd Biesheuvel1-1/+10
The .got.plt section contains the part of the GOT which is used by PLT entries, and which gets updated lazily by the dynamic loader when function calls are dispatched through those PLT entries. On fully linked binaries such as the kernel proper or the decompressor, this never happens, and so in practice, the .got.plt section consists only of the first 3 magic entries that are meant to point at the _DYNAMIC section and at the fixup routine in the loader. However, since we don't use a dynamic loader, those entries are never populated or used. This means that treating those entries like ordinary GOT entries, and updating their values based on the actual placement of the executable in memory is completely pointless, and we can just ignore the .got.plt section entirely, provided that it has no additional entries beyond the first 3 ones. So add an assertion in the linker script to ensure that this assumption holds, and move the contents out of the [_got, _egot) memory range that is modified by the GOT fixup routines. While at it, drop the KEEP(), since it has no effect on the contents of output sections that are created by the linker itself. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Arvind Sankar <nivedita@alum.mit.edu> Link: https://lore.kernel.org/r/20200731230820.1742553-2-keescook@chromium.org
2020-08-14perf/x86/rapl: Add support for Intel SPR platformZhang Rui1-0/+20
Intel SPR platform uses fixed 16 bit energy unit for DRAM RAPL domain, and fixed 0 bit energy unit for Psys RAPL domain. After this, on SPR platform the energy counters appear in perf list. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Acked-by: Len Brown <len.brown@intel.com> Link: https://lore.kernel.org/r/20200811153149.12242-4-rui.zhang@intel.com
2020-08-14perf/x86/rapl: Support multiple RAPL unit quirksZhang Rui1-9/+15
There will be more platforms with different fixed energy units. Enhance the code to support different RAPL unit quirks for different platforms. Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Reviewed-by: Len Brown <len.brown@intel.com> Link: https://lore.kernel.org/r/20200811153149.12242-3-rui.zhang@intel.com
2020-08-14perf/x86/rapl: Fix missing psys sysfs attributesZhang Rui1-1/+1
This fixes a problem introduced by commit: 5fb5273a905c ("perf/x86/rapl: Use new MSR detection interface") that perf event sysfs attributes for psys RAPL domain are missing. Fixes: 5fb5273a905c ("perf/x86/rapl: Use new MSR detection interface") Signed-off-by: Zhang Rui <rui.zhang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kan Liang <kan.liang@linux.intel.com> Reviewed-by: Len Brown <len.brown@intel.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Link: https://lore.kernel.org/r/20200811153149.12242-2-rui.zhang@intel.com
2020-08-13x86/alternatives: Acquire pte lock with interrupts enabledSebastian Andrzej Siewior1-3/+3
pte lock is never acquired in-IRQ context so it does not require interrupts to be disabled. The lock is a regular spinlock which cannot be acquired with interrupts disabled on RT. RT complains about pte_lock() in __text_poke() because it's invoked after disabling interrupts. __text_poke() has to disable interrupts as use_temporary_mm() expects interrupts to be off because it invokes switch_mm_irqs_off() and uses per-CPU (current active mm) data. Move the PTE lock handling outside the interrupt disabled region. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by; Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20200813105026.bvugytmsso6muljw@linutronix.de
2020-08-12Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds3-13/+15
Pull more KVM updates from Paolo Bonzini: "PPC: - Improvements and bugfixes for secure VM support, giving reduced startup time and memory hotplug support. - Locking fixes in nested KVM code - Increase number of guests supported by HV KVM to 4094 - Preliminary POWER10 support ARM: - Split the VHE and nVHE hypervisor code bases, build the EL2 code separately, allowing for the VHE code to now be built with instrumentation - Level-based TLB invalidation support - Restructure of the vcpu register storage to accomodate the NV code - Pointer Authentication available for guests on nVHE hosts - Simplification of the system register table parsing - MMU cleanups and fixes - A number of post-32bit cleanups and other fixes MIPS: - compilation fixes x86: - bugfixes - support for the SERIALIZE instruction" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits) KVM: MIPS/VZ: Fix build error caused by 'kvm_run' cleanup x86/kvm/hyper-v: Synic default SCONTROL MSR needs to be enabled MIPS: KVM: Convert a fallthrough comment to fallthrough MIPS: VZ: Only include loongson_regs.h for CPU_LOONGSON64 x86: Expose SERIALIZE for supported cpuid KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled KVM: arm64: Move S1PTW S2 fault logic out of io_mem_abort() KVM: arm64: Don't skip cache maintenance for read-only memslots KVM: arm64: Handle data and instruction external aborts the same way KVM: arm64: Rename kvm_vcpu_dabt_isextabt() KVM: arm: Add trace name for ARM_NISV KVM: arm64: Ensure that all nVHE hyp code is in .hyp.text KVM: arm64: Substitute RANDOMIZE_BASE for HARDEN_EL2_VECTORS KVM: arm64: Make nVHE ASLR conditional on RANDOMIZE_BASE KVM: PPC: Book3S HV: Rework secure mem slot dropping KVM: PPC: Book3S HV: Move kvmppc_svm_page_out up KVM: PPC: Book3S HV: Migrate hot plugged memory KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs KVM: PPC: Book3S HV: Track the state GFNs associated with secure VMs KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START ...
2020-08-12Merge branch 'akpm' (patches from Andrew)Linus Torvalds4-17/+12
Merge more updates from Andrew Morton: - most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction, mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util, memory-hotplug, cleanups, uaccess, migration, gup, pagemap), - various other subsystems (alpha, misc, sparse, bitmap, lib, bitops, checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump, exec, kdump, rapidio, panic, kcov, kgdb, ipc). * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits) mm/gup: remove task_struct pointer for all gup code mm: clean up the last pieces of page fault accountings mm/xtensa: use general page fault accounting mm/x86: use general page fault accounting mm/sparc64: use general page fault accounting mm/sparc32: use general page fault accounting mm/sh: use general page fault accounting mm/s390: use general page fault accounting mm/riscv: use general page fault accounting mm/powerpc: use general page fault accounting mm/parisc: use general page fault accounting mm/openrisc: use general page fault accounting mm/nios2: use general page fault accounting mm/nds32: use general page fault accounting mm/mips: use general page fault accounting mm/microblaze: use general page fault accounting mm/m68k: use general page fault accounting mm/ia64: use general page fault accounting mm/hexagon: use general page fault accounting mm/csky: use general page fault accounting ...
2020-08-12mm/x86: use general page fault accountingPeter Xu1-15/+2
Use the general page fault accounting by passing regs into handle_mm_fault(). Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Link: http://lkml.kernel.org/r/20200707225021.200906-23-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm: do page fault accounting in handle_mm_faultPeter Xu1-1/+1
Patch series "mm: Page fault accounting cleanups", v5. This is v5 of the pf accounting cleanup series. It originates from Gerald Schaefer's report on an issue a week ago regarding to incorrect page fault accountings for retried page fault after commit 4064b9827063 ("mm: allow VM_FAULT_RETRY for multiple times"): https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/ What this series did: - Correct page fault accounting: we do accounting for a page fault (no matter whether it's from #PF handling, or gup, or anything else) only with the one that completed the fault. For example, page fault retries should not be counted in page fault counters. Same to the perf events. - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf event is used in an adhoc way across different archs. Case (1): for many archs it's done at the entry of a page fault handler, so that it will also cover e.g. errornous faults. Case (2): for some other archs, it is only accounted when the page fault is resolved successfully. Case (3): there're still quite some archs that have not enabled this perf event. Since this series will touch merely all the archs, we unify this perf event to always follow case (1), which is the one that makes most sense. And since we moved the accounting into handle_mm_fault, the other two MAJ/MIN perf events are well taken care of naturally. - Unify definition of "major faults": the definition of "major fault" is slightly changed when used in accounting (not VM_FAULT_MAJOR). More information in patch 1. - Always account the page fault onto the one that triggered the page fault. This does not matter much for #PF handlings, but mostly for gup. More information on this in patch 25. Patchset layout: Patch 1: Introduced the accounting in handle_mm_fault(), not enabled. Patch 2-23: Enable the new accounting for arch #PF handlers one by one. Patch 24: Enable the new accounting for the rest outliers (gup, iommu, etc.) Patch 25: Cleanup GUP task_struct pointer since it's not needed any more This patch (of 25): This is a preparation patch to move page fault accountings into the general code in handle_mm_fault(). This includes both the per task flt_maj/flt_min counters, and the major/minor page fault perf events. To do this, the pt_regs pointer is passed into handle_mm_fault(). PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault handlers. So far, all the pt_regs pointer that passed into handle_mm_fault() is NULL, which means this patch should have no intented functional change. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12uaccess: remove segment_eqChristoph Hellwig1-1/+1
segment_eq is only used to implement uaccess_kernel. Just open code uaccess_kernel in the arch uaccess headers and remove one layer of indirection. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Greentime Hu <green.hu@gmail.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Link: http://lkml.kernel.org/r/20200710135706.537715-5-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12mm/memory_hotplug: introduce default dummy memory_add_physaddr_to_nid()Jia He1-1/+0
This is to introduce a general dummy helper. memory_add_physaddr_to_nid() is a fallback option to get the nid in case NUMA_NO_NID is detected. After this patch, arm64/sh/s390 can simply use the general dummy version. PowerPC/x86/ia64 will still use their specific version. This is the preparation to set a fallback value for dev_dax->target_node. Signed-off-by: Jia He <justin.he@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Baoquan He <bhe@redhat.com> Cc: Chuhong Yuan <hslester96@gmail.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com> Cc: Kaly Xin <Kaly.Xin@arm.com> Link: http://lkml.kernel.org/r/20200710031619.18762-2-justin.he@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12x86/mm: use max memory block size on bare metalDaniel Jordan1-0/+9
Some of our servers spend significant time at kernel boot initializing memory block sysfs directories and then creating symlinks between them and the corresponding nodes. The slowness happens because the machines get stuck with the smallest supported memory block size on x86 (128M), which results in 16,288 directories to cover the 2T of installed RAM. The search for each memory block is noticeable even with commit 4fb6eabf1037 ("drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup"). Commit 078eb6aa50dc ("x86/mm/memory_hotplug: determine block size based on the end of boot memory") chooses the block size based on alignment with memory end. That addresses hotplug failures in qemu guests, but for bare metal systems whose memory end isn't aligned to even the smallest size, it leaves them at 128M. Make kernels that aren't running on a hypervisor use the largest supported size (2G) to minimize overhead on big machines. Kernel boot goes 7% faster on the aforementioned servers, shaving off half a second. [daniel.m.jordan@oracle.com: v3] Link: http://lkml.kernel.org/r/20200714205450.945834-1-daniel.m.jordan@oracle.com Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200609225451.3542648-1-daniel.m.jordan@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-12Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds1-2/+10
Pull virtio updates from Michael Tsirkin: - IRQ bypass support for vdpa and IFC - MLX5 vdpa driver - Endianness fixes for virtio drivers - Misc other fixes * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (71 commits) vdpa/mlx5: fix up endian-ness for mtu vdpa: Fix pointer math bug in vdpasim_get_config() vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config() vdpa/mlx5: fix memory allocation failure checks vdpa/mlx5: Fix uninitialised variable in core/mr.c vdpa_sim: init iommu lock virtio_config: fix up warnings on parisc vdpa/mlx5: Add VDPA driver for supported mlx5 devices vdpa/mlx5: Add shared memory registration code vdpa/mlx5: Add support library for mlx5 VDPA implementation vdpa/mlx5: Add hardware descriptive header file vdpa: Modify get_vq_state() to return error code net/vdpa: Use struct for set/get vq state vdpa: remove hard coded virtq num vdpasim: support batch updating vhost-vdpa: support IOTLB batching hints vhost-vdpa: support get/set backend features vhost: generialize backend features setting/getting vhost-vdpa: refine ioctl pre-processing vDPA: dont change vq irq after DRIVER_OK ...