summaryrefslogtreecommitdiff
path: root/arch/x86/mm
AgeCommit message (Collapse)AuthorFilesLines
2023-08-16x86/pkeys: Revert a5eff7259790 ("x86/pkeys: Add PKRU value to init_fpstate")Thomas Gleixner1-6/+0
commit b3607269ff57fd3c9690cb25962c5e4b91a0fd3b upstream. This cannot work and it's unclear how that ever made a difference. init_fpstate.xsave.header.xfeatures is always 0 so get_xsave_addr() will always return a NULL pointer, which will prevent storing the default PKRU value in init_fpstate. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121451.451391598@linutronix.de Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-08x86/mm: Use mm_alloc() in poking_init()Peter Zijlstra1-1/+2
commit 3f4c8211d982099be693be9aa7d6fc4607dff290 upstream. Instead of duplicating init_mm, allocate a fresh mm. The advantage is that mm_alloc() has much simpler dependencies. Additionally it makes more conceptual sense, init_mm has no (and must not have) user state to duplicate. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20221025201057.816175235@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-08x86/mm: fix poking_init() for Xen PV guestsJuergen Gross1-0/+4
commit 26ce6ec364f18d2915923bc05784084e54a5c4cc upstream. Commit 3f4c8211d982 ("x86/mm: Use mm_alloc() in poking_init()") broke the kernel for running as Xen PV guest. It seems as if the new address space is never activated before being used, resulting in Xen rejecting to accept the new CR3 value (the PGD isn't pinned). Fix that by adding the now missing call of paravirt_arch_dup_mmap() to poking_init(). That call was previously done by dup_mm()->dup_mmap() and it is a NOP for all cases but for Xen PV, where it is just doing the pinning of the PGD. Fixes: 3f4c8211d982 ("x86/mm: Use mm_alloc() in poking_init()") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20230109150922.10578-1-jgross@suse.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-27mm: rename p4d_page_vaddr to p4d_pgtable and make it return pud_t *Aneesh Kumar K.V1-2/+2
[ Upstream commit dc4875f0e791de554bdc45aa1dbd6e45e107e50f ] No functional change in this patch. [aneesh.kumar@linux.ibm.com: m68k build error reported by kernel robot] Link: https://lkml.kernel.org/r/87tulxnb2v.fsf@linux.ibm.com Link: https://lkml.kernel.org/r/20210615110859.320299-2-aneesh.kumar@linux.ibm.com Link: https://lore.kernel.org/linuxppc-dev/CAHk-=wi+J+iodze9FtjM3Zi4j4OeS+qqbKxME9QN4roxPEXH9Q@mail.gmail.com/ Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Stable-dep-of: 0da90af431ab ("powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-27mm: rename pud_page_vaddr to pud_pgtable and make it return pmd_t *Aneesh Kumar K.V2-3/+3
[ Upstream commit 9cf6fa2458443118b84090aa1bf7a3630b5940e8 ] No functional change in this patch. [aneesh.kumar@linux.ibm.com: fix] Link: https://lkml.kernel.org/r/87wnqtnb60.fsf@linux.ibm.com [sfr@canb.auug.org.au: another fix] Link: https://lkml.kernel.org/r/20210619134410.89559-1-aneesh.kumar@linux.ibm.com Link: https://lkml.kernel.org/r/20210615110859.320299-1-aneesh.kumar@linux.ibm.com Link: https://lore.kernel.org/linuxppc-dev/CAHk-=wi+J+iodze9FtjM3Zi4j4OeS+qqbKxME9QN4roxPEXH9Q@mail.gmail.com/ Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Stable-dep-of: 0da90af431ab ("powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-28x86/mm: Avoid using set_pgd() outside of real PGD pagesLee Jones1-4/+4
commit d082d48737c75d2b3cc1f972b8c8674c25131534 upstream. KPTI keeps around two PGDs: one for userspace and another for the kernel. Among other things, set_pgd() contains infrastructure to ensure that updates to the kernel PGD are reflected in the user PGD as well. One side-effect of this is that set_pgd() expects to be passed whole pages. Unfortunately, init_trampoline_kaslr() passes in a single entry: 'trampoline_pgd_entry'. When KPTI is on, set_pgd() will update 'trampoline_pgd_entry' (an 8-Byte globally stored [.bss] variable) and will then proceed to replicate that value into the non-existent neighboring user page (located +4k away), leading to the corruption of other global [.bss] stored variables. Fix it by directly assigning 'trampoline_pgd_entry' and avoiding set_pgd(). [ dhansen: tweak subject and changelog ] Fixes: 0925dda5962e ("x86/mm/KASLR: Use only one PUD entry for real mode trampoline") Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Lee Jones <lee@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/all/20230614163859.924309-1-lee@kernel.org/g Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-30x86/mm: Avoid incomplete Global INVLPG flushesDave Hansen1-0/+25
commit ce0b15d11ad837fbacc5356941712218e38a0a83 upstream. The INVLPG instruction is used to invalidate TLB entries for a specified virtual address. When PCIDs are enabled, INVLPG is supposed to invalidate TLB entries for the specified address for both the current PCID *and* Global entries. (Note: Only kernel mappings set Global=1.) Unfortunately, some INVLPG implementations can leave Global translations unflushed when PCIDs are enabled. As a workaround, never enable PCIDs on affected processors. I expect there to eventually be microcode mitigations to replace this software workaround. However, the exact version numbers where that will happen are not known today. Once the version numbers are set in stone, the processor list can be tweaked to only disable PCIDs on affected processors with affected microcode. Note: if anyone wants a quick fix that doesn't require patching, just stick 'nopcid' on your kernel command-line. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-22x86/mm: Fix use of uninitialized buffer in sme_enable()Nikita Zhandarovich1-1/+2
commit cbebd68f59f03633469f3ecf9bea99cd6cce3854 upstream. cmdline_find_option() may fail before doing any initialization of the buffer array. This may lead to unpredictable results when the same buffer is used later in calls to strncmp() function. Fix the issue by returning early if cmdline_find_option() returns an error. Found by Linux Verification Center (linuxtesting.org) with static analysis tool SVACE. Fixes: aca20d546214 ("x86/mm: Add support to make use of Secure Memory Encryption") Signed-off-by: Nikita Zhandarovich <n.zhandarovich@fintech.ru> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20230306160656.14844-1-n.zhandarovich@fintech.ru Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-02x86/ioremap: Fix page aligned size calculation in __ioremap_caller()Michael Kelley1-1/+7
commit 4dbd6a3e90e03130973688fd79e19425f720d999 upstream. Current code re-calculates the size after aligning the starting and ending physical addresses on a page boundary. But the re-calculation also embeds the masking of high order bits that exceed the size of the physical address space (via PHYSICAL_PAGE_MASK). If the masking removes any high order bits, the size calculation results in a huge value that is likely to immediately fail. Fix this by re-calculating the page-aligned size first. Then mask any high order bits using PHYSICAL_PAGE_MASK. Fixes: ffa71f33a820 ("x86, ioremap: Fix incorrect physical address handling in PAE mode") Signed-off-by: Michael Kelley <mikelley@microsoft.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/1668624097-14884-2-git-send-email-mikelley@microsoft.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-25x86/mm: Use proper mask when setting PUD mappingAaron Lu1-1/+1
commit 88e0a74902f894fbbc55ad3ad2cb23b4bfba555c upstream. Commit c164fbb40c43f("x86/mm: thread pgprot_t through init_memory_mapping()") mistakenly used __pgprot() which doesn't respect __default_kernel_pte_mask when setting PUD mapping. Fix it by only setting the one bit we actually need (PSE) and leaving the other bits (that have been properly masked) alone. Fixes: c164fbb40c43 ("x86/mm: thread pgprot_t through init_memory_mapping()") Signed-off-by: Aaron Lu <aaron.lu@intel.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-21x86/numa: Use cpumask_available instead of hardcoded NULL checkSiddh Raman Pant1-2/+2
[ Upstream commit 625395c4a0f4775e0fe00f616888d2e6c1ba49db ] GCC-12 started triggering a new warning: arch/x86/mm/numa.c: In function ‘cpumask_of_node’: arch/x86/mm/numa.c:916:39: warning: the comparison will always evaluate as ‘false’ for the address of ‘node_to_cpumask_map’ will never be NULL [-Waddress] 916 | if (node_to_cpumask_map[node] == NULL) { | ^~ node_to_cpumask_map is of type cpumask_var_t[]. When CONFIG_CPUMASK_OFFSTACK is set, cpumask_var_t is typedef'd to a pointer for dynamic allocation, else to an array of one element. The "wicked game" can be checked on line 700 of include/linux/cpumask.h. The original code in debug_cpumask_set_cpu() and cpumask_of_node() were probably written by the original authors with CONFIG_CPUMASK_OFFSTACK=y (i.e. dynamic allocation) in mind, checking if the cpumask was available via a direct NULL check. When CONFIG_CPUMASK_OFFSTACK is not set, GCC gives the above warning while compiling the kernel. Fix that by using cpumask_available(), which does the NULL check when CONFIG_CPUMASK_OFFSTACK is set, otherwise returns true. Use it wherever such checks are made. Conditional definitions of cpumask_available() can be found along with the definition of cpumask_var_t. Check the cpumask.h reference mentioned above. Fixes: c032ef60d1aa ("cpumask: convert node_to_cpumask_map[] to cpumask_var_t") Fixes: de2d9445f162 ("x86: Unify node_to_cpumask_map handling between 32 and 64bit") Signed-off-by: Siddh Raman Pant <code@siddh.me> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20220731160913.632092-1-code@siddh.me Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-07-25objtool: Update Retpoline validationPeter Zijlstra1-0/+2
commit 9bb2ec608a209018080ca262f771e6a9ff203b6f upstream. Update retpoline validation with the new CONFIG_RETPOLINE requirement of not having bare naked RET instructions. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> [cascardo: conflict fixup at arch/x86/xen/xen-head.S] Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-25x86/sev: Avoid using __x86_return_thunkKim Phillips1-2/+6
commit 0ee9073000e8791f8b134a8ded31bcc767f7f232 upstream. Specifically, it's because __enc_copy() encrypts the kernel after being relocated outside the kernel in sme_encrypt_execute(), and the RET macro's jmp offset isn't amended prior to execution. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-25x86: Prepare asm files for straight-line-speculationPeter Zijlstra1-2/+2
commit f94909ceb1ed4bfdb2ada72f93236305e6d6951f upstream. Replace all ret/retq instructions with RET in preparation of making RET a macro. Since AS is case insensitive it's a big no-op without RET defined. find arch/x86/ -name \*.S | while read file do sed -i 's/\<ret[q]*\>/RET/' $file done Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211204134907.905503893@infradead.org [bwh: Backported to 5.10: ran the above command] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-21x86/pat: Fix x86_has_pat_wp()Juergen Gross1-2/+12
commit 230ec83d4299b30c51a1c133b4f2a669972cc08a upstream. x86_has_pat_wp() is using a wrong test, as it relies on the normal PAT configuration used by the kernel. In case the PAT MSR has been setup by another entity (e.g. Xen hypervisor) it might return false even if the PAT configuration is allowing WP mappings. This due to the fact that when running as Xen PV guest the PAT MSR is setup by the hypervisor and cannot be changed by the guest. This results in the WP related entry to be at a different position when running as Xen PV guest compared to the bare metal or fully virtualized case. The correct way to test for WP support is: 1. Get the PTE protection bits needed to select WP mode by reading __cachemode2pte_tbl[_PAGE_CACHE_MODE_WP] (depending on the PAT MSR setting this might return protection bits for a stronger mode, e.g. UC-) 2. Translate those bits back into the real cache mode selected by those PTE bits by reading __pte2cachemode_tbl[__pte2cm_idx(prot)] 3. Test for the cache mode to be _PAGE_CACHE_MODE_WP Fixes: f88a68facd9a ("x86/mm: Extend early_memremap() support with additional attrs") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> # 4.14 Link: https://lore.kernel.org/r/20220503132207.17234-1-jgross@suse.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-09x86: Fix return value of __setup handlersRandy Dunlap1-1/+1
[ Upstream commit 12441ccdf5e2f5a01a46e344976cbbd3d46845c9 ] __setup() handlers should return 1 to obsolete_checksetup() in init/main.c to indicate that the boot option has been handled. A return of 0 causes the boot option/value to be listed as an Unknown kernel parameter and added to init's (limited) argument (no '=') or environment (with '=') strings. So return 1 from these x86 __setup handlers. Examples: Unknown kernel command line parameters "apicpmtimer BOOT_IMAGE=/boot/bzImage-517rc8 vdso=1 ring3mwait=disable", will be passed to user space. Run /sbin/init as init process with arguments: /sbin/init apicpmtimer with environment: HOME=/ TERM=linux BOOT_IMAGE=/boot/bzImage-517rc8 vdso=1 ring3mwait=disable Fixes: 2aae950b21e4 ("x86_64: Add vDSO for x86-64 with gettimeofday/clock_gettime/getcpu") Fixes: 77b52b4c5c66 ("x86: add "debugpat" boot option") Fixes: e16fd002afe2 ("x86/cpufeature: Enable RING3MWAIT for Knights Landing") Fixes: b8ce33590687 ("x86_64: convert to clock events") Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru Link: https://lore.kernel.org/r/20220314012725.26661-1-rdunlap@infradead.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-03-16x86/boot: Add setup_indirect support in early_memremap_is_setup_data()Ross Philipson1-2/+31
commit 445c1470b6ef96440e7cfc42dfc160f5004fd149 upstream. The x86 boot documentation describes the setup_indirect structures and how they are used. Only one of the two functions in ioremap.c that needed to be modified to be aware of the introduction of setup_indirect functionality was updated. Adds comparable support to the other function where it was missing. Fixes: b3c72fc9a78e ("x86/boot: Introduce setup_indirect") Signed-off-by: Ross Philipson <ross.philipson@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/1645668456-22036-3-git-send-email-ross.philipson@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16x86/boot: Fix memremap of setup_indirect structuresRoss Philipson1-4/+20
commit 7228918b34615ef6317edcd9a058a057bc54aa32 upstream. As documented, the setup_indirect structure is nested inside the setup_data structures in the setup_data list. The code currently accesses the fields inside the setup_indirect structure but only the sizeof(struct setup_data) is being memremapped. No crash occurred but this is just due to how the area is remapped under the covers. Properly memremap both the setup_data and setup_indirect structures in these cases before accessing them. Fixes: b3c72fc9a78e ("x86/boot: Introduce setup_indirect") Signed-off-by: Ross Philipson <ross.philipson@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/1645668456-22036-2-git-send-email-ross.philipson@oracle.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/sev: Make the #VC exception stacks part of the default stacks storageBorislav Petkov1-0/+7
commit 541ac97186d9ea88491961a46284de3603c914fd upstream. The size of the exception stacks was increased by the commit in Fixes, resulting in stack sizes greater than a page in size. The #VC exception handling was only mapping the first (bottom) page, resulting in an SEV-ES guest failing to boot. Make the #VC exception stacks part of the default exception stacks storage and allocate them with a CONFIG_AMD_MEM_ENCRYPT=y .config. Map them only when a SEV-ES guest has been detected. Rip out the custom VC stacks mapping and storage code. [ bp: Steal and adapt Tom's commit message. ] Fixes: 7fae4c24a2b8 ("x86: Increase exception stack sizes") Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Tested-by: Brijesh Singh <brijesh.singh@amd.com> Link: https://lkml.kernel.org/r/YVt1IMjIs7pIZTRR@zn.tnic Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/sev: Add an x86 version of cc_platform_has()Tom Lendacky1-0/+1
commit aa5a461171f98fde0df78c4f6b5018a1e967cf81 upstream. Introduce an x86 version of the cc_platform_has() function. This will be used to replace vendor specific calls like sme_active(), sev_active(), etc. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210928191009.32551-4-bp@alien8.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/sme: Use #define USE_EARLY_PGTABLE_L5 in mem_encrypt_identity.cTom Lendacky1-0/+9
commit e7d445ab26db833d6640d4c9a08bee176777cc82 upstream. When runtime support for converting between 4-level and 5-level pagetables was added to the kernel, the SME code that built pagetables was updated to use the pagetable functions, e.g. p4d_offset(), etc., in order to simplify the code. However, the use of the pagetable functions in early boot code requires the use of the USE_EARLY_PGTABLE_L5 #define in order to ensure that the proper definition of pgtable_l5_enabled() is used. Without the #define, pgtable_l5_enabled() is #defined as cpu_feature_enabled(X86_FEATURE_LA57). In early boot, the CPU features have not yet been discovered and populated, so pgtable_l5_enabled() will return false even when 5-level paging is enabled. This causes the SME code to always build 4-level pagetables to perform the in-place encryption. If 5-level paging is enabled, switching to the SME pagetables results in a page-fault that kills the boot. Adding the #define results in pgtable_l5_enabled() using the __pgtable_l5_enabled variable set in early boot and the SME code building pagetables for the proper paging level. Fixes: aad983913d77 ("x86/mm/encrypt: Simplify sme_populate_pgd() and sme_populate_pgd_large()") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> # 4.18.x Link: https://lkml.kernel.org/r/2cb8329655f5c753905812d951e212022a480475.1634318656.git.thomas.lendacky@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-22x86/mm: Fix kern_addr_valid() to cope with existing but not present entriesMike Rapoport1-3/+3
commit 34b1999da935a33be6239226bfa6cd4f704c5c88 upstream. Jiri Olsa reported a fault when running: # cat /proc/kallsyms | grep ksys_read ffffffff8136d580 T ksys_read # objdump -d --start-address=0xffffffff8136d580 --stop-address=0xffffffff8136d590 /proc/kcore /proc/kcore: file format elf64-x86-64 Segmentation fault general protection fault, probably for non-canonical address 0xf887ffcbff000: 0000 [#1] SMP PTI CPU: 12 PID: 1079 Comm: objdump Not tainted 5.14.0-rc5qemu+ #508 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-4.fc34 04/01/2014 RIP: 0010:kern_addr_valid Call Trace: read_kcore ? rcu_read_lock_sched_held ? rcu_read_lock_sched_held ? rcu_read_lock_sched_held ? trace_hardirqs_on ? rcu_read_lock_sched_held ? lock_acquire ? lock_acquire ? rcu_read_lock_sched_held ? lock_acquire ? rcu_read_lock_sched_held ? rcu_read_lock_sched_held ? rcu_read_lock_sched_held ? lock_release ? _raw_spin_unlock ? __handle_mm_fault ? rcu_read_lock_sched_held ? lock_acquire ? rcu_read_lock_sched_held ? lock_release proc_reg_read ? vfs_read vfs_read ksys_read do_syscall_64 entry_SYSCALL_64_after_hwframe The fault happens because kern_addr_valid() dereferences existent but not present PMD in the high kernel mappings. Such PMDs are created when free_kernel_image_pages() frees regions larger than 2Mb. In this case, a part of the freed memory is mapped with PMDs and the set_memory_np_noalias() -> ... -> __change_page_attr() sequence will mark the PMD as not present rather than wipe it completely. Have kern_addr_valid() check whether higher level page table entries are present before trying to dereference them to fix this issue and to avoid similar issues in the future. Stable backporting note: ------------------------ Note that the stable marking is for all active stable branches because there could be cases where pagetable entries exist but are not valid - see 9a14aefc1d28 ("x86: cpa, fix lookup_address"), for example. So make sure to be on the safe side here and use pXY_present() accessors rather than pXY_none() which could #GP when accessing pages in the direct map. Also see: c40a56a7818c ("x86/mm/init: Remove freed kernel image areas from alias mapping") for more info. Reported-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Tested-by: Jiri Olsa <jolsa@redhat.com> Cc: <stable@vger.kernel.org> # 4.4+ Link: https://lkml.kernel.org/r/20210819132717.19358-1-rppt@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-22x86/pat: Pass valid address to sanitize_phys()Jeff Moyer1-1/+6
commit aeef8b5089b76852bd84889f2809e69a7cfb414e upstream. The end address passed to memtype_reserve() is handed directly to sanitize_phys(). However, end is exclusive and sanitize_phys() expects an inclusive address. If end falls at the end of the physical address space, sanitize_phys() will return 0. This can result in drivers failing to load, and the following warning: WARNING: CPU: 26 PID: 749 at arch/x86/mm/pat.c:354 reserve_memtype+0x262/0x450 reserve_memtype failed: [mem 0x3ffffff00000-0xffffffffffffffff], req uncached-minus Call Trace: [<ffffffffa427b1f2>] reserve_memtype+0x262/0x450 [<ffffffffa42764aa>] ioremap_nocache+0x1a/0x20 [<ffffffffc04620a1>] mpt3sas_base_map_resources+0x151/0xa60 [mpt3sas] [<ffffffffc0465555>] mpt3sas_base_attach+0xf5/0xa50 [mpt3sas] ---[ end trace 6d6eea4438db89ef ]--- ioremap reserve_memtype failed -22 mpt3sas_cm0: unable to map adapter memory! or resource not found mpt3sas_cm0: failure at drivers/scsi/mpt3sas/mpt3sas_scsih.c:10597/_scsih_probe()! Fix this by passing the inclusive end address to sanitize_phys(). Fixes: 510ee090abc3 ("x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses") Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/x49o8a3pu5i.fsf@segfault.boston.devel.redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-23x86/ioremap: Map EFI-reserved memory as encrypted for SEVTom Lendacky1-1/+3
commit 8d651ee9c71bb12fc0c8eb2786b66cbe5aa3e43b upstream. Some drivers require memory that is marked as EFI boot services data. In order for this memory to not be re-used by the kernel after ExitBootServices(), efi_mem_reserve() is used to preserve it by inserting a new EFI memory descriptor and marking it with the EFI_MEMORY_RUNTIME attribute. Under SEV, memory marked with the EFI_MEMORY_RUNTIME attribute needs to be mapped encrypted by Linux, otherwise the kernel might crash at boot like below: EFI Variables Facility v0.08 2004-May-17 general protection fault, probably for non-canonical address 0x3597688770a868b2: 0000 [#1] SMP NOPTI CPU: 13 PID: 1 Comm: swapper/0 Not tainted 5.12.4-2-default #1 openSUSE Tumbleweed Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:efi_mokvar_entry_next [...] Call Trace: efi_mokvar_sysfs_init ? efi_mokvar_table_init do_one_initcall ? __kmalloc kernel_init_freeable ? rest_init kernel_init ret_from_fork Expand the __ioremap_check_other() function to additionally check for this other type of boot data reserved at runtime and indicate that it should be mapped encrypted for an SEV guest. [ bp: Massage commit message. ] Fixes: 58c909022a5a ("efi: Support for MOK variable config table") Reported-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Joerg Roedel <jroedel@suse.de> Cc: <stable@vger.kernel.org> # 5.10+ Link: https://lkml.kernel.org/r/20210608095439.12668-2-joro@8bytes.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-23x86/mm: Avoid truncating memblocks for SGX memoryFan Du1-1/+7
commit 28e5e44aa3f4e0e0370864ed008fb5e2d85f4dc8 upstream. tl;dr: Several SGX users reported seeing the following message on NUMA systems: sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0. This turned out to be the memblock code mistakenly throwing away SGX memory. === Full Changelog === The 'max_pfn' variable represents the highest known RAM address. It can be used, for instance, to quickly determine for which physical addresses there is mem_map[] space allocated. The numa_meminfo code makes an effort to throw out ("trim") all memory blocks which are above 'max_pfn'. SGX memory is not considered RAM (it is marked as "Reserved" in the e820) and is not taken into account by max_pfn. Despite this, SGX memory areas have NUMA affinity and are enumerated in the ACPI SRAT table. The existing SGX code uses the numa_meminfo mechanism to look up the NUMA affinity for its memory areas. In cases where SGX memory was above max_pfn (usually just the one EPC section in the last highest NUMA node), the numa_memblock is truncated at 'max_pfn', which is below the SGX memory. When the SGX code tries to look up the affinity of this memory, it fails and produces an error message: sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0. and assigns the memory to NUMA node 0. Instead of silently truncating the memory block at 'max_pfn' and dropping the SGX memory, add the truncated portion to 'numa_reserved_meminfo'. This allows the SGX code to later determine the NUMA affinity of its 'Reserved' area. Before, numa_meminfo looked like this (from 'crash'): blk = { start = 0x0, end = 0x2080000000, nid = 0x0 } { start = 0x2080000000, end = 0x4000000000, nid = 0x1 } numa_reserved_meminfo is empty. With this, numa_meminfo looks like this: blk = { start = 0x0, end = 0x2080000000, nid = 0x0 } { start = 0x2080000000, end = 0x4000000000, nid = 0x1 } and numa_reserved_meminfo has an entry for node 1's SGX memory: blk = { start = 0x4000000000, end = 0x4080000000, nid = 0x1 } [ daveh: completely rewrote/reworked changelog ] Fixes: 5d30f92e7631 ("x86/NUMA: Provide a range-to-target_node lookup facility") Reported-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Fan Du <fan.du@intel.com> Signed-off-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20210617194657.0A99CB22@viggo.jf.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-06-10x86/sev: Check SME/SEV support in CPUID firstPu Wen1-5/+6
commit 009767dbf42ac0dbe3cf48c1ee224f6b778aa85a upstream. The first two bits of the CPUID leaf 0x8000001F EAX indicate whether SEV or SME is supported, respectively. It's better to check whether SEV or SME is actually supported before accessing the MSR_AMD64_SEV to check whether SEV or SME is enabled. This is both a bare-metal issue and a guest/VM issue. Since the first generation Hygon Dhyana CPU doesn't support the MSR_AMD64_SEV, reading that MSR results in a #GP - either directly from hardware in the bare-metal case or via the hypervisor (because the RDMSR is actually intercepted) in the guest/VM case, resulting in a failed boot. And since this is very early in the boot phase, rdmsrl_safe()/native_read_msr_safe() can't be used. So check the CPUID bits first, before accessing the MSR. [ tlendacky: Expand and improve commit message. ] [ bp: Massage commit message. ] Fixes: eab696d8e8b9 ("x86/sev: Do not require Hypervisor CPUID bit for SEV guests") Signed-off-by: Pu Wen <puwen@hygon.cn> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@vger.kernel.org> # v5.10+ Link: https://lkml.kernel.org/r/20210602070207.2480-1-puwen@hygon.cn Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-05-11x86/sev: Do not require Hypervisor CPUID bit for SEV guestsJoerg Roedel1-16/+19
[ Upstream commit eab696d8e8b9c9d600be6fad8dd8dfdfaca6ca7c ] A malicious hypervisor could disable the CPUID intercept for an SEV or SEV-ES guest and trick it into the no-SEV boot path, where it could potentially reveal secrets. This is not an issue for SEV-SNP guests, as the CPUID intercept can't be disabled for those. Remove the Hypervisor CPUID bit check from the SEV detection code to protect against this kind of attack and add a Hypervisor bit equals zero check to the SME detection path to prevent non-encrypted guests from trying to enable SME. This handles the following cases: 1) SEV(-ES) guest where CPUID intercept is disabled. The guest will still see leaf 0x8000001f and the SEV bit. It can retrieve the C-bit and boot normally. 2) Non-encrypted guests with intercepted CPUID will check the SEV_STATUS MSR and find it 0 and will try to enable SME. This will fail when the guest finds MSR_K8_SYSCFG to be zero, as it is emulated by KVM. But we can't rely on that, as there might be other hypervisors which return this MSR with bit 23 set. The Hypervisor bit check will prevent that the guest tries to enable SME in this case. 3) Non-encrypted guests on SEV capable hosts with CPUID intercept disabled (by a malicious hypervisor) will try to boot into the SME path. This will fail, but it is also not considered a problem because non-encrypted guests have no protection against the hypervisor anyway. [ bp: s/non-SEV/non-encrypted/g ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20210312123824.306-3-joro@8bytes.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-30x86/mem_encrypt: Correct physical address calculation in __set_clr_pte_enc()Isaku Yamahata1-1/+1
commit 8249d17d3194eac064a8ca5bc5ca0abc86feecde upstream. The pfn variable contains the page frame number as returned by the pXX_pfn() functions, shifted to the right by PAGE_SHIFT to remove the page bits. After page protection computations are done to it, it gets shifted back to the physical address using page_level_shift(). That is wrong, of course, because that function determines the shift length based on the level of the page in the page table but in all the cases, it was shifted by PAGE_SHIFT before. Therefore, shift it back using PAGE_SHIFT to get the correct physical address. [ bp: Rewrite commit message. ] Fixes: dfaaec9033b8 ("x86: Add support for changing memory encryption attribute in early boot") Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/81abbae1657053eccc535c16151f63cd049dcb97.1616098294.git.isaku.yamahata@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04x86: fix seq_file iteration for pat/memtype.cNeilBrown1-2/+2
commit 3d2fc4c082448e9c05792f9b2a11c1d5db408b85 upstream. The memtype seq_file iterator allocates a buffer in the ->start and ->next functions and frees it in the ->show function. The preferred handling for such resources is to free them in the subsequent ->next or ->stop function call. Since Commit 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") there is no guarantee that ->show will be called after ->next, so this function can now leak memory. So move the freeing of the buffer to ->next and ->stop. Link: https://lkml.kernel.org/r/161248539022.21478.13874455485854739066.stgit@noble1 Fixes: 1f4aace60b0e ("fs/seq_file.c: simplify seq_file iteration code and interface") Signed-off-by: NeilBrown <neilb@suse.de> Cc: Xin Long <lucien.xin@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04x86/fault: Fix AMD erratum #91 errata fixup for user codeAndy Lutomirski1-10/+17
commit 35f1c89b0cce247bf0213df243ed902989b1dcda upstream. The recent rework of probe_kernel_address() and its conversion to get_kernel_nofault() inadvertently broke is_prefetch(). Before this change, probe_kernel_address() was used as a sloppy "read user or kernel memory" helper, but it doesn't do that any more. The new get_kernel_nofault() reads *kernel* memory only, which completely broke is_prefetch() for user access. Adjust the code to the correct accessor based on access mode. The manual address bounds check is no longer necessary, since the accessor helpers (get_user() / get_kernel_nofault()) do the right thing all by themselves. As a bonus, by using the correct accessor, the open-coded address bounds check is not needed anymore. [ bp: Massage commit message. ] Fixes: eab0c6089b68 ("maccess: unify the probe kernel arch hooks") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/b91f7f92f3367d2d3a88eec3b09c6aab1b2dc8ef.1612924255.git.luto@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-02-10KVM: SVM: Treat SVM as unsupported when running as an SEV guestSean Christopherson1-0/+1
commit ccd85d90ce092bdb047a7f6580f3955393833b22 upstream. Don't let KVM load when running as an SEV guest, regardless of what CPUID says. Memory is encrypted with a key that is not accessible to the host (L0), thus it's impossible for L0 to emulate SVM, e.g. it'll see garbage when reading the VMCB. Technically, KVM could decrypt all memory that needs to be accessible to the L0 and use shadow paging so that L0 does not need to shadow NPT, but exposing such information to L0 largely defeats the purpose of running as an SEV guest. This can always be revisited if someone comes up with a use case for running VMs inside SEV guests. Note, VMLOAD, VMRUN, etc... will also #GP on GPAs with C-bit set, i.e. KVM is doomed even if the SEV guest is debuggable and the hypervisor is willing to decrypt the VMCB. This may or may not be fixed on CPUs that have the SVME_ADDR_CHK fix. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210202212017.2486595-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-01-12x86/mm: Fix leak of pmd ptlockDan Williams1-0/+2
commit d1c5246e08eb64991001d97a3bd119c93edbc79a upstream. Commit 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces") introduced a new location where a pmd was released, but neglected to run the pmd page destructor. In fact, this happened previously for a different pmd release path and was fixed by commit: c283610e44ec ("x86, mm: do not leak page->ptl for pmd page tables"). This issue was hidden until recently because the failure mode is silent, but commit: b2b29d6d0119 ("mm: account PMD tables like PTE tables") turns the failure mode into this signature: BUG: Bad page state in process lt-pmem-ns pfn:15943d page:000000007262ed7b refcount:0 mapcount:-1024 mapping:0000000000000000 index:0x0 pfn:0x15943d flags: 0xaffff800000000() raw: 00affff800000000 dead000000000100 0000000000000000 0000000000000000 raw: 0000000000000000 ffff913a029bcc08 00000000fffffbff 0000000000000000 page dumped because: nonzero mapcount [..] dump_stack+0x8b/0xb0 bad_page.cold+0x63/0x94 free_pcp_prepare+0x224/0x270 free_unref_page+0x18/0xd0 pud_free_pmd_page+0x146/0x160 ioremap_pud_range+0xe3/0x350 ioremap_page_range+0x108/0x160 __ioremap_caller.constprop.0+0x174/0x2b0 ? memremap+0x7a/0x110 memremap+0x7a/0x110 devm_memremap+0x53/0xa0 pmem_attach_disk+0x4ed/0x530 [nd_pmem] ? __devm_release_region+0x52/0x80 nvdimm_bus_probe+0x85/0x210 [libnvdimm] Given this is a repeat occurrence it seemed prudent to look for other places where this destructor might be missing and whether a better helper is needed. try_to_free_pmd_page() looks like a candidate, but testing with setting up and tearing down pmd mappings via the dax unit tests is thus far not triggering the failure. As for a better helper pmd_free() is close, but it is a messy fit due to requiring an @mm arg. Also, ___pmd_free_tlb() wants to call paravirt_tlb_remove_table() instead of free_page(), so open-coded pgtable_pmd_page_dtor() seems the best way forward for now. Debugged together with Matthew Wilcox <willy@infradead.org>. Fixes: 28ee90fe6048 ("x86/mm: implement free pmd/pte page interfaces") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Yi Zhang <yi.zhang@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/160697689204.605323.17629854984697045602.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-30x86/mm/ident_map: Check for errors from ident_pud_init()Arvind Sankar1-2/+10
[ Upstream commit 1fcd009102ee02e217f2e7635ab65517d785da8e ] Commit ea3b5e60ce80 ("x86/mm/ident_map: Add 5-level paging support") added ident_p4d_init() to support 5-level paging, but this function doesn't check and return errors from ident_pud_init(). For example, the decompressor stub uses this code to create an identity mapping. If it runs out of pages while trying to allocate a PMD pagetable, the error will be currently ignored. Fix this to propagate errors. [ bp: Space out statements for better readability. ] Fixes: ea3b5e60ce80 ("x86/mm/ident_map: Add 5-level paging support") Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Joerg Roedel <jroedel@suse.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lkml.kernel.org/r/20201027230648.1885111-1-nivedita@alum.mit.edu Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-12-10x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WPArvind Sankar1-2/+2
The PAT bit is in different locations for 4k and 2M/1G page table entries. Add a definition for _PAGE_LARGE_CACHE_MASK to represent the three caching bits (PWT, PCD, PAT), similar to _PAGE_CACHE_MASK for 4k pages, and use it in the definition of PMD_FLAGS_DEC_WP to get the correct PAT index for write-protected pages. Fixes: 6ebcb060713f ("x86/mm: Add support to encrypt the kernel in-place") Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20201111160946.147341-1-nivedita@alum.mit.edu
2020-12-09x86/membarrier: Get rid of a dubious optimizationAndy Lutomirski1-2/+8
sync_core_before_usermode() had an incorrect optimization. If the kernel returns from an interrupt, it can get to usermode without IRET. It just has to schedule to a different task in the same mm and do SYSRET. Fortunately, there were no callers of sync_core_before_usermode() that could have had in_irq() or in_nmi() equal to true, because it's only ever called from the scheduler. While at it, clarify a related comment. Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/5afc7632be1422f91eaf7611aaaa1b5b8580a086.1607058304.git.luto@kernel.org
2020-11-22mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exportsDan Williams1-0/+2
The core-mm has a default __weak implementation of phys_to_target_node() to mirror the weak definition of memory_add_physaddr_to_nid(). That symbol is exported for modules. However, while the export in mm/memory_hotplug.c exported the symbol in the configuration cases of: CONFIG_NUMA_KEEP_MEMINFO=y CONFIG_MEMORY_HOTPLUG=y ...and: CONFIG_NUMA_KEEP_MEMINFO=n CONFIG_MEMORY_HOTPLUG=y ...it failed to export the symbol in the case of: CONFIG_NUMA_KEEP_MEMINFO=y CONFIG_MEMORY_HOTPLUG=n Not only is that broken, but Christoph points out that the kernel should not be exporting any __weak symbol, which means that memory_add_physaddr_to_nid() example that phys_to_target_node() copied is broken too. Rework the definition of phys_to_target_node() and memory_add_physaddr_to_nid() to not require weak symbols. Move to the common arch override design-pattern of an asm header defining a symbol to replace the default implementation. The only common header that all memory_add_physaddr_to_nid() producing architectures implement is asm/sparsemem.h. In fact, powerpc already defines its memory_add_physaddr_to_nid() helper in sparsemem.h. Double-down on that observation and define phys_to_target_node() where necessary in asm/sparsemem.h. An alternate consideration that was discarded was to put this override in asm/numa.h, but that entangles with the definition of MAX_NUMNODES relative to the inclusion of linux/nodemask.h, and requires powerpc to grow a new header. The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid now that the symbol is properly exported / stubbed in all combinations of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG. [dan.j.williams@intel.com: v4] Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com [dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning] Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: a035b6bf863e ("mm/memory_hotplug: introduce default phys_to_target_node() implementation") Reported-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Randy Dunlap <rdunlap@infradead.org> Tested-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Link: https://lkml.kernel.org/r/160447639846.1133764.7044090803980177548.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-29x86/head/64: Check SEV encryption before switching to kernel page-tableJoerg Roedel1-0/+1
When SEV is enabled, the kernel requests the C-bit position again from the hypervisor to build its own page-table. Since the hypervisor is an untrusted source, the C-bit position needs to be verified before the kernel page-table is used. Call sev_verify_cbit() before writing the CR3. [ bp: Massage. ] Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lkml.kernel.org/r/20201028164659.27002-5-joro@8bytes.org
2020-10-26treewide: Convert macro and uses of __section(foo) to __section("foo")Joe Perches2-4/+4
Use a more generic form for __section that requires quotes to avoid complications with clang and gcc differences. Remove the quote operator # from compiler_attributes.h __section macro. Convert all unquoted __section(foo) uses to quoted __section("foo"). Also convert __attribute__((section("foo"))) uses to __section("foo") even if the __attribute__ has multiple list entry forms. Conversion done using the script at: https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com> Reviewed-by: Miguel Ojeda <ojeda@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-23Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-5/+8
Pull KVM updates from Paolo Bonzini: "For x86, there is a new alternative and (in the future) more scalable implementation of extended page tables that does not need a reverse map from guest physical addresses to host physical addresses. For now it is disabled by default because it is still lacking a few of the existing MMU's bells and whistles. However it is a very solid piece of work and it is already available for people to hammer on it. Other updates: ARM: - New page table code for both hypervisor and guest stage-2 - Introduction of a new EL2-private host context - Allow EL2 to have its own private per-CPU variables - Support of PMU event filtering - Complete rework of the Spectre mitigation PPC: - Fix for running nested guests with in-kernel IRQ chip - Fix race condition causing occasional host hard lockup - Minor cleanups and bugfixes x86: - allow trapping unknown MSRs to userspace - allow userspace to force #GP on specific MSRs - INVPCID support on AMD - nested AMD cleanup, on demand allocation of nested SVM state - hide PV MSRs and hypercalls for features not enabled in CPUID - new test for MSR_IA32_TSC writes from host and guest - cleanups: MMU, CPUID, shared MSRs - LAPIC latency optimizations ad bugfixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (232 commits) kvm: x86/mmu: NX largepage recovery for TDP MMU kvm: x86/mmu: Don't clear write flooding count for direct roots kvm: x86/mmu: Support MMIO in the TDP MMU kvm: x86/mmu: Support write protection for nesting in tdp MMU kvm: x86/mmu: Support disabling dirty logging for the tdp MMU kvm: x86/mmu: Support dirty logging for the TDP MMU kvm: x86/mmu: Support changed pte notifier in tdp MMU kvm: x86/mmu: Add access tracking for tdp_mmu kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU kvm: x86/mmu: Add TDP MMU PF handler kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg kvm: x86/mmu: Support zapping SPTEs in the TDP MMU KVM: Cache as_id in kvm_memory_slot kvm: x86/mmu: Add functions to handle changed TDP SPTEs kvm: x86/mmu: Allocate and free TDP MMU roots kvm: x86/mmu: Init / Uninit the TDP MMU kvm: x86/mmu: Introduce tdp_iter KVM: mmu: extract spte.h and spte.c KVM: mmu: Separate updating a PTE from kvm_set_pte_rmapp ...
2020-10-22x86/kvm: Update the comment about asynchronous page fault in exc_page_fault()Vitaly Kuznetsov1-5/+8
KVM was switched to interrupt-based mechanism for 'page ready' event delivery in Linux-5.8 (see commit 2635b5c4a0e4 ("KVM: x86: interrupt based APF 'page ready' event delivery")) and #PF (ab)use for 'page ready' event delivery was removed. Linux guest switched to this new mechanism exclusively in 5.9 (see commit b1d405751cd5 ("KVM: x86: Switch KVM guest to using interrupts for page ready APF delivery")) so it is not possible to get #PF for a 'page ready' event even when the guest is running on top of an older KVM (APF mechanism won't be enabled). Update the comment in exc_page_fault() to reflect the new reality. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20201002154313.1505327-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-14Merge tag 'acpi-5.10-rc1' of ↵Linus Torvalds1-0/+21
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI updates from Rafael Wysocki: "These add support for generic initiator-only proximity domains to the ACPI NUMA code and the architectures using it, clean up some non-ACPICA code referring to debug facilities from ACPICA, reduce the overhead related to accessing GPE registers, add a new DPTF (Dynamic Power and Thermal Framework) participant driver, update the ACPICA code in the kernel to upstream revision 20200925, add a new ACPI backlight whitelist entry, fix a few assorted issues and clean up some code. Specifics: - Add support for generic initiator-only proximity domains to the ACPI NUMA code and the architectures using it (Jonathan Cameron) - Clean up some non-ACPICA code referring to debug facilities from ACPICA that are not actually used in there (Hanjun Guo) - Add new DPTF driver for the PCH FIVR participant (Srinivas Pandruvada) - Reduce overhead related to accessing GPE registers in ACPICA and the OS interface layer and make it possible to access GPE registers using logical addresses if they are memory-mapped (Rafael Wysocki) - Update the ACPICA code in the kernel to upstream revision 20200925 including changes as follows: + Add predefined names from the SMBus sepcification (Bob Moore) + Update acpi_help UUID list (Bob Moore) + Return exceptions for string-to-integer conversions in iASL (Bob Moore) + Add a new "ALL <NameSeg>" debugger command (Bob Moore) + Add support for 64 bit risc-v compilation (Colin Ian King) + Do assorted cleanups (Bob Moore, Colin Ian King, Randy Dunlap) - Add new ACPI backlight whitelist entry for HP 635 Notebook (Alex Hung) - Move TPS68470 OpRegion driver to drivers/acpi/pmic/ and split out Kconfig and Makefile specific for ACPI PMIC (Andy Shevchenko) - Clean up the ACPI SoC driver for AMD SoCs (Hanjun Guo) - Add missing config_item_put() to fix refcount leak (Hanjun Guo) - Drop lefrover field from struct acpi_memory_device (Hanjun Guo) - Make the ACPI extlog driver check for RDMSR failures (Ben Hutchings) - Fix handling of lid state changes in the ACPI button driver when input device is closed (Dmitry Torokhov) - Fix several assorted build issues (Barnabás Pőcze, John Garry, Nathan Chancellor, Tian Tao) - Drop unused inline functions and reduce code duplication by using kobj_to_dev() in the NFIT parsing code (YueHaibing, Wang Qing) - Serialize tools/power/acpi Makefile (Thomas Renninger)" * tag 'acpi-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (64 commits) ACPICA: Update version to 20200925 Version 20200925 ACPICA: Remove unnecessary semicolon ACPICA: Debugger: Add a new command: "ALL <NameSeg>" ACPICA: iASL: Return exceptions for string-to-integer conversions ACPICA: acpi_help: Update UUID list ACPICA: Add predefined names found in the SMBus sepcification ACPICA: Tree-wide: fix various typos and spelling mistakes ACPICA: Drop the repeated word "an" in a comment ACPICA: Add support for 64 bit risc-v compilation ACPI: button: fix handling lid state changes when input device closed tools/power/acpi: Serialize Makefile ACPI: scan: Replace ACPI_DEBUG_PRINT() with pr_debug() ACPI: memhotplug: Remove 'state' from struct acpi_memory_device ACPI / extlog: Check for RDMSR failure ACPI: Make acpi_evaluate_dsm() prototype consistent docs: mm: numaperf.rst Add brief description for access class 1. node: Add access1 class to represent CPU to memory characteristics ACPI: HMAT: Fix handling of changes from ACPI 6.2 to ACPI 6.3 ACPI: Let ACPI know we support Generic Initiator Affinity Structures x86: Support Generic Initiator only proximity domains ...
2020-10-14Merge tag 'x86_seves_for_v5.10' of ↵Linus Torvalds4-5/+40
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SEV-ES support from Borislav Petkov: "SEV-ES enhances the current guest memory encryption support called SEV by also encrypting the guest register state, making the registers inaccessible to the hypervisor by en-/decrypting them on world switches. Thus, it adds additional protection to Linux guests against exfiltration, control flow and rollback attacks. With SEV-ES, the guest is in full control of what registers the hypervisor can access. This is provided by a guest-host exchange mechanism based on a new exception vector called VMM Communication Exception (#VC), a new instruction called VMGEXIT and a shared Guest-Host Communication Block which is a decrypted page shared between the guest and the hypervisor. Intercepts to the hypervisor become #VC exceptions in an SEV-ES guest so in order for that exception mechanism to work, the early x86 init code needed to be made able to handle exceptions, which, in itself, brings a bunch of very nice cleanups and improvements to the early boot code like an early page fault handler, allowing for on-demand building of the identity mapping. With that, !KASLR configurations do not use the EFI page table anymore but switch to a kernel-controlled one. The main part of this series adds the support for that new exchange mechanism. The goal has been to keep this as much as possibly separate from the core x86 code by concentrating the machinery in two SEV-ES-specific files: arch/x86/kernel/sev-es-shared.c arch/x86/kernel/sev-es.c Other interaction with core x86 code has been kept at minimum and behind static keys to minimize the performance impact on !SEV-ES setups. Work by Joerg Roedel and Thomas Lendacky and others" * tag 'x86_seves_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (73 commits) x86/sev-es: Use GHCB accessor for setting the MMIO scratch buffer x86/sev-es: Check required CPU features for SEV-ES x86/efi: Add GHCB mappings when SEV-ES is active x86/sev-es: Handle NMI State x86/sev-es: Support CPU offline/online x86/head/64: Don't call verify_cpu() on starting APs x86/smpboot: Load TSS and getcpu GDT entry before loading IDT x86/realmode: Setup AP jump table x86/realmode: Add SEV-ES specific trampoline entry point x86/vmware: Add VMware-specific handling for VMMCALL under SEV-ES x86/kvm: Add KVM-specific VMMCALL handling under SEV-ES x86/paravirt: Allow hypervisor-specific VMMCALL handling under SEV-ES x86/sev-es: Handle #DB Events x86/sev-es: Handle #AC Events x86/sev-es: Handle VMMCALL Events x86/sev-es: Handle MWAIT/MWAITX Events x86/sev-es: Handle MONITOR/MONITORX Events x86/sev-es: Handle INVD Events x86/sev-es: Handle RDPMC Events x86/sev-es: Handle RDTSC(P) Events ...
2020-10-14memblock: use separate iterators for memory and reserved regionsMike Rapoport1-1/+1
for_each_memblock() is used to iterate over memblock.memory in a few places that use data from memblock_region rather than the memory ranges. Introduce separate for_each_mem_region() and for_each_reserved_mem_region() to improve encapsulation of memblock internals from its users. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> [x86] Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> [MIPS] Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> [.clang-format] Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Axtens <dja@axtens.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Emil Renner Berthing <kernel@esmil.dk> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: https://lkml.kernel.org/r/20200818151634.14343-18-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-14mm/memory_hotplug: introduce default phys_to_target_node() implementationDan Williams1-1/+0
In preparation to set a fallback value for dev_dax->target_node, introduce generic fallback helpers for phys_to_target_node() A generic implementation based on node-data or memblock was proposed, but as noted by Mike: "Here again, I would prefer to add a weak default for phys_to_target_node() because the "generic" implementation is not really generic. The fallback to reserved ranges is x86 specfic because on x86 most of the reserved areas is not in memblock.memory. AFAIK, no other architecture does this." The info message in the generic memory_add_physaddr_to_nid() implementation is fixed up to properly reflect that memory_add_physaddr_to_nid() communicates "online" node info and phys_to_target_node() indicates "target / to-be-onlined" node info. [akpm@linux-foundation.org: fix CONFIG_MEMORY_HOTPLUG=n build] Link: https://lkml.kernel.org/r/202008252130.7YrHIyMI%25lkp@intel.com Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Jia He <justin.he@arm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brice Goglin <Brice.Goglin@inria.fr> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Airlie <airlied@linux.ie> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Will Deacon <will@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Hulk Robot <hulkci@huawei.com> Cc: Jason Yan <yanaijie@huawei.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: kernel test robot <lkp@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Link: https://lkml.kernel.org/r/159643097768.4062302.3135192588966888630.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-14x86/numa: add 'nohmat' optionDan Williams1-0/+2
Disable parsing of the HMAT for debug, to workaround broken platform instances, or cases where it is otherwise not wanted. [rdunlap@infradead.org: fix build when CONFIG_ACPI is not set] Link: https://lkml.kernel.org/r/70e5ee34-9809-a997-7b49-499e4be61307@infradead.org Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Brice Goglin <Brice.Goglin@inria.fr> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Airlie <airlied@linux.ie> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Will Deacon <will@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Hulk Robot <hulkci@huawei.com> Cc: Jason Yan <yanaijie@huawei.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: kernel test robot <lkp@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Link: https://lkml.kernel.org/r/159643095540.4062302.732962081968036212.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-14x86/numa: cleanup configuration dependent command-line optionsDan Williams2-7/+4
Patch series "device-dax: Support sub-dividing soft-reserved ranges", v5. The device-dax facility allows an address range to be directly mapped through a chardev, or optionally hotplugged to the core kernel page allocator as System-RAM. It is the mechanism for converting persistent memory (pmem) to be used as another volatile memory pool i.e. the current Memory Tiering hot topic on linux-mm. In the case of pmem the nvdimm-namespace-label mechanism can sub-divide it, but that labeling mechanism is not available / applicable to soft-reserved ("EFI specific purpose") memory [3]. This series provides a sysfs-mechanism for the daxctl utility to enable provisioning of volatile-soft-reserved memory ranges. The motivations for this facility are: 1/ Allow performance differentiated memory ranges to be split between kernel-managed and directly-accessed use cases. 2/ Allow physical memory to be provisioned along performance relevant address boundaries. For example, divide a memory-side cache [4] along cache-color boundaries. 3/ Parcel out soft-reserved memory to VMs using device-dax as a security / permissions boundary [5]. Specifically I have seen people (ab)using memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the device-dax interface on custom address ranges. A follow-on for the VM use case is to teach device-dax to dynamically allocate 'struct page' at runtime to reduce the duplication of 'struct page' space in both the guest and the host kernel for the same physical pages. [2]: http://lore.kernel.org/r/20200713160837.13774-11-joao.m.martins@oracle.com [3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@dwillia2-desk3.amr.corp.intel.com [4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com [5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com This patch (of 23): In preparation for adding a new numa= option clean up the existing ones to avoid ifdefs in numa_setup(), and provide feedback when the option is numa=fake= option is invalid due to kernel config. The same does not need to be done for numa=noacpi, since the capability is already hard disabled at compile-time. Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brice Goglin <Brice.Goglin@inria.fr> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: David Airlie <airlied@linux.ie> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Wei Yang <richard.weiyang@linux.alibaba.com> Cc: Will Deacon <will@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Hulk Robot <hulkci@huawei.com> Cc: Jason Yan <yanaijie@huawei.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: kernel test robot <lkp@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Link: https://lkml.kernel.org/r/160106109960.30709.7379926726669669398.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/159643094279.4062302.17779410714418721328.stgit@dwillia2-desk3.amr.corp.intel.com Link: https://lkml.kernel.org/r/159643094925.4062302.14979872973043772305.stgit@dwillia2-desk3.amr.corp.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-13Merge tag 'x86-mm-2020-10-12' of ↵Linus Torvalds1-10/+10
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 mm updates from Ingo Molnar: "Do not sync vmalloc/ioremap mappings on x86-64 kernels. Hopefully now without the bugs!" * tag 'x86-mm-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm/64: Update comment in preallocate_vmalloc_pages() x86/mm/64: Do not sync vmalloc/ioremap mappings
2020-10-12Merge tag 'x86_platform_for_v5.10' of ↵Linus Torvalds1-24/+0
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 platform updates from Borislav Petkov: - Cleanup different aspects of the UV code and start adding support for the new UV5 class of systems (Mike Travis) - Use a flexible array for a dynamically sized struct uv_rtc_timer_head (Gustavo A. R. Silva) * tag 'x86_platform_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/platform/uv: Update Copyrights to conform to HPE standards x86/platform/uv: Update for UV5 NMI MMR changes x86/platform/uv: Update UV5 TSC checking x86/platform/uv: Update node present counting x86/platform/uv: Update UV5 MMR references in UV GRU x86/platform/uv: Adjust GAM MMR references affected by UV5 updates x86/platform/uv: Update MMIOH references based on new UV5 MMRs x86/platform/uv: Add and decode Arch Type in UVsystab x86/platform/uv: Add UV5 direct references x86/platform/uv: Update UV MMRs for UV5 drivers/misc/sgi-xp: Adjust references in UV kernel modules x86/platform/uv: Remove SCIR MMR references for UV systems x86/platform/uv: Remove UV BAU TLB Shootdown Handler x86/uv/time: Use a flexible array in struct uv_rtc_timer_head
2020-10-12Merge tag 'x86_cpu_for_v5.10' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu updates from Borislav Petkov: - Add support for hardware-enforced cache coherency on AMD which obviates the need to flush cachelines before changing the PTE encryption bit (Krish Sadhukhan) - Add Centaur initialization support for families >= 7 (Tony W Wang-oc) - Add a feature flag for, and expose TSX suspend load tracking feature to KVM (Cathy Zhang) - Emulate SLDT and STR so that windows programs don't crash on UMIP machines (Brendan Shanks and Ricardo Neri) - Use the new SERIALIZE insn on Intel hardware which supports it (Ricardo Neri) - Misc cleanups and fixes * tag 'x86_cpu_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: KVM: SVM: Don't flush cache if hardware enforces cache coherency across encryption domains x86/mm/pat: Don't flush cache if hardware enforces cache coherency across encryption domnains x86/cpu: Add hardware-enforced cache coherency as a CPUID feature x86/cpu/centaur: Add Centaur family >=7 CPUs initialization support x86/cpu/centaur: Replace two-condition switch-case with an if statement x86/kvm: Expose TSX Suspend Load Tracking feature x86/cpufeatures: Enumerate TSX suspend load address tracking instructions x86/umip: Add emulation/spoofing for SLDT and STR instructions x86/cpu: Fix typos and improve the comments in sync_core() x86/cpu: Use XGETBV and XSETBV mnemonics in fpu/internal.h x86/cpu: Use SERIALIZE in sync_core() when available
2020-10-12Merge tag 'ras_updates_for_v5.10' of ↵Linus Torvalds2-5/+21
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Extend the recovery from MCE in kernel space also to processes which encounter an MCE in kernel space but while copying from user memory by sending them a SIGBUS on return to user space and umapping the faulty memory, by Tony Luck and Youquan Song. - memcpy_mcsafe() rework by splitting the functionality into copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables support for new hardware which can recover from a machine check encountered during a fast string copy and makes that the default and lets the older hardware which does not support that advance recovery, opt in to use the old, fragile, slow variant, by Dan Williams. - New AMD hw enablement, by Yazen Ghannam and Akshay Gupta. - Do not use MSR-tracing accessors in #MC context and flag any fault while accessing MCA architectural MSRs as an architectural violation with the hope that such hw/fw misdesigns are caught early during the hw eval phase and they don't make it into production. - Misc fixes, improvements and cleanups, as always. * tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Allow for copy_mc_fragile symbol checksum to be generated x86/mce: Decode a kernel instruction to determine if it is copying from user x86/mce: Recover from poison found while copying from user space x86/mce: Avoid tail copy when machine check terminated a copy from user x86/mce: Add _ASM_EXTABLE_CPY for copy user access x86/mce: Provide method to find out the type of an exception handler x86/mce: Pass pointer to saved pt_regs to severity calculation routines x86/copy_mc: Introduce copy_mc_enhanced_fast_string() x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}() x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list x86/mce: Add Skylake quirk for patrol scrub reported errors RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE() x86/mce: Annotate mce_rd/wrmsrl() with noinstr x86/mce/dev-mcelog: Do not update kflags on AMD systems x86/mce: Stop mce_reign() from re-computing severity for every CPU x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR x86/mce: Increase maximum number of banks to 64 x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check() x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap RAS/CEC: Fix cec_init() prototype