summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/cpu
AgeCommit message (Collapse)AuthorFilesLines
2022-04-20x86/tsx: Disable TSX development mode at bootPawan Gupta4-14/+51
commit 400331f8ffa3bec5c561417e5eec6848464e9160 upstream. A microcode update on some Intel processors causes all TSX transactions to always abort by default[*]. Microcode also added functionality to re-enable TSX for development purposes. With this microcode loaded, if tsx=on was passed on the cmdline, and TSX development mode was already enabled before the kernel boot, it may make the system vulnerable to TSX Asynchronous Abort (TAA). To be on safer side, unconditionally disable TSX development mode during boot. If a viable use case appears, this can be revisited later. [*]: Intel TSX Disable Update for Selected Processors, doc ID: 643557 [ bp: Drop unstable web link, massage heavily. ] Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/347bd844da3a333a9793c6687d4e4eb3b2419a3e.1646943780.git.pawan.kumar.gupta@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-20x86/tsx: Use MSR_TSX_CTRL to clear CPUID bitsPawan Gupta2-7/+48
commit 258f3b8c3210b03386e4ad92b4bd8652b5c1beb3 upstream. tsx_clear_cpuid() uses MSR_TSX_FORCE_ABORT to clear CPUID.RTM and CPUID.HLE. Not all CPUs support MSR_TSX_FORCE_ABORT, alternatively use MSR_IA32_TSX_CTRL when supported. [ bp: Document how and why TSX gets disabled. ] Fixes: 293649307ef9 ("x86/tsx: Clear CPUID bits when TSX always force aborts") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Neelima Krishnan <neelima.krishnan@intel.com> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/5b323e77e251a9c8bcdda498c5cc0095be1e1d3c.1646943780.git.pawan.kumar.gupta@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-16x86/sgx: Free backing memory after faulting the enclave pageJarkko Sakkinen1-9/+48
commit 08999b2489b4c9b939d7483dbd03702ee4576d96 upstream. There is a limited amount of SGX memory (EPC) on each system. When that memory is used up, SGX has its own swapping mechanism which is similar in concept but totally separate from the core mm/* code. Instead of swapping to disk, SGX swaps from EPC to normal RAM. That normal RAM comes from a shared memory pseudo-file and can itself be swapped by the core mm code. There is a hierarchy like this: EPC <-> shmem <-> disk After data is swapped back in from shmem to EPC, the shmem backing storage needs to be freed. Currently, the backing shmem is not freed. This effectively wastes the shmem while the enclave is running. The memory is recovered when the enclave is destroyed and the backing storage freed. Sort this out by freeing memory with shmem_truncate_range(), as soon as a page is faulted back to the EPC. In addition, free the memory for PCMD pages as soon as all PCMD's in a page have been marked as unused by zeroing its contents. Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220303223859.273187-1-jarkko@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Warn about eIBRS + LFENCE + Unprivileged eBPF + SMTJosh Poimboeuf1-2/+25
commit 0de05d056afdb00eca8c7bbb0c79a3438daf700c upstream. The commit 44a3918c8245 ("x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation reporting") added a warning for the "eIBRS + unprivileged eBPF" combination, which has been shown to be vulnerable against Spectre v2 BHB-based attacks. However, there's no warning about the "eIBRS + LFENCE retpoline + unprivileged eBPF" combo. The LFENCE adds more protection by shortening the speculation window after a mispredicted branch. That makes an attack significantly more difficult, even with unprivileged eBPF. So at least for now the logic doesn't warn about that combination. But if you then add SMT into the mix, the SMT attack angle weakens the effectiveness of the LFENCE considerably. So extend the "eIBRS + unprivileged eBPF" warning to also include the "eIBRS + LFENCE + unprivileged eBPF + SMT" case. [ bp: Massage commit message. ] Suggested-by: Alyssa Milburn <alyssa.milburn@linux.intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Warn about Spectre v2 LFENCE mitigationJosh Poimboeuf1-0/+5
commit eafd987d4a82c7bb5aa12f0e3b4f8f3dea93e678 upstream. With: f8a66d608a3e ("x86,bugs: Unconditionally allow spectre_v2=retpoline,amd") it became possible to enable the LFENCE "retpoline" on Intel. However, Intel doesn't recommend it, as it has some weaknesses compared to retpoline. Now AMD doesn't recommend it either. It can still be left available as a cmdline option. It's faster than retpoline but is weaker in certain scenarios -- particularly SMT, but even non-SMT may be vulnerable in some cases. So just unconditionally warn if the user requests it on the cmdline. [ bp: Massage commit message. ] Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Use generic retpoline by default on AMDKim Phillips1-9/+0
commit 244d00b5dd4755f8df892c86cab35fb2cfd4f14b upstream. AMD retpoline may be susceptible to speculation. The speculation execution window for an incorrect indirect branch prediction using LFENCE/JMP sequence may potentially be large enough to allow exploitation using Spectre V2. By default, don't use retpoline,lfence on AMD. Instead, use the generic retpoline. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Include unprivileged eBPF status in Spectre v2 mitigation ↵Josh Poimboeuf1-6/+29
reporting commit 44a3918c8245ab10c6c9719dd12e7a8d291980d8 upstream. With unprivileged eBPF enabled, eIBRS (without retpoline) is vulnerable to Spectre v2 BHB-based attacks. When both are enabled, print a warning message and report it in the 'spectre_v2' sysfs vulnerabilities file. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> [fllinden@amazon.com: backported to 5.15] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Add eIBRS + Retpoline optionsPeter Zijlstra1-37/+96
commit 1e19da8522c81bf46b335f84137165741e0d82b7 upstream. Thanks to the chaps at VUsec it is now clear that eIBRS is not sufficient, therefore allow enabling of retpolines along with eIBRS. Add spectre_v2=eibrs, spectre_v2=eibrs,lfence and spectre_v2=eibrs,retpoline options to explicitly pick your preferred means of mitigation. Since there's new mitigations there's also user visible changes in /sys/devices/system/cpu/vulnerabilities/spectre_v2 to reflect these new mitigations. [ bp: Massage commit message, trim error messages, do more precise eIBRS mode checking. ] Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Patrick Colp <patrick.colp@oracle.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCEPeter Zijlstra (Intel)1-11/+18
commit d45476d9832409371537013ebdd8dc1a7781f97a upstream. The RETPOLINE_AMD name is unfortunate since it isn't necessarily AMD only, in fact Hygon also uses it. Furthermore it will likely be sufficient for some Intel processors. Therefore rename the thing to RETPOLINE_LFENCE to better describe what it is. Add the spectre_v2=retpoline,lfence option as an alias to spectre_v2=retpoline,amd to preserve existing setups. However, the output of /sys/devices/system/cpu/vulnerabilities/spectre_v2 will be changed. [ bp: Fix typos, massage. ] Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> [fllinden@amazon.com: backported to 5.15] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86,bugs: Unconditionally allow spectre_v2=retpoline,amdPeter Zijlstra1-7/+0
commit f8a66d608a3e471e1202778c2a36cbdc96bae73b upstream. Currently Linux prevents usage of retpoline,amd on !AMD hardware, this is unfriendly and gets in the way of testing. Remove this restriction. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Borislav Petkov <bp@suse.de> Acked-by: Josh Poimboeuf <jpoimboe@redhat.com> Tested-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20211026120310.487348118@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-16x86/sgx: Silence softlockup detection when releasing large enclavesReinette Chatre1-0/+2
commit 8795359e35bc33bf86b6d0765aa7f37431db3b9c upstream. Vijay reported that the "unclobbered_vdso_oversubscribed" selftest triggers the softlockup detector. Actual SGX systems have 128GB of enclave memory or more. The "unclobbered_vdso_oversubscribed" selftest creates one enclave which consumes all of the enclave memory on the system. Tearing down such a large enclave takes around a minute, most of it in the loop where the EREMOVE instruction is applied to each individual 4k enclave page. Spending one minute in a loop triggers the softlockup detector. Add a cond_resched() to give other tasks a chance to run and placate the softlockup detector. Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Vijay Dhanraj <vijay.dhanraj@intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Jarkko Sakkinen <jarkko@kernel.org> (kselftest as sanity check) Link: https://lkml.kernel.org/r/ced01cac1e75f900251b0a4ae1150aa8ebd295ec.1644345232.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01x86/cpu: Add Xeon Icelake-D to list of CPUs that support PPINTony Luck1-0/+1
commit e464121f2d40eabc7d11823fb26db807ce945df4 upstream. Missed adding the Icelake-D CPU to the list. It uses the same MSRs to control and read the inventory number as all the other models. Fixes: dc6b025de95b ("x86/mce: Add Xeon Icelake to list of CPUs that support PPIN") Reported-by: Ailin Xu <ailin.xu@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220121174743.1875294-2-tony.luck@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01x86/MCE/AMD: Allow thresholding interface updates after initYazen Ghannam1-1/+1
commit 1f52b0aba6fd37653416375cb8a1ca673acf8d5f upstream. Changes to the AMD Thresholding sysfs code prevents sysfs writes from updating the underlying registers once CPU init is completed, i.e. "threshold_banks" is set. Allow the registers to be updated if the thresholding interface is already initialized or if in the init path. Use the "set_lvt_off" value to indicate if running in the init path, since this value is only set during init. Fixes: a037f3ca0ea0 ("x86/mce/amd: Make threshold bank setting hotplug robust") Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220117161328.19148-1-yazen.ghannam@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27x86/mce: Mark mce_read_aux() noinstrBorislav Petkov1-1/+1
[ Upstream commit db6c996d6ce45dfb44891f0824a65ecec216f47a ] Fixes vmlinux.o: warning: objtool: do_machine_check()+0x681: call to mce_read_aux() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-10-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/mce: Mark mce_end() noinstrBorislav Petkov1-3/+11
[ Upstream commit b4813539d37fa31fed62cdfab7bd2dd8929c5b2e ] It is called by the #MC handler which is noinstr. Fixes vmlinux.o: warning: objtool: do_machine_check()+0xbd6: call to memset() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-9-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/mce: Mark mce_panic() noinstrBorislav Petkov1-3/+12
[ Upstream commit 3c7ce80a818fa7950be123cac80cd078e5ac1013 ] And allow instrumentation inside it because it does calls to other facilities which will not be tagged noinstr. Fixes vmlinux.o: warning: objtool: do_machine_check()+0xc73: call to mce_panic() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-8-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/mce: Allow instrumentation during task work queueingBorislav Petkov1-0/+11
[ Upstream commit 4fbce464db81a42f9a57ee242d6150ec7f996415 ] Fixes vmlinux.o: warning: objtool: do_machine_check()+0xdb1: call to queue_task_work() leaves .noinstr.text section Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20211208111343.8130-6-bp@alien8.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-01-27x86/mce/inject: Avoid out-of-bounds write when setting flagsZhang Zixun1-1/+1
[ Upstream commit de768416b203ac84e02a757b782a32efb388476f ] A contrived zero-length write, for example, by using write(2): ... ret = write(fd, str, 0); ... to the "flags" file causes: BUG: KASAN: stack-out-of-bounds in flags_write Write of size 1 at addr ffff888019be7ddf by task writefile/3787 CPU: 4 PID: 3787 Comm: writefile Not tainted 5.16.0-rc7+ #12 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 due to accessing buf one char before its start. Prevent such out-of-bounds access. [ bp: Productize into a proper patch. Link below is the next best thing because the original mail didn't get archived on lore. ] Fixes: 0451d14d0561 ("EDAC, mce_amd_inj: Modify flags attribute to use string arguments") Signed-off-by: Zhang Zixun <zhang133010@icloud.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/linux-edac/YcnePfF1OOqoQwrX@zn.tnic/ Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-08x86/hyperv: Move required MSRs check to initial platform probingSean Christopherson1-5/+15
[ Upstream commit f3e613e72f66226b3bea1046c1b864f67a3000a4 ] Explicitly check for MSR_HYPERCALL and MSR_VP_INDEX support when probing for running as a Hyper-V guest instead of waiting until hyperv_init() to detect the bogus configuration. Add messages to give the admin a heads up that they are likely running on a broken virtual machine setup. At best, silently disabling Hyper-V is confusing and difficult to debug, e.g. the kernel _says_ it's using all these fancy Hyper-V features, but always falls back to the native versions. At worst, the half baked setup will crash/hang the kernel. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20211104182239.1302956-3-seanjc@google.com Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-11-25x86/sgx: Fix free page accountingReinette Chatre1-6/+6
commit ac5d272a0ad0419f52e08c91953356e32b075af7 upstream. The SGX driver maintains a single global free page counter, sgx_nr_free_pages, that reflects the number of free pages available across all NUMA nodes. Correspondingly, a list of free pages is associated with each NUMA node and sgx_nr_free_pages is updated every time a page is added or removed from any of the free page lists. The main usage of sgx_nr_free_pages is by the reclaimer that runs when it (sgx_nr_free_pages) goes below a watermark to ensure that there are always some free pages available to, for example, support efficient page faults. With sgx_nr_free_pages accessed and modified from a few places it is essential to ensure that these accesses are done safely but this is not the case. sgx_nr_free_pages is read without any protection and updated with inconsistent protection by any one of the spin locks associated with the individual NUMA nodes. For example: CPU_A CPU_B ----- ----- spin_lock(&nodeA->lock); spin_lock(&nodeB->lock); ... ... sgx_nr_free_pages--; /* NOT SAFE */ sgx_nr_free_pages--; spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock); Since sgx_nr_free_pages may be protected by different spin locks while being modified from different CPUs, the following scenario is possible: CPU_A CPU_B ----- ----- {sgx_nr_free_pages = 100} spin_lock(&nodeA->lock); spin_lock(&nodeB->lock); sgx_nr_free_pages--; sgx_nr_free_pages--; /* LOAD sgx_nr_free_pages = 100 */ /* LOAD sgx_nr_free_pages = 100 */ /* sgx_nr_free_pages-- */ /* sgx_nr_free_pages-- */ /* STORE sgx_nr_free_pages = 99 */ /* STORE sgx_nr_free_pages = 99 */ spin_unlock(&nodeA->lock); spin_unlock(&nodeB->lock); In the above scenario, sgx_nr_free_pages is decremented from two CPUs but instead of sgx_nr_free_pages ending with a value that is two less than it started with, it was only decremented by one while the number of free pages were actually reduced by two. The consequence of sgx_nr_free_pages not being protected is that its value may not accurately reflect the actual number of free pages on the system, impacting the availability of free pages in support of many flows. The problematic scenario is when the reclaimer does not run because it believes there to be sufficient free pages while any attempt to allocate a page fails because there are no free pages available. In the SGX driver the reclaimer's watermark is only 32 pages so after encountering the above example scenario 32 times a user space hang is possible when there are no more free pages because of repeated page faults caused by no free pages made available. The following flow was encountered: asm_exc_page_fault ... sgx_vma_fault() sgx_encl_load_page() sgx_encl_eldu() // Encrypted page needs to be loaded from backing // storage into newly allocated SGX memory page sgx_alloc_epc_page() // Allocate a page of SGX memory __sgx_alloc_epc_page() // Fails, no free SGX memory ... if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer wake_up(&ksgxd_waitq); return -EBUSY; // Return -EBUSY giving reclaimer time to run return -EBUSY; return -EBUSY; return VM_FAULT_NOPAGE; The reclaimer is triggered in above flow with the following code: static bool sgx_should_reclaim(unsigned long watermark) { return sgx_nr_free_pages < watermark && !list_empty(&sgx_active_page_list); } In the problematic scenario there were no free pages available yet the value of sgx_nr_free_pages was above the watermark. The allocation of SGX memory thus always failed because of a lack of free pages while no free pages were made available because the reclaimer is never started because of sgx_nr_free_pages' incorrect value. The consequence was that user space kept encountering VM_FAULT_NOPAGE that caused the same address to be accessed repeatedly with the same result. Change the global free page counter to an atomic type that ensures simultaneous updates are done safely. While doing so, move the updating of the variable outside of the spin lock critical section to which it does not belong. Cc: stable@vger.kernel.org Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_alloc_epc_page()") Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Link: https://lkml.kernel.org/r/a95a40743bbd3f795b465f30922dde7f1ea9e0eb.1637004094.git.reinette.chatre@intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/mce: Add errata workaround for Skylake SKX37Dave Jones1-2/+3
commit e629fc1407a63dbb748f828f9814463ffc2a0af0 upstream. Errata SKX37 is word-for-word identical to the other errata listed in this workaround. I happened to notice this after investigating a CMCI storm on a Skylake host. While I can't confirm this was the root cause, spurious corrected errors does sound like a likely suspect. Fixes: 2976908e4198 ("x86/mce: Do not log spurious corrected mce errors") Signed-off-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20211029205759.GA7385@codemonkey.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18x86/cpu: Fix migration safety with X86_BUG_NULL_SELJane Malalane4-7/+42
commit 415de44076640483648d6c0f6d645a9ee61328ad upstream. Currently, Linux probes for X86_BUG_NULL_SEL unconditionally which makes it unsafe to migrate in a virtualised environment as the properties across the migration pool might differ. To be specific, the case which goes wrong is: 1. Zen1 (or earlier) and Zen2 (or later) in a migration pool 2. Linux boots on Zen2, probes and finds the absence of X86_BUG_NULL_SEL 3. Linux is then migrated to Zen1 Linux is now running on a X86_BUG_NULL_SEL-impacted CPU while believing that the bug is fixed. The only way to address the problem is to fully trust the "no longer affected" CPUID bit when virtualised, because in the above case it would be clear deliberately to indicate the fact "you might migrate to somewhere which has this behaviour". Zen3 adds the NullSelectorClearsBase CPUID bit to indicate that loading a NULL segment selector zeroes the base and limit fields, as well as just attributes. Zen2 also has this behaviour but doesn't have the NSCB bit. [ bp: Minor touchups. ] Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Signed-off-by: Borislav Petkov <bp@suse.de> CC: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20211021104744.24126-1-jane.malalane@citrix.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-10-06x86/entry: Clear X86_FEATURE_SMAP when CONFIG_X86_SMAP=nVegard Nossum1-0/+1
Commit 3c73b81a9164 ("x86/entry, selftests: Further improve user entry sanity checks") added a warning if AC is set when in the kernel. Commit 662a0221893a3d ("x86/entry: Fix AC assertion") changed the warning to only fire if the CPU supports SMAP. However, the warning can still trigger on a machine that supports SMAP but where it's disabled in the kernel config and when running the syscall_nt selftest, for example: ------------[ cut here ]------------ WARNING: CPU: 0 PID: 49 at irqentry_enter_from_user_mode CPU: 0 PID: 49 Comm: init Tainted: G T 5.15.0-rc4+ #98 e6202628ee053b4f310759978284bd8bb0ce6905 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:irqentry_enter_from_user_mode ... Call Trace: ? irqentry_enter ? exc_general_protection ? asm_exc_general_protection ? asm_exc_general_protectio IS_ENABLED(CONFIG_X86_SMAP) could be added to the warning condition, but even this would not be enough in case SMAP is disabled at boot time with the "nosmap" parameter. To be consistent with "nosmap" behaviour, clear X86_FEATURE_SMAP when !CONFIG_X86_SMAP. Found using entry-fuzz + satrandconfig. [ bp: Massage commit message. ] Fixes: 3c73b81a9164 ("x86/entry, selftests: Further improve user entry sanity checks") Fixes: 662a0221893a ("x86/entry: Fix AC assertion") Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20211003223423.8666-1-vegard.nossum@oracle.com
2021-10-06x86/resctrl: Fix kfree() of the wrong type in domain_add_cpu()James Morse1-2/+2
Commit in Fixes separated the architecture specific and filesystem parts of the resctrl domain structures. This left the error paths in domain_add_cpu() kfree()ing the memory with the wrong type. This will cause a problem if someone adds a new member to struct rdt_hw_domain meaning d_resctrl is no longer the first member. Fixes: 792e0f6f789b ("x86/resctrl: Split struct rdt_domain") Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lkml.kernel.org/r/20210917165924.28254-1-james.morse@arm.com
2021-10-06x86/resctrl: Free the ctrlval arrays when domain_setup_mon_state() failsJames Morse1-0/+2
domain_add_cpu() is called whenever a CPU is brought online. The earlier call to domain_setup_ctrlval() allocates the control value arrays. If domain_setup_mon_state() fails, the control value arrays are not freed. Add the missing kfree() calls. Fixes: 1bd2a63b4f0de ("x86/intel_rdt/mba_sc: Add initialization support") Fixes: edf6fa1c4a951 ("x86/intel_rdt/cqm: Add RMID (Resource monitoring ID) management") Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20210917165958.28313-1-james.morse@arm.com
2021-09-19Merge tag 'x86_urgent_for_v5.15_rc2' of ↵Linus Torvalds1-11/+32
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Prevent a infinite loop in the MCE recovery on return to user space, which was caused by a second MCE queueing work for the same page and thereby creating a circular work list. - Make kern_addr_valid() handle existing PMD entries, which are marked not present in the higher level page table, correctly instead of blindly dereferencing them. - Pass a valid address to sanitize_phys(). This was caused by the mixture of inclusive and exclusive ranges. memtype_reserve() expect 'end' being exclusive, but sanitize_phys() wants it inclusive. This worked so far, but with end being the end of the physical address space the fail is exposed. - Increase the maximum supported GPIO numbers for 64bit. Newer SoCs exceed the previous maximum. * tag 'x86_urgent_for_v5.15_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Avoid infinite loop for copy from user recovery x86/mm: Fix kern_addr_valid() to cope with existing but not present entries x86/platform: Increase maximum GPIO number for X86_64 x86/pat: Pass valid address to sanitize_phys()
2021-09-14x86/mce: Avoid infinite loop for copy from user recoveryTony Luck1-11/+32
There are two cases for machine check recovery: 1) The machine check was triggered by ring3 (application) code. This is the simpler case. The machine check handler simply queues work to be executed on return to user. That code unmaps the page from all users and arranges to send a SIGBUS to the task that triggered the poison. 2) The machine check was triggered in kernel code that is covered by an exception table entry. In this case the machine check handler still queues a work entry to unmap the page, etc. but this will not be called right away because the #MC handler returns to the fix up code address in the exception table entry. Problems occur if the kernel triggers another machine check before the return to user processes the first queued work item. Specifically, the work is queued using the ->mce_kill_me callback structure in the task struct for the current thread. Attempting to queue a second work item using this same callback results in a loop in the linked list of work functions to call. So when the kernel does return to user, it enters an infinite loop processing the same entry for ever. There are some legitimate scenarios where the kernel may take a second machine check before returning to the user. 1) Some code (e.g. futex) first tries a get_user() with page faults disabled. If this fails, the code retries with page faults enabled expecting that this will resolve the page fault. 2) Copy from user code retries a copy in byte-at-time mode to check whether any additional bytes can be copied. On the other side of the fence are some bad drivers that do not check the return value from individual get_user() calls and may access multiple user addresses without noticing that some/all calls have failed. Fix by adding a counter (current->mce_count) to keep track of repeated machine checks before task_work() is called. First machine check saves the address information and calls task_work_add(). Subsequent machine checks before that task_work call back is executed check that the address is in the same page as the first machine check (since the callback will offline exactly one page). Expected worst case is four machine checks before moving on (e.g. one user access with page faults disabled, then a repeat to the same address with page faults enabled ... repeat in copy tail bytes). Just in case there is some code that loops forever enforce a limit of 10. [ bp: Massage commit message, drop noinstr, fix typo, extend panic messages. ] Fixes: 5567d11c21a1 ("x86/mce: Send #MC singal from task work") Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.com
2021-09-11Merge branch 'linus' into smp/urgentThomas Gleixner1-22/+16
Ensure that all usage sites of get/put_online_cpus() except for the struggler in drivers/thermal are gone. So the last user and the deprecated inlines can be removed.
2021-09-02Merge tag 'hyperv-next-signed-20210831' of ↵Linus Torvalds1-22/+16
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux Pull hyperv updates from Wei Liu: - make Hyper-V code arch-agnostic (Michael Kelley) - fix sched_clock behaviour on Hyper-V (Ani Sinha) - fix a fault when Linux runs as the root partition on MSHV (Praveen Kumar) - fix VSS driver (Vitaly Kuznetsov) - cleanup (Sonia Sharma) * tag 'hyperv-next-signed-20210831' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux: hv_utils: Set the maximum packet size for VSS driver to the length of the receive buffer Drivers: hv: Enable Hyper-V code to be built on ARM64 arm64: efi: Export screen_info arm64: hyperv: Initialize hypervisor on boot arm64: hyperv: Add panic handler arm64: hyperv: Add Hyper-V hypercall and register access utilities x86/hyperv: fix root partition faults when writing to VP assist page MSR hv: hyperv.h: Remove unused inline functions drivers: hv: Decouple Hyper-V clock/timer code from VMbus drivers x86/hyperv: add comment describing TSC_INVARIANT_CONTROL MSR setting bit 0 Drivers: hv: Move Hyper-V misc functionality to arch-neutral code Drivers: hv: Add arch independent default functions for some Hyper-V handlers Drivers: hv: Make portions of Hyper-V init code be arch neutral x86/hyperv: fix for unwanted manipulation of sched_clock when TSC marked unstable asm-generic/hyperv: Add missing #include of nmi.h
2021-09-01drivers: base: cacheinfo: Get rid of DEFINE_SMP_CALL_CACHE_FUNCTION()Thomas Gleixner1-5/+2
DEFINE_SMP_CALL_CACHE_FUNCTION() was usefel before the CPU hotplug rework to ensure that the cache related functions are called on the upcoming CPU because the notifier itself could run on any online CPU. The hotplug state machine guarantees that the callbacks are invoked on the upcoming CPU. So there is no need to have this SMP function call obfuscation. That indirection was missed when the hotplug notifiers were converted. This also solves the problem of ARM64 init_cache_level() invoking ACPI functions which take a semaphore in that context. That's invalid as SMP function calls run with interrupts disabled. Running it just from the callback in context of the CPU hotplug thread solves this. Fixes: 8571890e1513 ("arm64: Add support for ACPI based firmware tables") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Will Deacon <will@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/871r69ersb.ffs@tglx
2021-08-31Merge tag 'x86-cpu-2021-08-30' of ↵Linus Torvalds1-0/+70
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cache flush updates from Thomas Gleixner: "A reworked version of the opt-in L1D flush mechanism. This is a stop gap for potential future speculation related hardware vulnerabilities and a mechanism for truly security paranoid applications. It allows a task to request that the L1D cache is flushed when the kernel switches to a different mm. This can be requested via prctl(). Changes vs the previous versions: - Get rid of the software flush fallback - Make the handling consistent with other mitigations - Kill the task when it ends up on a SMT enabled core which defeats the purpose of L1D flushing obviously" * tag 'x86-cpu-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Documentation: Add L1D flushing Documentation x86, prctl: Hook L1D flushing in via prctl x86/mm: Prepare for opt-in based L1D flush in switch_mm() x86/process: Make room for TIF_SPEC_L1D_FLUSH sched: Add task_work callback for paranoid L1D flush x86/mm: Refactor cond_ibpb() to support other use cases x86/smp: Add a per-cpu view of SMT state
2021-08-30Merge tag 'perf-core-2021-08-30' of ↵Linus Torvalds2-1/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf event updates from Ingo Molnar: - Add support for Intel Sapphire Rapids server CPU uncore events - Allow the AMD uncore driver to be built as a module - Misc cleanups and fixes * tag 'perf-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) perf/x86/amd/ibs: Add bitfield definitions in new <asm/amd-ibs.h> header perf/amd/uncore: Allow the driver to be built as a module x86/cpu: Add get_llc_id() helper function perf/amd/uncore: Clean up header use, use <linux/ include paths instead of <asm/ perf/amd/uncore: Simplify code, use free_percpu()'s built-in check for NULL perf/hw_breakpoint: Replace deprecated CPU-hotplug functions perf/x86/intel: Replace deprecated CPU-hotplug functions perf/x86: Remove unused assignment to pointer 'e' perf/x86/intel/uncore: Fix IIO cleanup mapping procedure for SNR/ICX perf/x86/intel/uncore: Support IMC free-running counters on Sapphire Rapids server perf/x86/intel/uncore: Support IIO free-running counters on Sapphire Rapids server perf/x86/intel/uncore: Factor out snr_uncore_mmio_map() perf/x86/intel/uncore: Add alias PMU name perf/x86/intel/uncore: Add Sapphire Rapids server MDF support perf/x86/intel/uncore: Add Sapphire Rapids server M3UPI support perf/x86/intel/uncore: Add Sapphire Rapids server UPI support perf/x86/intel/uncore: Add Sapphire Rapids server M2M support perf/x86/intel/uncore: Add Sapphire Rapids server IMC support perf/x86/intel/uncore: Add Sapphire Rapids server PCU support perf/x86/intel/uncore: Add Sapphire Rapids server M2PCIe support ...
2021-08-30Merge tag 'x86_cleanups_for_v5.15' of ↵Linus Torvalds3-17/+17
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Borislav Petkov: "The usual round of minor cleanups and fixes" * tag 'x86_cleanups_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kaslr: Have process_mem_region() return a boolean x86/power: Fix kernel-doc warnings in cpu.c x86/mce/inject: Replace deprecated CPU-hotplug functions. x86/microcode: Replace deprecated CPU-hotplug functions. x86/mtrr: Replace deprecated CPU-hotplug functions. x86/mmiotrace: Replace deprecated CPU-hotplug functions.
2021-08-30Merge tag 'x86_cache_for_v5.15' of ↵Linus Torvalds6-595/+592
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 resource control updates from Borislav Petkov: "A first round of changes towards splitting the arch-specific bits from the filesystem bits of resctrl, the ultimate goal being to support ARM's equivalent technology MPAM, with the same fs interface (James Morse)" * tag 'x86_cache_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits) x86/resctrl: Make resctrl_arch_get_config() return its value x86/resctrl: Merge the CDP resources x86/resctrl: Expand resctrl_arch_update_domains()'s msr_param range x86/resctrl: Remove rdt_cdp_peer_get() x86/resctrl: Merge the ctrl_val arrays x86/resctrl: Calculate the index from the configuration type x86/resctrl: Apply offset correction when config is staged x86/resctrl: Make ctrlval arrays the same size x86/resctrl: Pass configuration type to resctrl_arch_get_config() x86/resctrl: Add a helper to read a closid's configuration x86/resctrl: Rename update_domains() to resctrl_arch_update_domains() x86/resctrl: Allow different CODE/DATA configurations to be staged x86/resctrl: Group staged configuration into a separate struct x86/resctrl: Move the schemata names into struct resctrl_schema x86/resctrl: Add a helper to read/set the CDP configuration x86/resctrl: Swizzle rdt_resource and resctrl_schema in pseudo_lock_region x86/resctrl: Pass the schema to resctrl filesystem functions x86/resctrl: Add resctrl_arch_get_num_closid() x86/resctrl: Store the effective num_closid in the schema x86/resctrl: Walk the resctrl schema list instead of an arch list ...
2021-08-30Merge tag 'ras_core_for_v5.15' of ↵Linus Torvalds1-3/+8
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS update from Borislav Petkov: "A single RAS change for 5.15: - Do not start processing MCEs logged early because the decoding chain is not up yet - delay that processing until everything is ready" * tag 'ras_core_for_v5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Defer processing of early errors
2021-08-26x86/cpu: Add get_llc_id() helper functionKim Phillips2-1/+7
Factor out a helper function rather than export cpu_llc_id, which is needed in order to be able to build the AMD uncore driver as a module. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210817221048.88063-7-kim.phillips@amd.com
2021-08-24x86/mce: Defer processing of early errorsBorislav Petkov1-3/+8
When a fatal machine check results in a system reset, Linux does not clear the error(s) from machine check bank(s) - hardware preserves the machine check banks across a warm reset. During initialization of the kernel after the reboot, Linux reads, logs, and clears all machine check banks. But there is a problem. In: 5de97c9f6d85 ("x86/mce: Factor out and deprecate the /dev/mcelog driver") the call to mce_register_decode_chain() moved later in the boot sequence. This means that /dev/mcelog doesn't see those early error logs. This was partially fixed by: cd9c57cad3fe ("x86/MCE: Dump MCE to dmesg if no consumers") which made sure that the logs were not lost completely by printing to the console. But parsing console logs is error prone. Users of /dev/mcelog should expect to find any early errors logged to standard places. Add a new flag MCP_QUEUE_LOG to machine_check_poll() to be used in early machine check initialization to indicate that any errors found should just be queued to genpool. When mcheck_late_init() is called it will call mce_schedule_work() to actually log and flush any errors queued in the genpool. [ Based on an original patch, commit message by and completely productized by Tony Luck. ] Fixes: 5de97c9f6d85 ("x86/mce: Factor out and deprecate the /dev/mcelog driver") Reported-by: Sumanth Kamatala <skamatala@juniper.net> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210824003129.GA1642753@agluck-desk2.amr.corp.intel.com
2021-08-22x86/resctrl: Fix a maybe-uninitialized build warning treated as errorBabu Moger1-0/+6
The recent commit 064855a69003 ("x86/resctrl: Fix default monitoring groups reporting") caused a RHEL build failure with an uninitialized variable warning treated as an error because it removed the default case snippet. The RHEL Makefile uses '-Werror=maybe-uninitialized' to force possibly uninitialized variable warnings to be treated as errors. This is also reported by smatch via the 0day robot. The error from the RHEL build is: arch/x86/kernel/cpu/resctrl/monitor.c: In function ‘__mon_event_count’: arch/x86/kernel/cpu/resctrl/monitor.c:261:12: error: ‘m’ may be used uninitialized in this function [-Werror=maybe-uninitialized] m->chunks += chunks; ^~ The upstream Makefile does not build using '-Werror=maybe-uninitialized'. So, the problem is not seen there. Fix the problem by putting back the default case snippet. [ bp: note that there's nothing wrong with the code and other compilers do not trigger this warning - this is being done just so the RHEL compiler is happy. ] Fixes: 064855a69003 ("x86/resctrl: Fix default monitoring groups reporting") Reported-by: Terry Bowman <Terry.Bowman@amd.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/162949631908.23903.17090272726012848523.stgit@bmoger-ubuntu
2021-08-12x86/resctrl: Fix default monitoring groups reportingBabu Moger1-14/+13
Creating a new sub monitoring group in the root /sys/fs/resctrl leads to getting the "Unavailable" value for mbm_total_bytes and mbm_local_bytes on the entire filesystem. Steps to reproduce: 1. mount -t resctrl resctrl /sys/fs/resctrl/ 2. cd /sys/fs/resctrl/ 3. cat mon_data/mon_L3_00/mbm_total_bytes 23189832 4. Create sub monitor group: mkdir mon_groups/test1 5. cat mon_data/mon_L3_00/mbm_total_bytes Unavailable When a new monitoring group is created, a new RMID is assigned to the new group. But the RMID is not active yet. When the events are read on the new RMID, it is expected to report the status as "Unavailable". When the user reads the events on the default monitoring group with multiple subgroups, the events on all subgroups are consolidated together. Currently, if any of the RMID reads report as "Unavailable", then everything will be reported as "Unavailable". Fix the issue by discarding the "Unavailable" reads and reporting all the successful RMID reads. This is not a problem on Intel systems as Intel reports 0 on Inactive RMIDs. Fixes: d89b7379015f ("x86/intel_rdt/cqm: Add mon_data") Reported-by: Paweł Szulik <pawel.szulik@intel.com> Signed-off-by: Babu Moger <Babu.Moger@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=213311 Link: https://lkml.kernel.org/r/162793309296.9224.15871659871696482080.stgit@bmoger-ubuntu
2021-08-11x86/resctrl: Make resctrl_arch_get_config() return its valueJames Morse3-16/+19
resctrl_arch_get_config() has no return, but does pass a single value back via one of its arguments. Return the value instead. Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210811163831.14917-1-james.morse@arm.com
2021-08-11x86/resctrl: Merge the CDP resourcesJames Morse3-239/+85
resctrl uses struct rdt_resource to describe the available hardware resources. The domains of the CDP aliases share a single ctrl_val[] array. The only differences between the struct rdt_hw_resource aliases is the name and conf_type. The name from struct rdt_hw_resource is visible to user-space. To support another architecture, as many user-visible details should be handled in the filesystem parts of the code that is common to all architectures. The name and conf_type go together. Remove conf_type and the CDP aliases. When CDP is supported and enabled, schemata_list_create() can create two schemata using the single resource, generating the CODE/DATA suffix to the schema name itself. This allows the alloc_ctrlval_array() and complications around free()ing the ctrl_val arrays to be removed. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <jamie@nuviainc.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lkml.kernel.org/r/20210728170637.25610-25-james.morse@arm.com
2021-08-11x86/resctrl: Expand resctrl_arch_update_domains()'s msr_param rangeJames Morse1-3/+9
resctrl_arch_update_domains() specifies the one closid that has been modified and needs copying to the hardware. resctrl_arch_update_domains() takes a struct rdt_resource and a closid as arguments, but copies all the staged configurations for that closid into the ctrl_val[] array. resctrl_arch_update_domains() is called once per schema, but once the resources and domains are merged, the second call of a L2CODE/L2DATA pair will find no staged configurations, as they were previously applied. The msr_param of the first call only has one index, so would only have update the hardware for the last staged configuration. To avoid a second round of IPIs when changing L2CODE and L2DATA in one go, expand the range of the msr_param if multiple staged configurations are found. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lkml.kernel.org/r/20210728170637.25610-24-james.morse@arm.com
2021-08-11x86/resctrl: Remove rdt_cdp_peer_get()James Morse1-85/+14
When CDP is enabled, rdt_cdp_peer_get() finds the alternative CODE/DATA resource and returns the alternative domain. This is used to determine if bitmaps overlap when there are aliased entries in the two struct rdt_hw_resources. Now that the ctrl_val[] used by the CODE/DATA resources is the same, the search for an alternate resource/domain is not needed. Replace rdt_cdp_peer_get() with resctrl_peer_type(), which returns the alternative type. This can be passed to resctrl_arch_get_config() with the same resource and domain. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <jamie@nuviainc.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lkml.kernel.org/r/20210728170637.25610-23-james.morse@arm.com
2021-08-11x86/resctrl: Merge the ctrl_val arraysJames Morse1-4/+61
Each struct rdt_hw_resource has its own ctrl_val[] array. When CDP is enabled, two resources are in use, each with its own ctrl_val[] array that holds half of the configuration used by hardware. One uses the odd slots, the other the even. rdt_cdp_peer_get() is the helper to find the alternate resource, its domain, and corresponding entry in the other ctrl_val[] array. Once the CDP resources are merged there will be one struct rdt_hw_resource and one ctrl_val[] array for each hardware resource. This will include changes to rdt_cdp_peer_get(), making it hard to bisect any issue. Merge the ctrl_val[] arrays for three CODE/DATA/NONE resources first. Doing this before merging the resources temporarily complicates allocating and freeing the ctrl_val arrays. Add a helper to allocate the ctrl_val array, that returns the value on the L2 or L3 resource if it already exists. This gets removed once the resources are merged, and there really is only one ctrl_val[] array. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <jamie@nuviainc.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lkml.kernel.org/r/20210728170637.25610-22-james.morse@arm.com
2021-08-11x86/resctrl: Calculate the index from the configuration typeJames Morse2-22/+15
resctrl uses cbm_idx() to map a closid to an index in the configuration array. This is based on a multiplier and offset that are held in the resource. To merge the resources, the resctrl arch code needs to calculate the index from something else, as there will only be one resource. Decide based on the staged configuration type. This makes the static mult and offset parameters redundant. [ bp: Remove superfluous brackets in get_config_index() ] Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <jamie@nuviainc.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lkml.kernel.org/r/20210728170637.25610-21-james.morse@arm.com
2021-08-11x86/resctrl: Apply offset correction when config is stagedJames Morse4-31/+27
When resctrl comes to copy the CAT MSR values from the ctrl_val[] array into hardware, it applies an offset adjustment based on the type of the resource. CODE and DATA resources have their closid mapped into an odd/even range. This mapping is based on a property of the resource. This happens once the new control value has been written to the ctrl_val[] array. Once the CDP resources are merged, there will only be a single property that needs to cover both odd/even mappings to the single ctrl_val[] array. The offset adjustment must be applied before the new value is written to the array. Move the logic from cat_wrmsr() to resctrl_arch_update_domains(). The value provided to apply_config() is now an index in the array, not the closid. The parameters provided via struct msr_param are now indexes too. As resctrl's use of closid is a u32, struct msr_param's type is changed to match. With this, the CODE and DATA resources only use the odd or even indexes in the array. This allows the temporary num_closid/2 fixes in domain_setup_ctrlval() and reset_all_ctrls() to be removed. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <jamie@nuviainc.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lkml.kernel.org/r/20210728170637.25610-20-james.morse@arm.com
2021-08-11x86/resctrl: Make ctrlval arrays the same sizeJames Morse2-1/+18
The CODE and DATA resources report a num_closid that is half the actual size supported by the hardware. This behaviour is visible to user-space when CDP is enabled. The CODE and DATA resources have their own ctrlval arrays which are half the size of the underlying hardware because num_closid was already adjusted. One holds the odd configurations values, the other even. Before the CDP resources can be merged, the 'half the closids' behaviour needs to be implemented by schemata_list_create(), but this causes the ctrl_val[] array to be full sized. Remove the logic from the architecture specific rdt_get_cdp_config() setup, and add it to schemata_list_create(). Functions that walk all the configurations, such as domain_setup_ctrlval() and reset_all_ctrls(), take num_closid directly from struct rdt_hw_resource also have to halve num_closid as only the lower half of each array is in use. domain_setup_ctrlval() and reset_all_ctrls() both copy struct rdt_hw_resource's num_closid to a struct msr_param. Correct the value here. This is temporary as a subsequent patch will merge all three ctrl_val[] arrays such that when CDP is in use, the CODA/DATA layout in the array matches the hardware. reset_all_ctrls()'s loop over the whole of ctrl_val[] is not touched as this is harmless, and will be required as it is once the resources are merged. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <jamie@nuviainc.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lkml.kernel.org/r/20210728170637.25610-19-james.morse@arm.com
2021-08-11x86/resctrl: Pass configuration type to resctrl_arch_get_config()James Morse3-15/+27
The ctrl_val[] array for a struct rdt_hw_resource only holds configurations of one type. The type is implicit. Once the CDP resources are merged, the ctrl_val[] array will hold all the configurations for the hardware resource. When a particular type of configuration is needed, it must be specified explicitly. Pass the expected type from the schema into resctrl_arch_get_config(). Nothing uses this yet, but once a single ctrl_val[] array is used for the three struct rdt_hw_resources that share hardware, the type will be used to return the correct configuration value from the shared array. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <jamie@nuviainc.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lkml.kernel.org/r/20210728170637.25610-18-james.morse@arm.com
2021-08-11x86/resctrl: Add a helper to read a closid's configurationJames Morse3-30/+35
Functions like show_doms() reach into the architecture's private structure to retrieve the configuration from the struct rdt_hw_resource. The hardware configuration may look completely different to the values resctrl gets from user-space. The staged configuration and resctrl_arch_update_domains() allow the architecture to convert or translate these values. Resctrl shouldn't read or write the ctrl_val[] values directly. Add a helper to read the current configuration. This will allow another architecture to scale the bitmaps if necessary, and possibly use controls that don't take the user-space control format at all. Of the remaining functions that access ctrl_val[] directly, apply_config() is part of the architecture-specific code, and is called via resctrl_arch_update_domains(). reset_all_ctrls() will be an architecture specific helper. update_mba_bw() manipulates both ctrl_val[], mbps_val[] and the hardware. The mbps_val[] that matches the mba_sc state of the resource is changed, but the other is left unchanged. Abstracting this is the subject of later patches that affect set_mba_sc() too. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <jamie@nuviainc.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lkml.kernel.org/r/20210728170637.25610-17-james.morse@arm.com
2021-08-11x86/resctrl: Rename update_domains() to resctrl_arch_update_domains()James Morse3-4/+3
update_domains() merges the staged configuration changes into the arch codes configuration array. Rename to make it clear it is part of the arch code interface to resctrl. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <jamie@nuviainc.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Babu Moger <babu.moger@amd.com> Link: https://lkml.kernel.org/r/20210728170637.25610-16-james.morse@arm.com