path: root/Documentation/admin-guide
Age  Commit message  Author  Files  Lines
2025-10-29  fs: Add 'initramfs_options' to set initramfs mount options  (Lichen Liu, 1 file, -0/+3)
[ Upstream commit 278033a225e13ec21900f0a92b8351658f5377f2 ] When CONFIG_TMPFS is enabled, the initial root filesystem is a tmpfs. By default, a tmpfs mount is limited to using 50% of the available RAM for its content. This can be problematic in memory-constrained environments, particularly during a kdump capture. In a kdump scenario, the capture kernel boots with a limited amount of memory specified by the 'crashkernel' parameter. If the initramfs is large, it may fail to unpack into the tmpfs rootfs due to insufficient space. This is because to get X MB of usable space in tmpfs, 2*X MB of memory must be available for the mount. This leads to an OOM failure during the early boot process, preventing a successful crash dump. This patch introduces a new kernel command-line parameter, initramfs_options, which allows passing specific mount options directly to the rootfs when it is first mounted. This gives users control over the rootfs behavior. For example, a user can now specify initramfs_options=size=75% to allow the tmpfs to use up to 75% of the available memory. This can significantly reduce the memory pressure for kdump. Consider a practical example: To unpack a 48MB initramfs, the tmpfs needs 48MB of usable space. With the default 50% limit, this requires a memory pool of 96MB to be available for the tmpfs mount. The total memory requirement is therefore approximately: 16MB (vmlinuz) + 48MB (loaded initramfs) + 48MB (unpacked kernel) + 96MB (for tmpfs) + 12MB (runtime overhead) ≈ 220MB. By using initramfs_options=size=75%, the memory pool required for the 48MB tmpfs is reduced to 48MB / 0.75 = 64MB. This reduces the total memory requirement by 32MB (96MB - 64MB), allowing the kdump to succeed with a smaller crashkernel size, such as 192MB. An alternative approach of reusing the existing rootflags parameter was considered. However, a new, dedicated initramfs_options parameter was chosen to avoid altering the current behavior of rootflags (which applies to the final root filesystem) and to prevent any potential regressions. Also add documentation for the new kernel parameter "initramfs_options" This approach is inspired by prior discussions and patches on the topic. Ref: https://www.lightofdawn.org/blog/?viewDetailed=00128 Ref: https://landley.net/notes-2015.html#01-01-2015 Ref: https://lkml.org/lkml/2021/6/29/783 Ref: https://www.kernel.org/doc/html/latest/filesystems/ramfs-rootfs-initramfs.html#what-is-rootfs Signed-off-by: Lichen Liu <lichliu@redhat.com> Link: https://lore.kernel.org/20250815121459.3391223-1-lichliu@redhat.com Tested-by: Rob Landley <rob@landley.net> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
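A worked sketch of the numbers above (the crashkernel size is the one quoted in this entry; the rest of the command line and the bootloader configuration are left out):
    # kdump capture kernel command line fragment
    crashkernel=192M initramfs_options=size=75%
    # tmpfs pool needed to unpack a 48MB initramfs:
    #   default 50% limit:  48MB / 0.50 = 96MB
    #   with size=75%:      48MB / 0.75 = 64MB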
2025-09-11  x86/vmscape: Enable the mitigation  (Pawan Gupta, 1 file, -0/+11)
commit 556c1ad666ad90c50ec8fccb930dd5046cfbecfb upstream. Enable the previously added mitigation for VMscape. Add the cmdline vmscape={off|ibpb|force} and sysfs reporting. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Amit Shah <amit.shah@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
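A minimal usage sketch; the sysfs file name below follows the usual vulnerabilities/ naming and is an assumption, not something quoted in this entry:
    # select the IBPB-based mitigation on the kernel command line
    vmscape=ibpb
    # after boot, check the reported status (file name assumed)
    cat /sys/devices/system/cpu/vulnerabilities/vmscape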
2025-09-11  Documentation/hw-vuln: Add VMSCAPE documentation  (Pawan Gupta, 2 files, -0/+111)
commit 9969779d0803f5dcd4460ae7aca2bc3fd91bff12 upstream. VMSCAPE is a vulnerability that may allow a guest to influence the branch prediction in host userspace, particularly affecting hypervisors like QEMU. Add the documentation. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Amit Shah <amit.shah@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17  x86/bugs: Add a Transient Scheduler Attacks mitigation  (Borislav Petkov, 1 file, -0/+13)
From: "Borislav Petkov (AMD)" <bp@alien8.de> Commit d8010d4ba43e9f790925375a7de100604a5e2dba upstream. Add the required features detection glue to bugs.c et all in order to support the TSA mitigation. Co-developed-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17  x86/bugs: Rename MDS machinery to something more generic  (Borislav Petkov, 1 file, -3/+1)
From: "Borislav Petkov (AMD)" <bp@alien8.de> Commit f9af88a3d384c8b55beb5dc5483e5da0135fadbd upstream. It will be used by other x86 mitigations. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17  x86/its: Add "vmexit" option to skip mitigation on some CPUs  (Pawan Gupta, 1 file, -0/+2)
commit 2665281a07e19550944e8354a2024635a7b2714a upstream. Ice Lake generation CPUs are not affected by guest/host isolation part of ITS. If a user is only concerned about KVM guests, they can now choose a new cmdline option "vmexit" that will not deploy the ITS mitigation when CPU is not affected by guest/host isolation. This saves the performance overhead of ITS mitigation on Ice Lake gen CPUs. When "vmexit" option selected, if the CPU is affected by ITS guest/host isolation, the default ITS mitigation is deployed. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-07-17  x86/its: Enable Indirect Target Selection mitigation  (Pawan Gupta, 1 file, -0/+13)
commit f4818881c47fd91fcb6d62373c57c7844e3de1c0 upstream. Indirect Target Selection (ITS) is a bug in some pre-ADL Intel CPUs with eIBRS. It affects prediction of indirect branch and RETs in the lower half of cacheline. Due to ITS such branches may get wrongly predicted to a target of (direct or indirect) branch that is located in the upper half of the cacheline. Scope of impact =============== Guest/host isolation -------------------- When eIBRS is used for guest/host isolation, the indirect branches in the VMM may still be predicted with targets corresponding to branches in the guest. Intra-mode ---------- cBPF or other native gadgets can be used for intra-mode training and disclosure using ITS. User/kernel isolation --------------------- When eIBRS is enabled user/kernel isolation is not impacted. Indirect Branch Prediction Barrier (IBPB) ----------------------------------------- After an IBPB, indirect branches may be predicted with targets corresponding to direct branches which were executed prior to IBPB. This is mitigated by a microcode update. Add cmdline parameter indirect_target_selection=off|on|force to control the mitigation to relocate the affected branches to an ITS-safe thunk i.e. located in the upper half of cacheline. Also add the sysfs reporting. When retpoline mitigation is deployed, ITS safe-thunks are not needed, because retpoline sequence is already ITS-safe. Similarly, when call depth tracking (CDT) mitigation is deployed (retbleed=stuff), ITS safe return thunk is not used, as CDT prevents RSB-underflow. To not overcomplicate things, ITS mitigation is not supported with spectre-v2 lfence;jmp mitigation. Moreover, it is less practical to deploy lfence;jmp mitigation on ITS affected parts anyways. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
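A hedged sketch of the new control, including the "vmexit" value described in the entry above; the sysfs file name is assumed from the usual vulnerabilities/ naming:
    # force the mitigation, or deploy it only for the guest/host case
    indirect_target_selection=force
    indirect_target_selection=vmexit
    # reported status after boot (file name assumed)
    cat /sys/devices/system/cpu/vulnerabilities/indirect_target_selection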
2025-07-17  Documentation: x86/bugs/its: Add ITS documentation  (Pawan Gupta, 2 files, -0/+157)
commit 1ac116ce6468670eeda39345a5585df308243dca upstream. Add the admin-guide for Indirect Target Selection (ITS). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27  arm64: spectre: increase parameters that can be used to turn off bhb mitigation individually  (Liu Song, 1 file, -0/+5)
[ Upstream commit 877ace9eab7de032f954533afd5d1ecd0cf62eaf ] In our environment, it was found that the BHB mitigation has a significant impact on benchmark performance. For example, in the lmbench test, the "process fork && exit" test performance drops by 20%. So it is necessary to have the ability to turn off the mitigation individually through the cmdline, thus avoiding having to recompile the kernel with an adjusted config. Signed-off-by: Liu Song <liusong@linux.alibaba.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: https://lore.kernel.org/r/1661514050-22263-1-git-send-email-liusong@linux.alibaba.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Pu Lehui <pulehui@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27  Revert "x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2" on v6.6 and older  (Breno Leitao, 1 file, -2/+0)
This reverts commit 0aa2553778b7531c70de77d729a38aea77626544 which is commit 98fdaeb296f51ef08e727a7cc72e5b5c864c4f4d upstream. commit 7adb96687ce8 ("x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2") depends on commit 72c70f480a70 ("x86/bugs: Add a separate config for Spectre V2"), which introduced MITIGATION_SPECTRE_V2. commit 72c70f480a70 ("x86/bugs: Add a separate config for Spectre V2") never landed in the stable trees, so the stable trees do not have MITIGATION_SPECTRE_V2, and commit 7adb96687ce8 ("x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2") has no value if that dependency is not applied. Revert commit 7adb96687ce8 ("x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2") in the stable kernels where it landed: 5.4.294, 5.10.238, 5.15.185, 6.1.141 and 6.6.93. Cc: David.Kaplan@amd.com Cc: peterz@infradead.org Cc: pawan.kumar.gupta@linux.intel.com Cc: mingo@kernel.org Cc: brad.spengler@opensrcsec.com Cc: stable@vger.kernel.org # 6.6 6.1 5.15 5.10 5.4 Reported-by: Brad Spengler <brad.spengler@opensrcsec.com> Reported-by: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-04  x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2  (Breno Leitao, 1 file, -0/+2)
[ Upstream commit 98fdaeb296f51ef08e727a7cc72e5b5c864c4f4d ] Change the default value of spectre v2 in user mode to respect the CONFIG_MITIGATION_SPECTRE_V2 config option. Currently, user mode spectre v2 is set to auto (SPECTRE_V2_USER_CMD_AUTO) by default, even if CONFIG_MITIGATION_SPECTRE_V2 is disabled. Set the spectre_v2 value to auto (SPECTRE_V2_USER_CMD_AUTO) if the Spectre v2 config (CONFIG_MITIGATION_SPECTRE_V2) is enabled, otherwise set the value to none (SPECTRE_V2_USER_CMD_NONE). Important to say the command line argument "spectre_v2_user" overwrites the default value in both cases. When CONFIG_MITIGATION_SPECTRE_V2 is not set, users have the flexibility to opt-in for specific mitigations independently. In this scenario, setting spectre_v2= will not enable spectre_v2_user=, and command line options spectre_v2_user and spectre_v2 are independent when CONFIG_MITIGATION_SPECTRE_V2=n. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Kaplan <David.Kaplan@amd.com> Link: https://lore.kernel.org/r/20241031-x86_bugs_last_v2-v2-2-b7ff1dab840e@debian.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2025-03-13  clocksource: Limit number of CPUs checked for clock synchronization  (Paul E. McKenney, 1 file, -0/+10)
[ Upstream commit fa218f1cce6ba40069c8daab8821de7e6be1cdd0 ] Currently, if skew is detected on a clock marked CLOCK_SOURCE_VERIFY_PERCPU, that clock is checked on all CPUs. This is thorough, but might not be what you want on a system with a few tens of CPUs, let alone a few hundred of them. Therefore, by default check only up to eight randomly chosen CPUs. Also provide a new clocksource.verify_n_cpus kernel boot parameter. A value of -1 says to check all of the CPUs, and a non-negative value says to randomly select that number of CPUs, without concern about selecting the same CPU multiple times. However, make use of a cpumask so that a given CPU will be checked at most once. Suggested-by: Thomas Gleixner <tglx@linutronix.de> # For verify_n_cpus=1. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-3-paulmck@kernel.org Stable-dep-of: 6bb05a33337b ("clocksource: Use migrate_disable() to avoid calling get_random_u32() in atomic context") Signed-off-by: Sasha Levin <sashal@kernel.org>
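For example, using the values described above:
    # check every CPU for clocksource skew
    clocksource.verify_n_cpus=-1
    # or spot-check four randomly chosen CPUs (the default is eight)
    clocksource.verify_n_cpus=4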
2024-10-17  proc: add config & param to block forcing mem writes  (Adrian Ratiu, 1 file, -0/+10)
[ Upstream commit 41e8149c8892ed1962bd15350b3c3e6e90cba7f4 ] This adds a Kconfig option and boot param to allow removing the FOLL_FORCE flag from /proc/pid/mem write calls because it can be abused. The traditional forcing behavior is kept as default because it can break GDB and some other use cases. Previously we tried a more sophisticated approach allowing distributions to fine-tune /proc/pid/mem behavior, however that got NAK-ed by Linus [1], who prefers this simpler approach with semantics also easier to understand for users. Link: https://lore.kernel.org/lkml/CAHk-=wiGWLChxYmUA5HrT5aopZrB7_2VTa0NLZcxORgkUe5tEQ@mail.gmail.com/ [1] Cc: Doug Anderson <dianders@chromium.org> Cc: Jeff Xu <jeffxu@google.com> Cc: Jann Horn <jannh@google.com> Cc: Kees Cook <kees@kernel.org> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Adrian Ratiu <adrian.ratiu@collabora.com> Link: https://lore.kernel.org/r/20240802080225.89408-1-adrian.ratiu@collabora.com Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2024-04-13  x86/cpu: Enable STIBP on AMD if Automatic IBRS is enabled  (Kim Phillips, 1 file, -4/+7)
commit fd470a8beed88440b160d690344fbae05a0b9b1b upstream. Unlike Intel's Enhanced IBRS feature, AMD's Automatic IBRS does not provide protection to processes running at CPL3/user mode, see section "Extended Feature Enable Register (EFER)" in the APM v2 at https://bugzilla.kernel.org/attachment.cgi?id=304652 Explicitly enable STIBP to protect against cross-thread CPL3 branch target injections on systems with Automatic IBRS enabled. Also update the relevant documentation. Fixes: e7862eda309e ("x86/cpu: Support AMD Automatic IBRS") Reported-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20230720194727.67022-1-kim.phillips@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-13  x86/rfds: Mitigate Register File Data Sampling (RFDS)  (Pawan Gupta, 1 file, -0/+21)
commit 8076fcde016c9c0e0660543e67bff86cb48a7c9c upstream. RFDS is a CPU vulnerability that may allow userspace to infer kernel stale data previously used in floating point registers, vector registers and integer registers. RFDS only affects certain Intel Atom processors. Intel released a microcode update that uses VERW instruction to clear the affected CPU buffers. Unlike MDS, none of the affected cores support SMT. Add RFDS bug infrastructure and enable the VERW based mitigation by default, that clears the affected buffers just before exiting to userspace. Also add sysfs reporting and cmdline parameter "reg_file_data_sampling" to control the mitigation. For details see: Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst [ pawan: - Resolved conflicts in sysfs reporting. - s/ATOM_GRACEMONT/ALDERLAKE_N/ATOM_GRACEMONT is called ALDERLAKE_N in 6.6. ] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
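A short usage sketch; the sysfs file name is assumed from the parameter name rather than quoted in this entry:
    # opt out of the RFDS mitigation on the kernel command line
    reg_file_data_sampling=off
    # reported status after boot (file name assumed)
    cat /sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling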
2024-04-13  Documentation/hw-vuln: Add documentation for RFDS  (Pawan Gupta, 2 files, -0/+105)
commit 4e42765d1be01111df0c0275bbaf1db1acef346e upstream. Add the documentation for transient execution vulnerability Register File Data Sampling (RFDS) that affects Intel Atom CPUs. [ pawan: s/ATOM_GRACEMONT/ALDERLAKE_N/ ] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-13  x86/cpu: Support AMD Automatic IBRS  (Kim Phillips, 2 files, -6/+6)
commit e7862eda309ecfccc36bb5558d937ed3ace07f3f upstream. The AMD Zen4 core supports a new feature called Automatic IBRS. It is a "set-and-forget" feature that means that, like Intel's Enhanced IBRS, h/w manages its IBRS mitigation resources automatically across CPL transitions. The feature is advertised by CPUID_Fn80000021_EAX bit 8 and is enabled by setting MSR C000_0080 (EFER) bit 21. Enable Automatic IBRS by default if the CPU feature is present. It typically provides greater performance over the incumbent generic retpolines mitigation. Reuse the SPECTRE_V2_EIBRS spectre_v2_mitigation enum. AMD Automatic IBRS and Intel Enhanced IBRS have similar enablement. Add NO_EIBRS_PBRSB to cpu_vuln_whitelist, since AMD Automatic IBRS isn't affected by PBRSB-eIBRS. The kernel command line option spectre_v2=eibrs is used to select AMD Automatic IBRS, if available. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Sean Christopherson <seanjc@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20230124163319.2277355-8-kim.phillips@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-04-13  Documentation/hw-vuln: Update spectre doc  (Lin Yujun, 1 file, -0/+1)
commit 06cb31cc761823ef444ba4e1df11347342a6e745 upstream. commit 7c693f54c873691 ("x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRS") adds the "ibrs " option in Documentation/admin-guide/kernel-parameters.txt but omits it to Documentation/admin-guide/hw-vuln/spectre.rst, add it. Signed-off-by: Lin Yujun <linyujun809@huawei.com> Link: https://lore.kernel.org/r/20220830123614.23007-1-linyujun809@huawei.com Signed-off-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-25  Documentation: sysctl: align cells in second content column  (Bagas Sanjaya, 1 file, -9/+9)
commit 1faa34672f8a17a3e155e74bde9648564e9480d6 upstream. Stephen Rothwell reported htmldocs warning when merging net-next tree: Documentation/admin-guide/sysctl/net.rst:37: WARNING: Malformed table. Text in column margin in table line 4. ========= =================== = ========== ================== Directory Content Directory Content ========= =================== = ========== ================== 802 E802 protocol mptcp Multipath TCP appletalk Appletalk protocol netfilter Network Filter ax25 AX25 netrom NET/ROM bridge Bridging rose X.25 PLP layer core General parameter tipc TIPC ethernet Ethernet protocol unix Unix domain sockets ipv4 IP version 4 x25 X.25 protocol ipv6 IP version 6 ========= =================== = ========== ================== The warning above is caused by cells in second "Content" column of /proc/sys/net subdirectory table which are in column margin. Align these cells against the column header to fix the warning. Link: https://lore.kernel.org/linux-next/20220823134905.57ed08d5@canb.auug.org.au/ Fixes: 1202cdd665315c ("Remove DECnet support from kernel") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20220824035804.204322-1-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-26  x86/cpu: Rename srso_(.*)_alias to srso_alias_\1  (Peter Zijlstra, 1 file, -2/+2)
commit 42be649dd1f2eee6b1fb185f1a231b9494cf095f upstream. For a more consistent namespace. [ bp: Fixup names in the doc too. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230814121148.976236447@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11  Documentation: security-bugs.rst: clarify CVE handling  (Greg Kroah-Hartman, 1 file, -7/+6)
commit 3c1897ae4b6bc7cc586eda2feaa2cd68325ec29c upstream. The kernel security team does NOT assign CVEs, so document that properly and provide the "if you want one, ask MITRE for it" response that we give on a weekly basis in the document, so we don't have to constantly say it to everyone who asks. Link: https://lore.kernel.org/r/2023063022-retouch-kerosene-7e4a@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11  Documentation: security-bugs.rst: update preferences when dealing with the linux-distros group  (Greg Kroah-Hartman, 1 file, -14/+12)
commit 4fee0915e649bd0cea56dece6d96f8f4643df33c upstream. Because the linux-distros group forces reporters to release information about reported bugs, and imposes arbitrary deadlines for fixing those bugs despite its members not actually being kernel developers, the kernel security team recommends not interacting with them at all, as this just causes confusion and the early release of reported security problems. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/2023063020-throat-pantyhose-f110@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-08  x86/srso: Add a Speculative RAS Overflow mitigation  (Borislav Petkov (AMD), 3 files, -0/+145)
Upstream commit: fb3bd914b3ec28f5fb697ac55c4846ac2d542855 Add a mitigation for the speculative return address stack overflow vulnerability found on AMD processors. The mitigation works by ensuring all RET instructions speculate to a controlled location, similar to how speculation is controlled in the retpoline sequence. To accomplish this, the __x86_return_thunk forces the CPU to mispredict every function return using a 'safe return' sequence. To ensure the safety of this mitigation, the kernel must ensure that the safe return sequence is itself free from attacker interference. In Zen3 and Zen4, this is accomplished by creating a BTB alias between the untraining function srso_untrain_ret_alias() and the safe return function srso_safe_ret_alias() which results in evicting a potentially poisoned BTB entry and using that safe one for all function returns. In older Zen1 and Zen2, this is accomplished using a reinterpretation technique similar to Retbleed one: srso_untrain_ret() and srso_safe_ret(). Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-08  Documentation/x86: Fix backwards on/off logic about YMM support  (Dave Hansen, 1 file, -1/+1)
commit 1b0fc0345f2852ffe54fb9ae0e12e2ee69ad6a20 upstream These options clearly turn *off* XSAVE YMM support. Correct the typo. Reported-by: Ben Hutchings <ben@decadent.org.uk> Fixes: 553a5c03e90a ("x86/speculation: Add force option to GDS mitigation") Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-08  x86/speculation: Add force option to GDS mitigation  (Daniel Sneddon, 2 files, -5/+21)
commit 553a5c03e90a6087e88f8ff878335ef0621536fb upstream The Gather Data Sampling (GDS) vulnerability allows malicious software to infer stale data previously stored in vector registers. This may include sensitive data such as cryptographic keys. GDS is mitigated in microcode, and systems with up-to-date microcode are protected by default. However, any affected system that is running with older microcode will still be vulnerable to GDS attacks. Since the gather instructions used by the attacker are part of the AVX2 and AVX512 extensions, disabling these extensions prevents gather instructions from being executed, thereby mitigating the system from GDS. Disabling AVX2 is sufficient, but we don't have the granularity to do this. The XCR0[2] disables AVX, with no option to just disable AVX2. Add a kernel parameter gather_data_sampling=force that will enable the microcode mitigation if available, otherwise it will disable AVX on affected systems. This option will be ignored if cmdline mitigations=off. This is a *big* hammer. It is known to break buggy userspace that uses incomplete, buggy AVX enumeration. Unfortunately, such userspace does exist in the wild: https://www.mail-archive.com/bug-coreutils@gnu.org/msg33046.html [ dhansen: add some more ominous warnings about disabling AVX ] Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
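For example, on a system whose microcode has not been updated:
    # deploy the mitigation anyway; on affected systems this disables AVX,
    # which the entry above warns can break buggy userspace
    gather_data_sampling=force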
2023-08-08  x86/speculation: Add Gather Data Sampling mitigation  (Daniel Sneddon, 3 files, -10/+125)
commit 8974eb588283b7d44a7c91fa09fcbaf380339f3a upstream Gather Data Sampling (GDS) is a hardware vulnerability which allows unprivileged speculative access to data which was previously stored in vector registers. Intel processors that support AVX2 and AVX512 have gather instructions that fetch non-contiguous data elements from memory. On vulnerable hardware, when a gather instruction is transiently executed and encounters a fault, stale data from architectural or internal vector registers may get transiently stored to the destination vector register allowing an attacker to infer the stale data using typical side channel techniques like cache timing attacks. This mitigation is different from many earlier ones for two reasons. First, it is enabled by default and a bit must be set to *DISABLE* it. This is the opposite of normal mitigation polarity. This means GDS can be mitigated simply by updating microcode and leaving the new control bit alone. Second, GDS has a "lock" bit. This lock bit is there because the mitigation affects the hardware security features KeyLocker and SGX. It needs to be enabled and *STAY* enabled for these features to be mitigated against GDS. The mitigation is enabled in the microcode by default. Disable it by setting gather_data_sampling=off or by disabling all mitigations with mitigations=off. The mitigation status can be checked by reading: /sys/devices/system/cpu/vulnerabilities/gather_data_sampling Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
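To check or opt out of the mitigation described above (the sysfs path is the one quoted in this entry):
    # query the mitigation status
    cat /sys/devices/system/cpu/vulnerabilities/gather_data_sampling
    # disable it explicitly on the kernel command line
    gather_data_sampling=off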
2023-06-28  mm/pagealloc: sysctl: change watermark_scale_factor max limit to 30%  (Suren Baghdasaryan, 1 file, -1/+1)
[ Upstream commit 39c65a94cd9661532be150e88f8b02f4a6844a35 ] For embedded systems with low total memory, having to run applications with relatively large memory requirements, 10% max limitation for watermark_scale_factor poses an issue of triggering direct reclaim every time such application is started. This results in slow application startup times and bad end-user experience. By increasing watermark_scale_factor max limit we allow vendors more flexibility to choose the right level of kswapd aggressiveness for their device and workload requirements. Link: https://lkml.kernel.org/r/20211124193604.2758863-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Lukas Middendorf <kernel@tuxforce.de> Cc: Antti Palosaari <crope@iki.fi> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Fengfei Xi <xi.fengfei@h3c.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Stable-dep-of: 935d44acf621 ("memfd: check for non-NULL file_seals in memfd_create() syscall") Signed-off-by: Sasha Levin <sashal@kernel.org>
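An illustrative sysctl session, assuming the usual units of 1/10000 of memory (so 3000 corresponds to the new 30% maximum):
    # read the current value (the kernel default is 10, i.e. 0.1%)
    sysctl vm.watermark_scale_factor
    # let kswapd keep substantially more memory free before it sleeps
    sysctl -w vm.watermark_scale_factor=3000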
2023-06-21  Remove DECnet support from kernel  (Stephen Hemminger, 2 files, -11/+8)
commit 1202cdd665315c525b5237e96e0bedc76d7e754f upstream. DECnet is an obsolete network protocol that receives more attention from kernel janitors than users. It belongs in computer protocol history museum not in Linux kernel. It has been "Orphaned" in kernel since 2010. The iproute2 support for DECnet was dropped in 5.0 release. The documentation link on Sourceforge says it is abandoned there as well. Leave the UAPI alone to keep userspace programs compiling. This means that there is still an empty neighbour table for AF_DECNET. The table of /proc/sys/net entries was updated to match current directories and reformatted to be alphabetical. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: David Ahern <dsahern@kernel.org> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11  mm: memcontrol: deprecate charge moving  (Johannes Weiner, 1 file, -2/+11)
commit da34a8484d162585e22ed8c1e4114aa2f60e3567 upstream. Charge moving mode in cgroup1 allows memory to follow tasks as they migrate between cgroups. This is, and always has been, a questionable thing to do - for several reasons. First, it's expensive. Pages need to be identified, locked and isolated from various MM operations, and reassigned, one by one. Second, it's unreliable. Once pages are charged to a cgroup, there isn't always a clear owner task anymore. Cache isn't moved at all, for example. Mapped memory is moved - but if trylocking or isolating a page fails, it's arbitrarily left behind. Frequent moving between domains may leave a task's memory scattered all over the place. Third, it isn't really needed. Launcher tasks can kick off workload tasks directly in their target cgroup. Using dedicated per-workload groups allows fine-grained policy adjustments - no need to move tasks and their physical pages between control domains. The feature was never forward-ported to cgroup2, and it hasn't been missed. Despite it being a niche usecase, the maintenance overhead of supporting it is enormous. Because pages are moved while they are live and subject to various MM operations, the synchronization rules are complicated. There are lock_page_memcg() in MM and FS code, which non-cgroup people don't understand. In some cases we've been able to shift code and cgroup API calls around such that we can rely on native locking as much as possible. But that's fragile, and sometimes we need to hold MM locks for longer than we otherwise would (pte lock e.g.). Mark the feature deprecated. Hopefully we can remove it soon. And backport into -stable kernels so that people who develop against earlier kernels are warned about this deprecation as early as possible. [akpm@linux-foundation.org: fix memory.rst underlining] Link: https://lkml.kernel.org/r/Y5COd+qXwk/S+n8N@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11  docs: gdbmacros: print newest record  (John Ogness, 1 file, -1/+1)
commit f2e4cca2f670c8e52fbb551a295f2afc9aa2bd72 upstream. @head_id points to the newest record, but the printing loop exits when it increments to this value (before printing). Exit the printing loop after the newest record has been printed. The python-based function in scripts/gdb/linux/dmesg.py already does this correctly. Fixes: e60768311af8 ("scripts/gdb: update for lockless printk ringbuffer") Cc: stable@vger.kernel.org Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20221229134339.197627-1-john.ogness@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11  Documentation/hw-vuln: Document the interaction between IBRS and STIBP  (KP Singh, 1 file, -5/+16)
commit e02b50ca442e88122e1302d4dbc1b71a4808c13f upstream. Explain why STIBP is needed with legacy IBRS as currently implemented (KERNEL_IBRS) and why STIBP is not needed when enhanced IBRS is enabled. Fixes: 7c693f54c873 ("x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRS") Signed-off-by: KP Singh <kpsingh@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230227060541.1939092-2-kpsingh@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-01  panic: Introduce warn_limit  (Kees Cook, 1 file, -0/+10)
commit 9fc9e278a5c0b708eeffaf47d6eb0c82aa74ed78 upstream. Like oops_limit, add warn_limit for limiting the number of warnings when panic_on_warn is not set. Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Petr Mladek <pmladek@suse.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-doc@vger.kernel.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-5-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01  exit: Allow oops_limit to be disabled  (Kees Cook, 1 file, -2/+3)
commit de92f65719cd672f4b48397540b9f9eff67eca40 upstream. In preparation for keeping oops_limit logic in sync with warn_limit, have oops_limit == 0 disable checking the Oops counter. Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-doc@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01  exit: Put an upper limit on how often we can oops  (Jann Horn, 1 file, -0/+8)
commit d4ccd54d28d3c8598e2354acc13e28c060961dbb upstream. Many Linux systems are configured to not panic on oops; but allowing an attacker to oops the system **really** often can make even bugs that look completely unexploitable exploitable (like NULL dereferences and such) if each crash elevates a refcount by one or a lock is taken in read mode, and this causes a counter to eventually overflow. The most interesting counters for this are 32 bits wide (like open-coded refcounts that don't use refcount_t). (The ldsem reader count on 32-bit platforms is just 16 bits, but probably nobody cares about 32-bit platforms that much nowadays.) So let's panic the system if the kernel is constantly oopsing. The speed of oopsing 2^32 times probably depends on several factors, like how long the stack trace is and which unwinder you're using; an empirically important one is whether your console is showing a graphical environment or a text console that oopses will be printed to. In a quick single-threaded benchmark, it looks like oopsing in a vfork() child with a very short stack trace only takes ~510 microseconds per run when a graphical console is active; but switching to a text console that oopses are printed to slows it down around 87x, to ~45 milliseconds per run. (Adding more threads makes this faster, but the actual oops printing happens under &die_lock on x86, so you can maybe speed this up by a factor of around 2 and then any further improvement gets eaten up by lock contention.) It looks like it would take around 8-12 days to overflow a 32-bit counter with repeated oopsing on a multi-core X86 system running a graphical environment; both me (in an X86 VM) and Seth (with a distro kernel on normal hardware in a standard configuration) got numbers in that ballpark. 12 days aren't *that* short on a desktop system, and you'd likely need much longer on a typical server system (assuming that people don't run graphical desktop environments on their servers), and this is a *very* noisy and violent approach to exploiting the kernel; and it also seems to take orders of magnitude longer on some machines, probably because stuff like EFI pstore will slow it down a ton if that's active. Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
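A sketch of inspecting and tuning the new limit; the kernel.* sysctl name and the default of 10000 are assumptions based on the related panic knobs, not values quoted in this entry:
    # read the current limit (assumed default: 10000 oopses before panic)
    sysctl kernel.oops_limit
    # per the "Allow oops_limit to be disabled" entry above, 0 disables the check
    sysctl -w kernel.oops_limit=0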
2023-01-18  iommu/amd: Fix ill-formed ivrs_ioapic, ivrs_hpet and ivrs_acpihid options  (Kim Phillips, 1 file, -5/+22)
[ Upstream commit 1198d2316dc4265a97d0e8445a22c7a6d17580a4 ] Currently, these options cause the following libkmod error: libkmod: ERROR ../libkmod/libkmod-config.c:489 kcmdline_parse_result: \ Ignoring bad option on kernel command line while parsing module \ name: 'ivrs_xxxx[XX:XX' Fix by introducing a new parameter format for these options and throw a warning for the deprecated format. Users are still allowed to omit the PCI Segment if zero. Adding a Link: to the reason why we're modding the syntax parsing in the driver and not in libkmod. Fixes: ca3bf5d47cec ("iommu/amd: Introduces ivrs_acpihid kernel parameter") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-modules/20200310082308.14318-2-lucas.demarchi@intel.com/ Reported-by: Kim Phillips <kim.phillips@amd.com> Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Link: https://lore.kernel.org/r/20220919155638.391481-2-kim.phillips@amd.com Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-18  iommu/amd: Add PCI segment support for ivrs_[ioapic/hpet/acpihid] commands  (Suravee Suthikulpanit, 1 file, -9/+25)
[ Upstream commit bbe3a106580c21bc883fb0c9fa3da01534392fe8 ] By default, PCI segment is zero and can be omitted. To support system with non-zero PCI segment ID, modify the parsing functions to allow PCI segment ID. Co-developed-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Link: https://lore.kernel.org/r/20220706113825.25582-33-vasant.hegde@amd.com Signed-off-by: Joerg Roedel <jroedel@suse.de> Stable-dep-of: 1198d2316dc4 ("iommu/amd: Fix ill-formed ivrs_ioapic, ivrs_hpet and ivrs_acpihid options") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-31  x86/bugs: Add "unknown" reporting for MMIO Stale Data  (Pawan Gupta, 1 file, -0/+14)
commit 7df548840c496b0141fb2404b889c346380c2b22 upstream. Older Intel CPUs that are not in the affected processor list for MMIO Stale Data vulnerabilities currently report "Not affected" in sysfs, which may not be correct. Vulnerability status for these older CPUs is unknown. Add known-not-affected CPUs to the whitelist. Report "unknown" mitigation status for CPUs that are not in blacklist, whitelist and also don't enumerate MSR ARCH_CAPABILITIES bits that reflect hardware immunity to MMIO Stale Data vulnerabilities. Mitigation is not deployed when the status is unknown. [ bp: Massage, fixup. ] Fixes: 8d50cdf8b834 ("x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data") Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Suggested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/a932c154772f2121794a5f2eded1a11013114711.1657846269.git.pawan.kumar.gupta@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31  net: Fix data-races around netdev_max_backlog  (Kuniyuki Iwashima, 1 file, -1/+1)
[ Upstream commit 5dcd08cd19912892586c6082d56718333e2d19db ] While reading netdev_max_backlog, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. While at it, we remove the unnecessary spaces in the doc. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-21  x86/bugs: Enable STIBP for IBPB mitigated RETBleed  (Kim Phillips, 1 file, -8/+21)
commit e6cfcdda8cbe81eaf821c897369a65fec987b404 upstream. AMD's "Technical Guidance for Mitigating Branch Type Confusion, Rev. 1.0 2022-07-12" whitepaper, under section 6.1.2 "IBPB On Privileged Mode Entry / SMT Safety" says: Similar to the Jmp2Ret mitigation, if the code on the sibling thread cannot be trusted, software should set STIBP to 1 or disable SMT to ensure SMT safety when using this mitigation. So, like already being done for retbleed=unret, and now also for retbleed=ibpb, force STIBP on machines that have it, and report its SMT vulnerability status accordingly. [ bp: Remove the "we" and remove "[AMD]" applicability parameter which doesn't work here. ] Fixes: 3ebc17006888 ("x86/bugs: Add retbleed=ibpb") Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org # 5.10, 5.15, 5.19 Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Link: https://lore.kernel.org/r/20220804192201.439596-1-kim.phillips@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-21  x86: Handle idle=nomwait cmdline properly for x86_idle  (Wyes Karny, 1 file, -6/+9)
[ Upstream commit 8bcedb4ce04750e1ccc9a6b6433387f6a9166a56 ] When kernel is booted with idle=nomwait do not use MWAIT as the default idle state. If the user boots the kernel with idle=nomwait, it is a clear direction to not use mwait as the default idle state. However, the current code does not take this into consideration while selecting the default idle state on x86. Fix it by checking for the idle=nomwait boot option in prefer_mwait_c1_over_halt(). Also update the documentation around idle=nomwait appropriately. [ dhansen: tweak commit message ] Signed-off-by: Wyes Karny <wyes.karny@amd.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lkml.kernel.org/r/fdc2dc2d0a1bc21c2f53d989ea2d2ee3ccbc0dbe.1654538381.git-series.wyes.karny@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
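For example:
    # tell the kernel not to use MWAIT for the default idle loop
    idle=nomwait
    # after boot, confirm the option is present
    grep -o 'idle=nomwait' /proc/cmdline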
2022-08-11  x86/speculation: Add RSB VM Exit protections  (Daniel Sneddon, 1 file, -0/+8)
commit 2b1299322016731d56807aa49254a5ea3080b6b3 upstream. tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as documented for RET instructions after VM exits. Mitigate it with a new one-entry RSB stuffing mechanism and a new LFENCE. == Background == Indirect Branch Restricted Speculation (IBRS) was designed to help mitigate Branch Target Injection and Speculative Store Bypass, i.e. Spectre, attacks. IBRS prevents software run in less privileged modes from affecting branch prediction in more privileged modes. IBRS requires the MSR to be written on every privilege level change. To overcome some of the performance issues of IBRS, Enhanced IBRS was introduced. eIBRS is an "always on" IBRS, in other words, just turn it on once instead of writing the MSR on every privilege level change. When eIBRS is enabled, more privileged modes should be protected from less privileged modes, including protecting VMMs from guests. == Problem == Here's a simplification of how guests are run on Linux' KVM: void run_kvm_guest(void) { // Prepare to run guest VMRESUME(); // Clean up after guest runs } The execution flow for that would look something like this to the processor: 1. Host-side: call run_kvm_guest() 2. Host-side: VMRESUME 3. Guest runs, does "CALL guest_function" 4. VM exit, host runs again 5. Host might make some "cleanup" function calls 6. Host-side: RET from run_kvm_guest() Now, when back on the host, there are a couple of possible scenarios of post-guest activity the host needs to do before executing host code: * on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not touched and Linux has to do a 32-entry stuffing. * on eIBRS hardware, VM exit with IBRS enabled, or restoring the host IBRS=1 shortly after VM exit, has a documented side effect of flushing the RSB except in this PBRSB situation where the software needs to stuff the last RSB entry "by hand". IOW, with eIBRS supported, host RET instructions should no longer be influenced by guest behavior after the host retires a single CALL instruction. However, if the RET instructions are "unbalanced" with CALLs after a VM exit as is the RET in #6, it might speculatively use the address for the instruction after the CALL in #3 as an RSB prediction. This is a problem since the (untrusted) guest controls this address. Balanced CALL/RET instruction pairs such as in step #5 are not affected. == Solution == The PBRSB issue affects a wide variety of Intel processors which support eIBRS. But not all of them need mitigation. Today, X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e., eIBRS systems which enable legacy IBRS explicitly. However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT and most of them need a new mitigation. Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT. The lighter-weight mitigation performs a CALL instruction which is immediately followed by a speculative execution barrier (INT3). This steers speculative execution to the barrier -- just like a retpoline -- which ensures that speculation can never reach an unbalanced RET. Then, ensure this CALL is retired before continuing execution with an LFENCE. In other words, the window of exposure is opened at VM exit where RET behavior is troublesome. While the window is open, force RSB predictions sampling for RET targets to a dead end at the INT3. 
Close the window with the LFENCE. There is a subset of eIBRS systems which are not vulnerable to PBRSB. Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB. Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO. [ bp: Massage, incorporate review comments from Andy Cooper. ] Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-03  docs/kernel-parameters: Update descriptions for "mitigations=" param with retbleed  (Eiichi Tsukata, 1 file, -0/+2)
commit ea304a8b89fd0d6cf94ee30cb139dc23d9f1a62f upstream. Update the descriptions for "mitigations=off" and "mitigations=auto,nosmt" with the respective retbleed= settings. Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: corbet@lwn.net Link: https://lore.kernel.org/r/20220728043907.165688-1-eiichi.tsukata@nutanix.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
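The two forms being documented, for reference:
    # disable all optional CPU mitigations, retbleed included
    mitigations=off
    # enable mitigations and disable SMT where a mitigation requires it
    mitigations=auto,nosmt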
2022-07-25  x86/bugs: Add retbleed=ibpb  (Peter Zijlstra, 1 file, -0/+3)
commit 3ebc170068885b6fc7bedda6c667bb2c4d533159 upstream. jmp2ret mitigates the easy-to-attack case at relatively low overhead. It mitigates the long speculation windows after a mispredicted RET, but it does not mitigate the short speculation window from arbitrary instruction boundaries. On Zen2, there is a chicken bit which needs setting, which mitigates "arbitrary instruction boundaries" down to just "basic block boundaries". But there is no fix for the short speculation window on basic block boundaries, other than to flush the entire BTB to evict all attacker predictions. On the spectrum of "fast & blurry" -> "safe", there is (on top of STIBP or no-SMT): 1) Nothing System wide open 2) jmp2ret May stop a script kiddy 3) jmp2ret+chickenbit Raises the bar rather further 4) IBPB Only thing which can count as "safe". Tentative numbers put IBPB-on-entry at a 2.5x hit on Zen2, and a 10x hit on Zen1 according to lmbench. [ bp: Fixup feature bit comments, document option, 32-bit build fix. ] Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> [bwh: Backported to 5.10: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-25  x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRS  (Pawan Gupta, 1 file, -0/+1)
commit 7c693f54c873691a4b7da05c7e0f74e67745d144 upstream. Extend spectre_v2= boot option with Kernel IBRS. [jpoimboe: no STIBP with IBRS] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
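For example:
    # select kernel IBRS for the Spectre v2 mitigation
    spectre_v2=ibrs
    # reported status after boot
    cat /sys/devices/system/cpu/vulnerabilities/spectre_v2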
2022-07-25  x86/bugs: Enable STIBP for JMP2RET  (Kim Phillips, 1 file, -5/+11)
commit e8ec1b6e08a2102d8755ccb06fa26d540f26a2fa upstream. For untrained return thunks to be fully effective, STIBP must be enabled or SMT disabled. Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-25  x86/bugs: Add AMD retbleed= boot parameter  (Alexandre Chartre, 1 file, -0/+15)
commit 7fbf47c7ce50b38a64576b150e7011ae73d54669 upstream. Add the "retbleed=<value>" boot parameter to select a mitigation for RETBleed. Possible values are "off", "auto" and "unret" (JMP2RET mitigation). The default value is "auto". Currently, "retbleed=auto" will select the unret mitigation on AMD and Hygon and no mitigation on Intel (JMP2RET is not effective on Intel). [peterz: rebase; add hygon] [jpoimboe: cleanups] Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
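A sketch of the parameter introduced here, together with the "ibpb" value added by the later entry above; the sysfs file name is assumed from the usual vulnerabilities/ naming:
    retbleed=auto     # default: pick a mitigation automatically
    retbleed=unret    # untrained return thunk (AMD/Hygon)
    retbleed=ibpb     # heavier IBPB-on-entry mitigation
    retbleed=off      # opt out entirely
    # reported status after boot (file name assumed)
    cat /sys/devices/system/cpu/vulnerabilities/retbleed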
2022-06-16  x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data  (Pawan Gupta, 1 file, -0/+36)
commit 8cb861e9e3c9a55099ad3d08e1a3b653d29c33ca upstream Processor MMIO Stale Data is a class of vulnerabilities that may expose data after an MMIO operation. For details please refer to Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst. These vulnerabilities are broadly categorized as: Device Register Partial Write (DRPW): Some endpoint MMIO registers incorrectly handle writes that are smaller than the register size. Instead of aborting the write or only copying the correct subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than specified by the write transaction may be written to the register. On some processors, this may expose stale data from the fill buffers of the core that created the write transaction. Shared Buffers Data Sampling (SBDS): After propagators may have moved data around the uncore and copied stale data into client core fill buffers, processors affected by MFBDS can leak data from the fill buffer. Shared Buffers Data Read (SBDR): It is similar to Shared Buffer Data Sampling (SBDS) except that the data is directly read into the architectural software-visible state. An attacker can use these vulnerabilities to extract data from CPU fill buffers using MDS and TAA methods. Mitigate it by clearing the CPU fill buffers using the VERW instruction before returning to a user or a guest. On CPUs not affected by MDS and TAA, user application cannot sample data from CPU fill buffers using MDS or TAA. A guest with MMIO access can still use DRPW or SBDR to extract data architecturally. Mitigate it with VERW instruction to clear fill buffers before VMENTER for MMIO capable guests. Add a kernel parameter mmio_stale_data={off|full|full,nosmt} to control the mitigation. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
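For example; the sysfs file name is assumed from the parameter name rather than quoted in this entry:
    # full mitigation, additionally disabling SMT
    mmio_stale_data=full,nosmt
    # opt out entirely
    mmio_stale_data=off
    # reported status after boot (file name assumed)
    cat /sys/devices/system/cpu/vulnerabilities/mmio_stale_data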
2022-06-16  Documentation: Add documentation for Processor MMIO Stale Data  (Pawan Gupta, 2 files, -0/+247)
commit 4419470191386456e0b8ed4eb06a70b0021798a6 upstream Add the admin guide for Processor MMIO stale data vulnerabilities. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-30  random: fix sysctl documentation nits  (Jason A. Donenfeld, 1 file, -4/+4)
commit 069c4ea6871c18bd368f27756e0f91ffb524a788 upstream. A semicolon was missing, and the almost-alphabetical-but-not ordering was confusing, so regroup these by category instead. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-30  random: treat bootloader trust toggle the same way as cpu trust toggle  (Jason A. Donenfeld, 1 file, -0/+6)
commit d97c68d178fbf8aaaf21b69b446f2dfb13909316 upstream. If CONFIG_RANDOM_TRUST_CPU is set, the RNG initializes using RDRAND. But, the user can disable (or enable) this behavior by setting `random.trust_cpu=0/1` on the kernel command line. This allows system builders to do reasonable things while avoiding howls from tinfoil hatters. (Or vice versa.) CONFIG_RANDOM_TRUST_BOOTLOADER is basically the same thing, but regards the seed passed via EFI or device tree, which might come from RDRAND or a TPM or somewhere else. In order to allow distros to more easily enable this while avoiding those same howls (or vice versa), this commit adds the corresponding `random.trust_bootloader=0/1` toggle. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Graham Christensen <graham@grahamc.com> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Link: https://github.com/NixOS/nixpkgs/pull/165355 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
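For example:
    # do not credit entropy passed in by the bootloader
    random.trust_bootloader=0
    # the existing CPU toggle that this mirrors
    random.trust_cpu=0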