summaryrefslogtreecommitdiff
path: root/Documentation/admin-guide
AgeCommit message (Collapse)AuthorFilesLines
2023-10-25Documentation: sysctl: align cells in second content columnBagas Sanjaya1-9/+9
commit 1faa34672f8a17a3e155e74bde9648564e9480d6 upstream. Stephen Rothwell reported htmldocs warning when merging net-next tree: Documentation/admin-guide/sysctl/net.rst:37: WARNING: Malformed table. Text in column margin in table line 4. ========= =================== = ========== ================== Directory Content Directory Content ========= =================== = ========== ================== 802 E802 protocol mptcp Multipath TCP appletalk Appletalk protocol netfilter Network Filter ax25 AX25 netrom NET/ROM bridge Bridging rose X.25 PLP layer core General parameter tipc TIPC ethernet Ethernet protocol unix Unix domain sockets ipv4 IP version 4 x25 X.25 protocol ipv6 IP version 6 ========= =================== = ========== ================== The warning above is caused by cells in second "Content" column of /proc/sys/net subdirectory table which are in column margin. Align these cells against the column header to fix the warning. Link: https://lore.kernel.org/linux-next/20220823134905.57ed08d5@canb.auug.org.au/ Fixes: 1202cdd665315c ("Remove DECnet support from kernel") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Link: https://lore.kernel.org/r/20220824035804.204322-1-bagasdotme@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-26x86/cpu: Rename srso_(.*)_alias to srso_alias_\1Peter Zijlstra1-2/+2
commit 42be649dd1f2eee6b1fb185f1a231b9494cf095f upstream. For a more consistent namespace. [ bp: Fixup names in the doc too. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230814121148.976236447@infradead.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11Documentation: security-bugs.rst: clarify CVE handlingGreg Kroah-Hartman1-7/+6
commit 3c1897ae4b6bc7cc586eda2feaa2cd68325ec29c upstream. The kernel security team does NOT assign CVEs, so document that properly and provide the "if you want one, ask MITRE for it" response that we give on a weekly basis in the document, so we don't have to constantly say it to everyone who asks. Link: https://lore.kernel.org/r/2023063022-retouch-kerosene-7e4a@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-11Documentation: security-bugs.rst: update preferences when dealing with the ↵Greg Kroah-Hartman1-14/+12
linux-distros group commit 4fee0915e649bd0cea56dece6d96f8f4643df33c upstream. Because the linux-distros group forces reporters to release information about reported bugs, and they impose arbitrary deadlines in having those bugs fixed despite not actually being kernel developers, the kernel security team recommends not interacting with them at all as this just causes confusion and the early-release of reported security problems. Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/2023063020-throat-pantyhose-f110@gregkh Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-08x86/srso: Add a Speculative RAS Overflow mitigationBorislav Petkov (AMD)3-0/+145
Upstream commit: fb3bd914b3ec28f5fb697ac55c4846ac2d542855 Add a mitigation for the speculative return address stack overflow vulnerability found on AMD processors. The mitigation works by ensuring all RET instructions speculate to a controlled location, similar to how speculation is controlled in the retpoline sequence. To accomplish this, the __x86_return_thunk forces the CPU to mispredict every function return using a 'safe return' sequence. To ensure the safety of this mitigation, the kernel must ensure that the safe return sequence is itself free from attacker interference. In Zen3 and Zen4, this is accomplished by creating a BTB alias between the untraining function srso_untrain_ret_alias() and the safe return function srso_safe_ret_alias() which results in evicting a potentially poisoned BTB entry and using that safe one for all function returns. In older Zen1 and Zen2, this is accomplished using a reinterpretation technique similar to Retbleed one: srso_untrain_ret() and srso_safe_ret(). Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-08Documentation/x86: Fix backwards on/off logic about YMM supportDave Hansen1-1/+1
commit 1b0fc0345f2852ffe54fb9ae0e12e2ee69ad6a20 upstream These options clearly turn *off* XSAVE YMM support. Correct the typo. Reported-by: Ben Hutchings <ben@decadent.org.uk> Fixes: 553a5c03e90a ("x86/speculation: Add force option to GDS mitigation") Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-08x86/speculation: Add force option to GDS mitigationDaniel Sneddon2-5/+21
commit 553a5c03e90a6087e88f8ff878335ef0621536fb upstream The Gather Data Sampling (GDS) vulnerability allows malicious software to infer stale data previously stored in vector registers. This may include sensitive data such as cryptographic keys. GDS is mitigated in microcode, and systems with up-to-date microcode are protected by default. However, any affected system that is running with older microcode will still be vulnerable to GDS attacks. Since the gather instructions used by the attacker are part of the AVX2 and AVX512 extensions, disabling these extensions prevents gather instructions from being executed, thereby mitigating the system from GDS. Disabling AVX2 is sufficient, but we don't have the granularity to do this. The XCR0[2] disables AVX, with no option to just disable AVX2. Add a kernel parameter gather_data_sampling=force that will enable the microcode mitigation if available, otherwise it will disable AVX on affected systems. This option will be ignored if cmdline mitigations=off. This is a *big* hammer. It is known to break buggy userspace that uses incomplete, buggy AVX enumeration. Unfortunately, such userspace does exist in the wild: https://www.mail-archive.com/bug-coreutils@gnu.org/msg33046.html [ dhansen: add some more ominous warnings about disabling AVX ] Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-08x86/speculation: Add Gather Data Sampling mitigationDaniel Sneddon3-10/+125
commit 8974eb588283b7d44a7c91fa09fcbaf380339f3a upstream Gather Data Sampling (GDS) is a hardware vulnerability which allows unprivileged speculative access to data which was previously stored in vector registers. Intel processors that support AVX2 and AVX512 have gather instructions that fetch non-contiguous data elements from memory. On vulnerable hardware, when a gather instruction is transiently executed and encounters a fault, stale data from architectural or internal vector registers may get transiently stored to the destination vector register allowing an attacker to infer the stale data using typical side channel techniques like cache timing attacks. This mitigation is different from many earlier ones for two reasons. First, it is enabled by default and a bit must be set to *DISABLE* it. This is the opposite of normal mitigation polarity. This means GDS can be mitigated simply by updating microcode and leaving the new control bit alone. Second, GDS has a "lock" bit. This lock bit is there because the mitigation affects the hardware security features KeyLocker and SGX. It needs to be enabled and *STAY* enabled for these features to be mitigated against GDS. The mitigation is enabled in the microcode by default. Disable it by setting gather_data_sampling=off or by disabling all mitigations with mitigations=off. The mitigation status can be checked by reading: /sys/devices/system/cpu/vulnerabilities/gather_data_sampling Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-28mm/pagealloc: sysctl: change watermark_scale_factor max limit to 30%Suren Baghdasaryan1-1/+1
[ Upstream commit 39c65a94cd9661532be150e88f8b02f4a6844a35 ] For embedded systems with low total memory, having to run applications with relatively large memory requirements, 10% max limitation for watermark_scale_factor poses an issue of triggering direct reclaim every time such application is started. This results in slow application startup times and bad end-user experience. By increasing watermark_scale_factor max limit we allow vendors more flexibility to choose the right level of kswapd aggressiveness for their device and workload requirements. Link: https://lkml.kernel.org/r/20211124193604.2758863-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Lukas Middendorf <kernel@tuxforce.de> Cc: Antti Palosaari <crope@iki.fi> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Zhang Yi <yi.zhang@huawei.com> Cc: Fengfei Xi <xi.fengfei@h3c.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Stable-dep-of: 935d44acf621 ("memfd: check for non-NULL file_seals in memfd_create() syscall") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-21Remove DECnet support from kernelStephen Hemminger2-11/+8
commit 1202cdd665315c525b5237e96e0bedc76d7e754f upstream. DECnet is an obsolete network protocol that receives more attention from kernel janitors than users. It belongs in computer protocol history museum not in Linux kernel. It has been "Orphaned" in kernel since 2010. The iproute2 support for DECnet was dropped in 5.0 release. The documentation link on Sourceforge says it is abandoned there as well. Leave the UAPI alone to keep userspace programs compiling. This means that there is still an empty neighbour table for AF_DECNET. The table of /proc/sys/net entries was updated to match current directories and reformatted to be alphabetical. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: David Ahern <dsahern@kernel.org> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11mm: memcontrol: deprecate charge movingJohannes Weiner1-2/+11
commit da34a8484d162585e22ed8c1e4114aa2f60e3567 upstream. Charge moving mode in cgroup1 allows memory to follow tasks as they migrate between cgroups. This is, and always has been, a questionable thing to do - for several reasons. First, it's expensive. Pages need to be identified, locked and isolated from various MM operations, and reassigned, one by one. Second, it's unreliable. Once pages are charged to a cgroup, there isn't always a clear owner task anymore. Cache isn't moved at all, for example. Mapped memory is moved - but if trylocking or isolating a page fails, it's arbitrarily left behind. Frequent moving between domains may leave a task's memory scattered all over the place. Third, it isn't really needed. Launcher tasks can kick off workload tasks directly in their target cgroup. Using dedicated per-workload groups allows fine-grained policy adjustments - no need to move tasks and their physical pages between control domains. The feature was never forward-ported to cgroup2, and it hasn't been missed. Despite it being a niche usecase, the maintenance overhead of supporting it is enormous. Because pages are moved while they are live and subject to various MM operations, the synchronization rules are complicated. There are lock_page_memcg() in MM and FS code, which non-cgroup people don't understand. In some cases we've been able to shift code and cgroup API calls around such that we can rely on native locking as much as possible. But that's fragile, and sometimes we need to hold MM locks for longer than we otherwise would (pte lock e.g.). Mark the feature deprecated. Hopefully we can remove it soon. And backport into -stable kernels so that people who develop against earlier kernels are warned about this deprecation as early as possible. [akpm@linux-foundation.org: fix memory.rst underlining] Link: https://lkml.kernel.org/r/Y5COd+qXwk/S+n8N@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11docs: gdbmacros: print newest recordJohn Ogness1-1/+1
commit f2e4cca2f670c8e52fbb551a295f2afc9aa2bd72 upstream. @head_id points to the newest record, but the printing loop exits when it increments to this value (before printing). Exit the printing loop after the newest record has been printed. The python-based function in scripts/gdb/linux/dmesg.py already does this correctly. Fixes: e60768311af8 ("scripts/gdb: update for lockless printk ringbuffer") Cc: stable@vger.kernel.org Signed-off-by: John Ogness <john.ogness@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20221229134339.197627-1-john.ogness@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-03-11Documentation/hw-vuln: Document the interaction between IBRS and STIBPKP Singh1-5/+16
commit e02b50ca442e88122e1302d4dbc1b71a4808c13f upstream. Explain why STIBP is needed with legacy IBRS as currently implemented (KERNEL_IBRS) and why STIBP is not needed when enhanced IBRS is enabled. Fixes: 7c693f54c873 ("x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRS") Signed-off-by: KP Singh <kpsingh@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20230227060541.1939092-2-kpsingh@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-02-01panic: Introduce warn_limitKees Cook1-0/+10
commit 9fc9e278a5c0b708eeffaf47d6eb0c82aa74ed78 upstream. Like oops_limit, add warn_limit for limiting the number of warnings when panic_on_warn is not set. Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Petr Mladek <pmladek@suse.com> Cc: tangmeng <tangmeng@uniontech.com> Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com> Cc: Tiezhu Yang <yangtiezhu@loongson.cn> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: linux-doc@vger.kernel.org Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-5-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01exit: Allow oops_limit to be disabledKees Cook1-2/+3
commit de92f65719cd672f4b48397540b9f9eff67eca40 upstream. In preparation for keeping oops_limit logic in sync with warn_limit, have oops_limit == 0 disable checking the Oops counter. Cc: Jann Horn <jannh@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: linux-doc@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-02-01exit: Put an upper limit on how often we can oopsJann Horn1-0/+8
commit d4ccd54d28d3c8598e2354acc13e28c060961dbb upstream. Many Linux systems are configured to not panic on oops; but allowing an attacker to oops the system **really** often can make even bugs that look completely unexploitable exploitable (like NULL dereferences and such) if each crash elevates a refcount by one or a lock is taken in read mode, and this causes a counter to eventually overflow. The most interesting counters for this are 32 bits wide (like open-coded refcounts that don't use refcount_t). (The ldsem reader count on 32-bit platforms is just 16 bits, but probably nobody cares about 32-bit platforms that much nowadays.) So let's panic the system if the kernel is constantly oopsing. The speed of oopsing 2^32 times probably depends on several factors, like how long the stack trace is and which unwinder you're using; an empirically important one is whether your console is showing a graphical environment or a text console that oopses will be printed to. In a quick single-threaded benchmark, it looks like oopsing in a vfork() child with a very short stack trace only takes ~510 microseconds per run when a graphical console is active; but switching to a text console that oopses are printed to slows it down around 87x, to ~45 milliseconds per run. (Adding more threads makes this faster, but the actual oops printing happens under &die_lock on x86, so you can maybe speed this up by a factor of around 2 and then any further improvement gets eaten up by lock contention.) It looks like it would take around 8-12 days to overflow a 32-bit counter with repeated oopsing on a multi-core X86 system running a graphical environment; both me (in an X86 VM) and Seth (with a distro kernel on normal hardware in a standard configuration) got numbers in that ballpark. 12 days aren't *that* short on a desktop system, and you'd likely need much longer on a typical server system (assuming that people don't run graphical desktop environments on their servers), and this is a *very* noisy and violent approach to exploiting the kernel; and it also seems to take orders of magnitude longer on some machines, probably because stuff like EFI pstore will slow it down a ton if that's active. Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-18iommu/amd: Fix ill-formed ivrs_ioapic, ivrs_hpet and ivrs_acpihid optionsKim Phillips1-5/+22
[ Upstream commit 1198d2316dc4265a97d0e8445a22c7a6d17580a4 ] Currently, these options cause the following libkmod error: libkmod: ERROR ../libkmod/libkmod-config.c:489 kcmdline_parse_result: \ Ignoring bad option on kernel command line while parsing module \ name: 'ivrs_xxxx[XX:XX' Fix by introducing a new parameter format for these options and throw a warning for the deprecated format. Users are still allowed to omit the PCI Segment if zero. Adding a Link: to the reason why we're modding the syntax parsing in the driver and not in libkmod. Fixes: ca3bf5d47cec ("iommu/amd: Introduces ivrs_acpihid kernel parameter") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-modules/20200310082308.14318-2-lucas.demarchi@intel.com/ Reported-by: Kim Phillips <kim.phillips@amd.com> Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Link: https://lore.kernel.org/r/20220919155638.391481-2-kim.phillips@amd.com Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-01-18iommu/amd: Add PCI segment support for ivrs_[ioapic/hpet/acpihid] commandsSuravee Suthikulpanit1-9/+25
[ Upstream commit bbe3a106580c21bc883fb0c9fa3da01534392fe8 ] By default, PCI segment is zero and can be omitted. To support system with non-zero PCI segment ID, modify the parsing functions to allow PCI segment ID. Co-developed-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Link: https://lore.kernel.org/r/20220706113825.25582-33-vasant.hegde@amd.com Signed-off-by: Joerg Roedel <jroedel@suse.de> Stable-dep-of: 1198d2316dc4 ("iommu/amd: Fix ill-formed ivrs_ioapic, ivrs_hpet and ivrs_acpihid options") Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-31x86/bugs: Add "unknown" reporting for MMIO Stale DataPawan Gupta1-0/+14
commit 7df548840c496b0141fb2404b889c346380c2b22 upstream. Older Intel CPUs that are not in the affected processor list for MMIO Stale Data vulnerabilities currently report "Not affected" in sysfs, which may not be correct. Vulnerability status for these older CPUs is unknown. Add known-not-affected CPUs to the whitelist. Report "unknown" mitigation status for CPUs that are not in blacklist, whitelist and also don't enumerate MSR ARCH_CAPABILITIES bits that reflect hardware immunity to MMIO Stale Data vulnerabilities. Mitigation is not deployed when the status is unknown. [ bp: Massage, fixup. ] Fixes: 8d50cdf8b834 ("x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data") Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Suggested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/a932c154772f2121794a5f2eded1a11013114711.1657846269.git.pawan.kumar.gupta@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-31net: Fix data-races around netdev_max_backlog.Kuniyuki Iwashima1-1/+1
[ Upstream commit 5dcd08cd19912892586c6082d56718333e2d19db ] While reading netdev_max_backlog, it can be changed concurrently. Thus, we need to add READ_ONCE() to its readers. While at it, we remove the unnecessary spaces in the doc. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-21x86/bugs: Enable STIBP for IBPB mitigated RETBleedKim Phillips1-8/+21
commit e6cfcdda8cbe81eaf821c897369a65fec987b404 upstream. AMD's "Technical Guidance for Mitigating Branch Type Confusion, Rev. 1.0 2022-07-12" whitepaper, under section 6.1.2 "IBPB On Privileged Mode Entry / SMT Safety" says: Similar to the Jmp2Ret mitigation, if the code on the sibling thread cannot be trusted, software should set STIBP to 1 or disable SMT to ensure SMT safety when using this mitigation. So, like already being done for retbleed=unret, and now also for retbleed=ibpb, force STIBP on machines that have it, and report its SMT vulnerability status accordingly. [ bp: Remove the "we" and remove "[AMD]" applicability parameter which doesn't work here. ] Fixes: 3ebc17006888 ("x86/bugs: Add retbleed=ibpb") Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org # 5.10, 5.15, 5.19 Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Link: https://lore.kernel.org/r/20220804192201.439596-1-kim.phillips@amd.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-21x86: Handle idle=nomwait cmdline properly for x86_idleWyes Karny1-6/+9
[ Upstream commit 8bcedb4ce04750e1ccc9a6b6433387f6a9166a56 ] When kernel is booted with idle=nomwait do not use MWAIT as the default idle state. If the user boots the kernel with idle=nomwait, it is a clear direction to not use mwait as the default idle state. However, the current code does not take this into consideration while selecting the default idle state on x86. Fix it by checking for the idle=nomwait boot option in prefer_mwait_c1_over_halt(). Also update the documentation around idle=nomwait appropriately. [ dhansen: tweak commit message ] Signed-off-by: Wyes Karny <wyes.karny@amd.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Zhang Rui <rui.zhang@intel.com> Link: https://lkml.kernel.org/r/fdc2dc2d0a1bc21c2f53d989ea2d2ee3ccbc0dbe.1654538381.git-series.wyes.karny@amd.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2022-08-11x86/speculation: Add RSB VM Exit protectionsDaniel Sneddon1-0/+8
commit 2b1299322016731d56807aa49254a5ea3080b6b3 upstream. tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as documented for RET instructions after VM exits. Mitigate it with a new one-entry RSB stuffing mechanism and a new LFENCE. == Background == Indirect Branch Restricted Speculation (IBRS) was designed to help mitigate Branch Target Injection and Speculative Store Bypass, i.e. Spectre, attacks. IBRS prevents software run in less privileged modes from affecting branch prediction in more privileged modes. IBRS requires the MSR to be written on every privilege level change. To overcome some of the performance issues of IBRS, Enhanced IBRS was introduced. eIBRS is an "always on" IBRS, in other words, just turn it on once instead of writing the MSR on every privilege level change. When eIBRS is enabled, more privileged modes should be protected from less privileged modes, including protecting VMMs from guests. == Problem == Here's a simplification of how guests are run on Linux' KVM: void run_kvm_guest(void) { // Prepare to run guest VMRESUME(); // Clean up after guest runs } The execution flow for that would look something like this to the processor: 1. Host-side: call run_kvm_guest() 2. Host-side: VMRESUME 3. Guest runs, does "CALL guest_function" 4. VM exit, host runs again 5. Host might make some "cleanup" function calls 6. Host-side: RET from run_kvm_guest() Now, when back on the host, there are a couple of possible scenarios of post-guest activity the host needs to do before executing host code: * on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not touched and Linux has to do a 32-entry stuffing. * on eIBRS hardware, VM exit with IBRS enabled, or restoring the host IBRS=1 shortly after VM exit, has a documented side effect of flushing the RSB except in this PBRSB situation where the software needs to stuff the last RSB entry "by hand". IOW, with eIBRS supported, host RET instructions should no longer be influenced by guest behavior after the host retires a single CALL instruction. However, if the RET instructions are "unbalanced" with CALLs after a VM exit as is the RET in #6, it might speculatively use the address for the instruction after the CALL in #3 as an RSB prediction. This is a problem since the (untrusted) guest controls this address. Balanced CALL/RET instruction pairs such as in step #5 are not affected. == Solution == The PBRSB issue affects a wide variety of Intel processors which support eIBRS. But not all of them need mitigation. Today, X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e., eIBRS systems which enable legacy IBRS explicitly. However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT and most of them need a new mitigation. Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT. The lighter-weight mitigation performs a CALL instruction which is immediately followed by a speculative execution barrier (INT3). This steers speculative execution to the barrier -- just like a retpoline -- which ensures that speculation can never reach an unbalanced RET. Then, ensure this CALL is retired before continuing execution with an LFENCE. In other words, the window of exposure is opened at VM exit where RET behavior is troublesome. While the window is open, force RSB predictions sampling for RET targets to a dead end at the INT3. Close the window with the LFENCE. There is a subset of eIBRS systems which are not vulnerable to PBRSB. Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB. Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO. [ bp: Massage, incorporate review comments from Andy Cooper. ] Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-08-03docs/kernel-parameters: Update descriptions for "mitigations=" param with ↵Eiichi Tsukata1-0/+2
retbleed commit ea304a8b89fd0d6cf94ee30cb139dc23d9f1a62f upstream. Updates descriptions for "mitigations=off" and "mitigations=auto,nosmt" with the respective retbleed= settings. Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: corbet@lwn.net Link: https://lore.kernel.org/r/20220728043907.165688-1-eiichi.tsukata@nutanix.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-25x86/bugs: Add retbleed=ibpbPeter Zijlstra1-0/+3
commit 3ebc170068885b6fc7bedda6c667bb2c4d533159 upstream. jmp2ret mitigates the easy-to-attack case at relatively low overhead. It mitigates the long speculation windows after a mispredicted RET, but it does not mitigate the short speculation window from arbitrary instruction boundaries. On Zen2, there is a chicken bit which needs setting, which mitigates "arbitrary instruction boundaries" down to just "basic block boundaries". But there is no fix for the short speculation window on basic block boundaries, other than to flush the entire BTB to evict all attacker predictions. On the spectrum of "fast & blurry" -> "safe", there is (on top of STIBP or no-SMT): 1) Nothing System wide open 2) jmp2ret May stop a script kiddy 3) jmp2ret+chickenbit Raises the bar rather further 4) IBPB Only thing which can count as "safe". Tentative numbers put IBPB-on-entry at a 2.5x hit on Zen2, and a 10x hit on Zen1 according to lmbench. [ bp: Fixup feature bit comments, document option, 32-bit build fix. ] Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> [bwh: Backported to 5.10: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-25x86/speculation: Add spectre_v2=ibrs option to support Kernel IBRSPawan Gupta1-0/+1
commit 7c693f54c873691a4b7da05c7e0f74e67745d144 upstream. Extend spectre_v2= boot option with Kernel IBRS. [jpoimboe: no STIBP with IBRS] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-25x86/bugs: Enable STIBP for JMP2RETKim Phillips1-5/+11
commit e8ec1b6e08a2102d8755ccb06fa26d540f26a2fa upstream. For untrained return thunks to be fully effective, STIBP must be enabled or SMT disabled. Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-07-25x86/bugs: Add AMD retbleed= boot parameterAlexandre Chartre1-0/+15
commit 7fbf47c7ce50b38a64576b150e7011ae73d54669 upstream. Add the "retbleed=<value>" boot parameter to select a mitigation for RETBleed. Possible values are "off", "auto" and "unret" (JMP2RET mitigation). The default value is "auto". Currently, "retbleed=auto" will select the unret mitigation on AMD and Hygon and no mitigation on Intel (JMP2RET is not effective on Intel). [peterz: rebase; add hygon] [jpoimboe: cleanups] Signed-off-by: Alexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-16x86/speculation/mmio: Add mitigation for Processor MMIO Stale DataPawan Gupta1-0/+36
commit 8cb861e9e3c9a55099ad3d08e1a3b653d29c33ca upstream Processor MMIO Stale Data is a class of vulnerabilities that may expose data after an MMIO operation. For details please refer to Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst. These vulnerabilities are broadly categorized as: Device Register Partial Write (DRPW): Some endpoint MMIO registers incorrectly handle writes that are smaller than the register size. Instead of aborting the write or only copying the correct subset of bytes (for example, 2 bytes for a 2-byte write), more bytes than specified by the write transaction may be written to the register. On some processors, this may expose stale data from the fill buffers of the core that created the write transaction. Shared Buffers Data Sampling (SBDS): After propagators may have moved data around the uncore and copied stale data into client core fill buffers, processors affected by MFBDS can leak data from the fill buffer. Shared Buffers Data Read (SBDR): It is similar to Shared Buffer Data Sampling (SBDS) except that the data is directly read into the architectural software-visible state. An attacker can use these vulnerabilities to extract data from CPU fill buffers using MDS and TAA methods. Mitigate it by clearing the CPU fill buffers using the VERW instruction before returning to a user or a guest. On CPUs not affected by MDS and TAA, user application cannot sample data from CPU fill buffers using MDS or TAA. A guest with MMIO access can still use DRPW or SBDR to extract data architecturally. Mitigate it with VERW instruction to clear fill buffers before VMENTER for MMIO capable guests. Add a kernel parameter mmio_stale_data={off|full|full,nosmt} to control the mitigation. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-06-16Documentation: Add documentation for Processor MMIO Stale DataPawan Gupta2-0/+247
commit 4419470191386456e0b8ed4eb06a70b0021798a6 upstream Add the admin guide for Processor MMIO stale data vulnerabilities. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-30random: fix sysctl documentation nitsJason A. Donenfeld1-4/+4
commit 069c4ea6871c18bd368f27756e0f91ffb524a788 upstream. A semicolon was missing, and the almost-alphabetical-but-not ordering was confusing, so regroup these by category instead. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-30random: treat bootloader trust toggle the same way as cpu trust toggleJason A. Donenfeld1-0/+6
commit d97c68d178fbf8aaaf21b69b446f2dfb13909316 upstream. If CONFIG_RANDOM_TRUST_CPU is set, the RNG initializes using RDRAND. But, the user can disable (or enable) this behavior by setting `random.trust_cpu=0/1` on the kernel command line. This allows system builders to do reasonable things while avoiding howls from tinfoil hatters. (Or vice versa.) CONFIG_RANDOM_TRUST_BOOTLOADER is basically the same thing, but regards the seed passed via EFI or device tree, which might come from RDRAND or a TPM or somewhere else. In order to allow distros to more easily enable this while avoiding those same howls (or vice versa), this commit adds the corresponding `random.trust_bootloader=0/1` toggle. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Graham Christensen <graham@grahamc.com> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Link: https://github.com/NixOS/nixpkgs/pull/165355 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-30random: remove ifdef'd out interrupt benchJason A. Donenfeld1-9/+0
commit 95e6060c20a7f5db60163274c5222a725ac118f9 upstream. With tools like kbench9000 giving more finegrained responses, and this basically never having been used ever since it was initially added, let's just get rid of this. There *is* still work to be done on the interrupt handler, but this really isn't the way it's being developed. Cc: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-05-30random: always wake up entropy writers after extractionJason A. Donenfeld1-2/+5
commit 489c7fc44b5740d377e8cfdbf0851036e493af00 upstream. Now that POOL_BITS == POOL_MIN_BITS, we must unconditionally wake up entropy writers after every extraction. Therefore there's no point of write_wakeup_threshold, so we can move it to the dustbin of unused compatibility sysctls. While we're at it, we can fix a small comparison where we were waking up after <= min rather than < min. Cc: Theodore Ts'o <tytso@mit.edu> Suggested-by: Eric Biggers <ebiggers@kernel.org> Reviewed-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-04-08docs: sysctl/kernel: add missing bit to panic_printGuilherme G. Piccoli1-0/+1
commit a1ff1de00db21ecb956213f046b79741b64c6b65 upstream. Patch series "Some improvements on panic_print". This is a mix of a documentation fix with some additions to the "panic_print" syscall / parameter. The goal here is being able to collect all CPUs backtraces during a panic event and also to enable "panic_print" in a kdump event - details of the reasoning and design choices in the patches. This patch (of 3): Commit de6da1e8bcf0 ("panic: add an option to replay all the printk message in buffer") added a new bit to the sysctl/kernel parameter "panic_print", but the documentation was added only in kernel-parameters.txt, not in the sysctl guide. Fix it here by adding bit 5 to sysctl admin-guide documentation. [rdunlap@infradead.org: fix table format warning] Link: https://lkml.kernel.org/r/20220109055635.6999-1-rdunlap@infradead.org Link: https://lkml.kernel.org/r/20211109202848.610874-1-gpiccoli@igalia.com Link: https://lkml.kernel.org/r/20211109202848.610874-2-gpiccoli@igalia.com Fixes: de6da1e8bcf0 ("panic: add an option to replay all the printk message in buffer") Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Reviewed-by: Feng Tang <feng.tang@intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Samuel Iglesias Gonsalvez <siglesias@igalia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11x86/speculation: Update link to AMD speculation whitepaperKim Phillips1-3/+3
commit e9b6013a7ce31535b04b02ba99babefe8a8599fa upstream. Update the link to the "Software Techniques for Managing Speculation on AMD Processors" whitepaper. Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-03-11Documentation/hw-vuln: Update spectre docPeter Zijlstra2-15/+35
commit 5ad3eb1132453b9795ce5fd4572b1c18b292cca9 upstream. Update the doc with the new fun. [ bp: Massage commit message. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> [fllinden@amazon.com: backported to 5.10] Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-27Documentation: refer to config RANDOMIZE_BASE for kernel address-space ↵Lukas Bulwahn1-1/+1
randomization commit 82ca67321f55a8d1da6ac3ed611da3c32818bb37 upstream. The config RANDOMIZE_SLAB does not exist, the authors probably intended to refer to the config RANDOMIZE_BASE, which provides kernel address-space randomization. They probably just confused SLAB with BASE (these two four-letter words coincidentally share three common letters), as they also point out the config SLAB_FREELIST_RANDOM as further randomization within the same sentence. Fix the reference of the config for kernel address-space randomization to the config that provides that. Fixes: 6e88559470f5 ("Documentation: Add section about CPU vulnerabilities for Spectre") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Link: https://lore.kernel.org/r/20211230171940.27558-1-lukas.bulwahn@gmail.com Signed-off-by: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-05bpf: Add kconfig knob for disabling unpriv bpf by defaultDaniel Borkmann1-3/+14
commit 08389d888287c3823f80b0216766b71e17f0aba5 upstream. Add a kconfig knob which allows for unprivileged bpf to be disabled by default. If set, the knob sets /proc/sys/kernel/unprivileged_bpf_disabled to value of 2. This still allows a transition of 2 -> {0,1} through an admin. Similarly, this also still keeps 1 -> {1} behavior intact, so that once set to permanently disabled, it cannot be undone aside from a reboot. We've also added extra2 with max of 2 for the procfs handler, so that an admin still has a chance to toggle between 0 <-> 2. Either way, as an additional alternative, applications can make use of CAP_BPF that we added a while ago. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/74ec548079189e4e4dffaeb42b8987bb3c852eee.1620765074.git.daniel@iogearbox.net Cc: Salvatore Bonaccorso <carnil@debian.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-01-05Input: i8042 - add deferred probe supportTakashi Iwai1-0/+2
[ Upstream commit 9222ba68c3f4065f6364b99cc641b6b019ef2d42 ] We've got a bug report about the non-working keyboard on ASUS ZenBook UX425UA. It seems that the PS/2 device isn't ready immediately at boot but takes some seconds to get ready. Until now, the only workaround is to defer the probe, but it's available only when the driver is a module. However, many distros, including openSUSE as in the original report, build the PS/2 input drivers into kernel, hence it won't work easily. This patch adds the support for the deferred probe for i8042 stuff as a workaround of the problem above. When the deferred probe mode is enabled and the device couldn't be probed, it'll be repeated with the standard deferred probe mechanism. The deferred probe mode is enabled either via the new option i8042.probe_defer or via the quirk table entry. As of this patch, the quirk table contains only ASUS ZenBook UX425UA. The deferred probe part is based on Fabio's initial work. BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1190256 Signed-off-by: Takashi Iwai <tiwai@suse.de> Tested-by: Samuel Čavoj <samuel@cavoj.net> Link: https://lore.kernel.org/r/20211117063757.11380-1-tiwai@suse.de Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-12-29KVM: VMX: Fix stale docs for kvm-intel.emulate_invalid_guest_stateSean Christopherson1-2/+6
commit 0ff29701ffad9a5d5a24344d8b09f3af7b96ffda upstream. Update the documentation for kvm-intel's emulate_invalid_guest_state to rectify the description of KVM's default behavior, and to document that the behavior and thus parameter only applies to L1. Fixes: a27685c33acc ("KVM: VMX: Emulate invalid guest state by default") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211207193006.120997-4-seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-11-18xen/balloon: add late_initcall_sync() for initial ballooning doneJuergen Gross1-0/+7
commit 40fdea0284bb20814399da0484a658a96c735d90 upstream. When running as PVH or HVM guest with actual memory < max memory the hypervisor is using "populate on demand" in order to allow the guest to balloon down from its maximum memory size. For this to work correctly the guest must not touch more memory pages than its target memory size as otherwise the PoD cache will be exhausted and the guest is crashed as a result of that. In extreme cases ballooning down might not be finished today before the init process is started, which can consume lots of memory. In order to avoid random boot crashes in such cases, add a late init call to wait for ballooning down having finished for PVH/HVM guests. Warn on console if initial ballooning fails, panic() after stalling for more than 3 minutes per default. Add a module parameter for changing this timeout. [boris: replaced pr_info() with pr_notice()] Cc: <stable@vger.kernel.org> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20211102091944.17487-1-jgross@suse.com Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-18docs: Fix infiniband uverbs minor numberLeon Romanovsky1-3/+3
[ Upstream commit 8d7e415d55610d503fdb8815344846b72d194a40 ] Starting from the beginning of infiniband subsystem, the uverbs char devices start from 192 as a minor number, see commit bc38a6abdd5a ("[PATCH] IB uverbs: core implementation"). This patch updates the admin guide documentation to reflect it. Fixes: 9d85025b0418 ("docs-rst: create an user's manual book") Link: https://lore.kernel.org/r/bad03e6bcde45550c01e12908a6fe7dfa4770703.1627477347.git.leonro@nvidia.com Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-07-14clocksource: Retry clock read if long delays detectedPaul E. McKenney1-0/+6
[ Upstream commit db3a34e17433de2390eb80d436970edcebd0ca3e ] When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. Therefore, re-read the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_cswd_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If more than one retry was required, a message is printed on the console (the occasional single retry is expected behavior, especially in guest OSes). If the maximum number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Reported-by: Chris Mason <clm@fb.com> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Feng Tang <feng.tang@intel.com> Link: https://lore.kernel.org/r/20210527190124.440372-1-paulmck@kernel.org Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-17drivers/base/memory: don't store phys_device in memory blocksDavid Hildenbrand1-2/+2
[ Upstream commit e9a2e48e8704c9d20a625c6f2357147d03ea7b97 ] No need to store the value for each and every memory block, as we can easily query the value at runtime. Reshuffle the members to optimize the memory layout. Also, let's clarify what the interface once was used for and why it's legacy nowadays. "phys_device" was used on s390x in older versions of lsmem[2]/chmem[3], back when they were still part of s390x-tools. They were later replaced by the variants in linux-utils. For example, RHEL6 and RHEL7 contain lsmem/chmem from s390-utils. RHEL8 switched to versions from util-linux on s390x [4]. "phys_device" was added with sysfs support for memory hotplug in commit 3947be1969a9 ("[PATCH] memory hotplug: sysfs and add/remove functions") in 2005. It always returned 0. s390x started returning something != 0 on some setups (if sclp.rzm is set by HW) in 2010 via commit 57b552ba0b2f ("memory hotplug/s390: set phys_device"). For s390x, it allowed for identifying which memory block devices belong to the same storage increment (RZM). Only if all memory block devices comprising a single storage increment were offline, the memory could actually be removed in the hypervisor. Since commit e5d709bb5fb7 ("s390/memory hotplug: provide memory_block_size_bytes() function") in 2013 a memory block device spans at least one storage increment - which is why the interface isn't really helpful/used anymore (except by old lsmem/chmem tools). There were once RFC patches to make use of "phys_device" in ACPI context; however, the underlying problem could be solved using different interfaces [1]. [1] https://patchwork.kernel.org/patch/2163871/ [2] https://github.com/ibm-s390-tools/s390-tools/blob/v2.1.0/zconf/lsmem [3] https://github.com/ibm-s390-tools/s390-tools/blob/v2.1.0/zconf/chmem [4] https://bugzilla.redhat.com/show_bug.cgi?id=1504134 Link: https://lkml.kernel.org/r/20210201181347.13262-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Vaibhav Jain <vaibhav@linux.ibm.com> Cc: Tom Rix <trix@redhat.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-03-04mm/vmscan: restore zone_reclaim_mode ABIDave Hansen1-5/+5
commit 519983645a9f2ec339cabfa0c6ef7b09be985dd0 upstream. I went to go add a new RECLAIM_* mode for the zone_reclaim_mode sysctl. Like a good kernel developer, I also went to go update the documentation. I noticed that the bits in the documentation didn't match the bits in the #defines. The VM never explicitly checks the RECLAIM_ZONE bit. The bit is, however implicitly checked when checking 'node_reclaim_mode==0'. The RECLAIM_ZONE #define was removed in a cleanup. That, by itself is fine. But, when the bit was removed (bit 0) the _other_ bit locations also got changed. That's not OK because the bit values are documented to mean one specific thing. Users surely do not expect the meaning to change from kernel to kernel. The end result is that if someone had a script that did: sysctl vm.zone_reclaim_mode=1 it would have gone from enabling node reclaim for clean unmapped pages to writing out pages during node reclaim after the commit in question. That's not great. Put the bits back the way they were and add a comment so something like this is a bit harder to do again. Update the documentation to make it clear that the first bit is ignored. Link: https://lkml.kernel.org/r/20210219172555.FF0CDF23@viggo.jf.intel.com Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Fixes: 648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE") Reviewed-by: Ben Widawsky <ben.widawsky@intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Daniel Wagner <dwagner@suse.de> Cc: "Tobin C. Harding" <tobin@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Qian Cai <cai@lca.pw> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-03-04perf/arm-cmn: Fix PMU instance namingRobin Murphy1-1/+1
[ Upstream commit 79d7c3dca99fa96033695ddf5d495b775a3a137b ] Although it's neat to avoid the suffix for the typical case of a single PMU, it means systems with multiple CMN instances end up with inconsistent naming. I think it also breaks perf tool's "uncore alias" logic if the common instance prefix is also the full name of one. Avoid any surprises by not trying to be clever and simply numbering every instance, even when it might technically prove redundant. Fixes: 0ba64770a2f2 ("perf: Add Arm CMN-600 PMU driver") Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/649a2281233f193d59240b13ed91b57337c77b32.1611839564.git.robin.murphy@arm.com Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-27x86/xen: Add xen_no_vector_callback option to test PCI INTX deliveryDavid Woodhouse1-0/+4
[ Upstream commit b36b0fe96af13460278bf9b173beced1bd15f85d ] It's useful to be able to test non-vector event channel delivery, to make sure Linux will work properly on older Xen which doesn't have it. It's also useful for those working on Xen and Xen-compatible hypervisors, because there are guest kernels still in active use which use PCI INTX even when vector delivery is available. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Link: https://lore.kernel.org/r/20210106153958.584169-4-dwmw2@infradead.org Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-01-27dm integrity: conditionally disable "recalculate" featureMikulas Patocka1-3/+9
commit 5c02406428d5219c367c5f53457698c58bc5f917 upstream. Otherwise a malicious user could (ab)use the "recalculate" feature that makes dm-integrity calculate the checksums in the background while the device is already usable. When the system restarts before all checksums have been calculated, the calculation continues where it was interrupted even if the recalculate feature is not requested the next time the dm device is set up. Disable recalculating if we use internal_hash or journal_hash with a key (e.g. HMAC) and we don't have the "legacy_recalculate" flag. This may break activation of a volume, created by an older kernel, that is not yet fully recalculated -- if this happens, the user should add the "legacy_recalculate" flag to constructor parameters. Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Daniel Glockner <dg@emlix.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-12-21USB: UAS: introduce a quirk to set no_write_sameOliver Neukum1-0/+1
commit 8010622c86ca5bb44bc98492f5968726fc7c7a21 upstream. UAS does not share the pessimistic assumption storage is making that devices cannot deal with WRITE_SAME. A few devices supported by UAS, are reported to not deal well with WRITE_SAME. Those need a quirk. Add it to the device that needs it. Reported-by: David C. Partridge <david.partridge@perdrix.co.uk> Signed-off-by: Oliver Neukum <oneukum@suse.com> Cc: stable <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20201209152639.9195-1-oneukum@suse.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>