Age | Commit message (Collapse) | Author | Files | Lines |
|
commit fd470a8beed88440b160d690344fbae05a0b9b1b upstream.
Unlike Intel's Enhanced IBRS feature, AMD's Automatic IBRS does not
provide protection to processes running at CPL3/user mode, see section
"Extended Feature Enable Register (EFER)" in the APM v2 at
https://bugzilla.kernel.org/attachment.cgi?id=304652
Explicitly enable STIBP to protect against cross-thread CPL3
branch target injections on systems with Automatic IBRS enabled.
Also update the relevant documentation.
Fixes: e7862eda309e ("x86/cpu: Support AMD Automatic IBRS")
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230720194727.67022-1-kim.phillips@amd.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3ba2e83334bed2b1980b59734e6e84dfaf96026c upstream.
AMD systems from Family 10h to 16h share MCA bank 4 across multiple CPUs.
Therefore, the threshold_bank structure for bank 4, and its threshold_block
structures, will be initialized once at boot time. And the kobject for the
shared bank will be added to each of the CPUs that share it. Furthermore,
the threshold_blocks for the shared bank will be added again to the bank's
kobject. These additions will increase the refcount for the bank's kobject.
For example, a shared bank with two blocks and shared across two CPUs will
be set up like this:
CPU0 init
bank create and add; bank refcount = 1; threshold_create_bank()
block 0 init and add; bank refcount = 2; allocate_threshold_blocks()
block 1 init and add; bank refcount = 3; allocate_threshold_blocks()
CPU1 init
bank add; bank refcount = 3; threshold_create_bank()
block 0 add; bank refcount = 4; __threshold_add_blocks()
block 1 add; bank refcount = 5; __threshold_add_blocks()
Currently in threshold_remove_bank(), if the bank is shared then
__threshold_remove_blocks() is called. Here the shared bank's kobject and
the bank's blocks' kobjects are deleted. This is done on the first call
even while the structures are still shared. Subsequent calls from other
CPUs that share the structures will attempt to delete the kobjects.
During kobject_del(), kobject->sd is removed. If the kobject is not part of
a kset with default_groups, then subsequent kobject_del() calls seem safe
even with kobject->sd == NULL.
Originally, the AMD MCA thresholding structures did not use default_groups.
And so the above behavior was not apparent.
However, a recent change implemented default_groups for the thresholding
structures. Therefore, kobject_del() will go down the sysfs_remove_groups()
code path. In this case, the first kobject_del() may succeed and remove
kobject->sd. But subsequent kobject_del() calls will give a WARNing in
kernfs_remove_by_name_ns() since kobject->sd == NULL.
Use kobject_put() on the shared bank's kobject when "removing" blocks. This
decrements the bank's refcount while keeping kobjects enabled until the
bank is no longer shared. At that point, kobject_put() will be called on
the blocks which drives their refcount to 0 and deletes them and also
decrementing the bank's refcount. And finally kobject_put() will be called
on the bank driving its refcount to 0 and deleting it.
The same example above:
CPU1 shutdown
bank is shared; bank refcount = 5; threshold_remove_bank()
block 0 put parent bank; bank refcount = 4; __threshold_remove_blocks()
block 1 put parent bank; bank refcount = 3; __threshold_remove_blocks()
CPU0 shutdown
bank is no longer shared; bank refcount = 3; threshold_remove_bank()
block 0 put block; bank refcount = 2; deallocate_threshold_blocks()
block 1 put block; bank refcount = 1; deallocate_threshold_blocks()
put bank; bank refcount = 0; threshold_remove_bank()
Fixes: 7f99cb5e6039 ("x86/CPU/AMD: Use default_groups in kobj_type")
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/alpine.LRH.2.02.2205301145540.25840@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 26a0652cb453c72f6aab0974bc4939e9b14f886b upstream.
Reject KVM_SET_SREGS{2} with -EINVAL if the incoming CR0 is invalid,
e.g. due to setting bits 63:32, illegal combinations, or to a value that
isn't allowed in VMX (non-)root mode. The VMX checks in particular are
"fun" as failure to disallow Real Mode for an L2 that is configured with
unrestricted guest disabled, when KVM itself has unrestricted guest
enabled, will result in KVM forcing VM86 mode to virtual Real Mode for
L2, but then fail to unwind the related metadata when synthesizing a
nested VM-Exit back to L1 (which has unrestricted guest enabled).
Opportunistically fix a benign typo in the prototype for is_valid_cr4().
Cc: stable@vger.kernel.org
Reported-by: syzbot+5feef0b9ee9c8e9e5689@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000f316b705fdf6e2b4@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230613203037.1968489-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c4abd7352023aa96114915a0bb2b88016a425cda upstream.
Stuff CR0 and/or CR4 to be compliant with a restricted guest if and only
if KVM itself is not configured to utilize unrestricted guests, i.e. don't
stuff CR0/CR4 for a restricted L2 that is running as the guest of an
unrestricted L1. Any attempt to VM-Enter a restricted guest with invalid
CR0/CR4 values should fail, i.e. in a nested scenario, KVM (as L0) should
never observe a restricted L2 with incompatible CR0/CR4, since nested
VM-Enter from L1 should have failed.
And if KVM does observe an active, restricted L2 with incompatible state,
e.g. due to a KVM bug, fudging CR0/CR4 instead of letting VM-Enter fail
does more harm than good, as KVM will often neglect to undo the side
effects, e.g. won't clear rmode.vm86_active on nested VM-Exit, and thus
the damage can easily spill over to L1. On the other hand, letting
VM-Enter fail due to bad guest state is more likely to contain the damage
to L2 as KVM relies on hardware to perform most guest state consistency
checks, i.e. KVM needs to be able to reflect a failed nested VM-Enter into
L1 irrespective of (un)restricted guest behavior.
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Fixes: bddd82d19e2e ("KVM: nVMX: KVM needs to unset "unrestricted guest" VM-execution control in vmcs02 if vmcs12 doesn't set it")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20230613203037.1968489-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 9f9116406120638b4d8db3831ffbc430dd2e1e95 ]
Commit c4e34dd99f2e ("x86: simplify load_unaligned_zeropad()
implementation") changes how exceptions around load_unaligned_zeropad()
handled. The kernel now uses the fault_address in fixup_exception() to
verify the address calculations for the load_unaligned_zeropad().
It works fine for #PF, but breaks on #VE since no fault address is
passed down to fixup_exception().
Propagating ve_info.gla down to fixup_exception() resolves the issue.
See commit 1e7769653b06 ("x86/tdx: Handle load_unaligned_zeropad()
page-cross to a shared page") for more context.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Michael Kelley <mikelley@microsoft.com>
Fixes: c4e34dd99f2e ("x86: simplify load_unaligned_zeropad() implementation")
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 522b1d69219d8f083173819fde04f994aa051a98 upstream.
Add a fix for the Zen2 VZEROUPPER data corruption bug where under
certain circumstances executing VZEROUPPER can cause register
corruption or leak data.
The optimal fix is through microcode but in the case the proper
microcode revision has not been applied, enable a fallback fix using
a chicken bit.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8b6f687743dacce83dbb0c7cfacf88bab00f808a upstream.
Avoid new and remove old forward declarations.
No functional changes.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 27c68c216ee1f1b086e789a64486e6511e380b8a upstream.
On SPR, the load latency event needs an auxiliary event in the same
group to work properly. There's a check in intel_pmu_hw_config()
for this to iterate sibling events and find a mem-loads-aux event.
The for_each_sibling_event() has a lockdep assert to make sure if it
disabled hardirq or hold leader->ctx->mutex. This works well if the
given event has a separate leader event since perf_try_init_event()
grabs the leader->ctx->mutex to protect the sibling list. But it can
cause a problem when the event itself is a leader since the event is
not initialized yet and there's no ctx for the event.
Actually I got a lockdep warning when I run the below command on SPR,
but I guess it could be a NULL pointer dereference.
$ perf record -d -e cpu/mem-loads/uP true
The code path to the warning is:
sys_perf_event_open()
perf_event_alloc()
perf_init_event()
perf_try_init_event()
x86_pmu_event_init()
hsw_hw_config()
intel_pmu_hw_config()
for_each_sibling_event()
lockdep_assert_event_ctx()
We don't need for_each_sibling_event() when it's a standalone event.
Let's return the error code directly.
Fixes: f3c0eba28704 ("perf: Add a few assertions")
Reported-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20230704181516.3293665-1-namhyung@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 0303c9729afc4094ef53e552b7b8cff7436028d6 ]
Niklāvs reported a boot regression on an Alderlake machine and bisected it
to commit 9df9d2f0471b ("init: Invoke arch_cpu_finalize_init() earlier").
By moving the invocation of arch_cpu_finalize_init() further down he
identified that efi_enter_virtual_mode() is the function which causes the
boot hang.
The main difference of the earlier invocation is that the boot CPU is
already fully initialized and mitigations and alternatives are applied.
But the only really interesting change turned out to be IBT, which is now
enabled before efi_enter_virtual_mode(). "ibt=off" on the kernel command
line cured the problem.
Inspection of the involved calls in efi_enter_virtual_mode() unearthed that
efi_set_virtual_address_map() is the only place in the kernel which invokes
an EFI call without the IBT safe wrapper. This went obviously unnoticed so
far as IBT was enabled later.
Use arch_efi_call_virt() instead of efi_call() to cure that.
Fixes: fe379fa4d199 ("x86/ibt: Disable IBT around firmware")
Fixes: 9df9d2f0471b ("init: Invoke arch_cpu_finalize_init() earlier")
Reported-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217602
Link: https://lore.kernel.org/r/87jzvm12q0.ffs@tglx
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0f88130e8a6fd185b0aeb5d8e286083735f2585a ]
Normally __swp_entry_to_pte() is never called with a value translating
to a valid PTE. The only known exception is pte_swap_tests(), resulting
in a WARN splat in Xen PV guests, as __pte_to_swp_entry() did
translate the PFN of the valid PTE to a guest local PFN, while
__swp_entry_to_pte() doesn't do the opposite translation.
Fix that by using __pte() in __swp_entry_to_pte() instead of open
coding the native variant of it.
For correctness do the similar conversion for __swp_entry_to_pmd().
Fixes: 05289402d717 ("mm/debug_vm_pgtable: add tests validating arch helpers for core MM features")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230306123259.12461-1-jgross@suse.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2fad201fe38ff9a692acedb1990ece2c52a29f95 ]
Although, IBS pmus can be invoked via their own interface, indirect
IBS invocation via core pmu events is also supported with fixed set
of events: cpu-cycles:p, r076:p (same as cpu-cycles:p) and r0C1:p
(micro-ops) for user convenience.
This indirect IBS invocation is broken since commit 66d258c5b048
("perf/core: Optimize perf_init_event()"), which added RAW pmu under
'pmu_idr' list and thus if event_init() fails with RAW pmu, it started
returning error instead of trying other pmus.
Forward precise events from core pmu to IBS by overwriting 'type' and
'config' in the kernel copy of perf_event_attr. Overwriting will cause
perf_init_event() to retry with updated 'type' and 'config', which will
automatically forward event to IBS pmu.
Without patch:
$ sudo ./perf record -C 0 -e r076:p -- sleep 1
Error:
The r076:p event is not supported.
With patch:
$ sudo ./perf record -C 0 -e r076:p -- sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.341 MB perf.data (37 samples) ]
Fixes: 66d258c5b048 ("perf/core: Optimize perf_init_event()")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230504110003.2548-3-ravi.bangoria@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a153f254e5cdf8fa3a1df90a6ffed3063fede154 ]
When running as Xen PV initial domain (aka dom0), MTRRs are disabled
by the hypervisor, but the system should nevertheless use correct
cache memory types. This has always kind of worked, as disabled MTRRs
resulted in disabled PAT, too, so that the kernel avoided code paths
resulting in inconsistencies. This bypassed all of the sanity checks
the kernel is doing with enabled MTRRs in order to avoid memory
mappings with conflicting memory types.
This has been changed recently, leading to PAT being accepted to be
enabled, while MTRRs stayed disabled. The result is that
mtrr_type_lookup() no longer is accepting all memory type requests,
but started to return WB even if UC- was requested. This led to
driver failures during initialization of some devices.
In reality MTRRs are still in effect, but they are under complete
control of the Xen hypervisor. It is possible, however, to retrieve
the MTRR settings from the hypervisor.
In order to fix those problems, overwrite the MTRR state via
mtrr_overwrite_state() with the MTRR data from the hypervisor, if the
system is running as a Xen dom0.
Fixes: 72cbc8f04fe2 ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-6-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d053b481a5f16dbd4f020c6b3ebdf9173fdef0e2 ]
Replace size_or_mask and size_and_mask with the much easier concept of
high reserved bits.
While at it, instead of using constants in the MTRR code, use some new
[ bp:
- Drop mtrr_set_mask()
- Unbreak long lines
- Move struct mtrr_state_type out of the uapi header as it doesn't
belong there. It also fixes a HDRTEST breakage "unknown type name ‘bool’"
as Reported-by: kernel test robot <lkp@intel.com>
- Massage.
]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230502120931.20719-3-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Stable-dep-of: a153f254e5cd ("x86/xen: Set MTRR state when running as Xen PV initial domain")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c957f1f3c498bcce85c04e92e60afbae1fd10cde ]
In order to avoid mappings using the UC- cache attribute, set the
MTRR state to use WB caching as the default.
This is needed in order to cope with the fact that PAT is enabled,
while MTRRs are not supported by the hypervisor.
Fixes: 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-5-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 29055dc74287467bd7a053d60b4afe753832960d ]
When running virtualized, MTRR access can be reduced (e.g. in Xen PV
guests or when running as a SEV-SNP guest under Hyper-V). Typically, the
hypervisor will not advertize the MTRR feature in CPUID data, resulting
in no MTRR memory type information being available for the kernel.
This has turned out to result in problems (Link tags below):
- Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
- Xen PV dom0 mapping memory as WB which should be UC- instead
Solve those problems by allowing an MTRR static state override,
overwriting the empty state used today. In case such a state has been
set, don't call get_mtrr_state() in mtrr_bp_init().
The set state will only be used by mtrr_type_lookup(), as in all other
cases mtrr_enabled() is being checked, which will return false. Accept
the overwrite call only for selected cases when running as a guest.
Disable X86_FEATURE_MTRR in order to avoid any MTRR modifications by
just refusing them.
[ bp: Massage. ]
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/all/4fe9541e-4d4c-2b2a-f8c8-2d34a7284930@nerdbynature.de/
Link: https://lore.kernel.org/lkml/BYAPR21MB16883ABC186566BD4D2A1451D7FE9@BYAPR21MB1688.namprd21.prod.outlook.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Stable-dep-of: c957f1f3c498 ("x86/hyperv: Set MTRR state when running as SEV-SNP Hyper-V guest")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f6b980646b93a8c585b4ed991b8a34e8fc6ef847 ]
The physical address width calculation in mtrr_bp_init() can easily be
replaced with using the already available value x86_phys_bits from
struct cpuinfo_x86.
The same information source can be used in mtrr/cleanup.c, removing the
need to pass that value on to mtrr_cleanup().
In print_mtrr_state() use x86_phys_bits instead of recalculating it
from size_or_mask.
Move setting of size_or_mask and size_and_mask into a dedicated new
function in mtrr/generic.c, enabling to make those 2 variables static,
as they are used in generic.c only now.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Michael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/20230502120931.20719-2-jgross@suse.com
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Stable-dep-of: c957f1f3c498 ("x86/hyperv: Set MTRR state when running as SEV-SNP Hyper-V guest")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 195edce08b63d293377f615f4f7f086715d2d212 ]
tl;dr: There is a race in the TDX private<=>shared conversion code
which could kill the TDX guest. Fix it by changing conversion
ordering to eliminate the window.
TDX hardware maintains metadata to track which pages are private and
shared. Additionally, TDX guests use the guest x86 page tables to
specify whether a given mapping is intended to be private or shared.
Bad things happen when the intent and metadata do not match.
So there are two thing in play:
1. "the page" -- the physical TDX page metadata
2. "the mapping" -- the guest-controlled x86 page table intent
For instance, an unrecoverable exit to VMM occurs if a guest touches a
private mapping that points to a shared physical page.
In summary:
* Private mapping => Private Page == OK (obviously)
* Shared mapping => Shared Page == OK (obviously)
* Private mapping => Shared Page == BIG BOOM!
* Shared mapping => Private Page == OK-ish
(It will read generate a recoverable #VE via handle_mmio())
Enter load_unaligned_zeropad(). It can touch memory that is adjacent but
otherwise unrelated to the memory it needs to touch. It will cause one
of those unrecoverable exits (aka. BIG BOOM) if it blunders into a
shared mapping pointing to a private page.
This is a problem when __set_memory_enc_pgtable() converts pages from
shared to private. It first changes the mapping and second modifies
the TDX page metadata. It's moving from:
* Shared mapping => Shared Page == OK
to:
* Private mapping => Shared Page == BIG BOOM!
This means that there is a window with a shared mapping pointing to a
private page where load_unaligned_zeropad() can strike.
Add a TDX handler for guest.enc_status_change_prepare(). This converts
the page from shared to private *before* the page becomes private. This
ensures that there is never a private mapping to a shared page.
Leave a guest.enc_status_change_finish() in place but only use it for
private=>shared conversions. This will delay updating the TDX metadata
marking the page private until *after* the mapping matches the metadata.
This also ensures that there is never a private mapping to a shared page.
[ dhansen: rewrite changelog ]
Fixes: 7dbde7631629 ("x86/mm/cpa: Add support for TDX shared memory")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lore.kernel.org/all/20230606095622.1939-3-kirill.shutemov%40linux.intel.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 3f6819dd192ef4f0c568ec3e9d6d408b3fa1ad3d ]
TDX code is going to provide guest.enc_status_change_prepare() that is
able to fail. TDX will use the call to convert the GPA range from shared
to private. This operation can fail.
Add a way to return an error from the callback.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Link: https://lore.kernel.org/all/20230606095622.1939-2-kirill.shutemov%40linux.intel.com
Stable-dep-of: 195edce08b63 ("x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5dee19b6b2b194216919b99a1f5af2949a754016 ]
When calculating an end address based on an unsigned int number of pages,
any value greater than or equal to 0x100000 that is shift PAGE_SHIFT bits
results in a 0 value, resulting in an invalid end address. Change the
number of pages variable in various routines from an unsigned int to an
unsigned long to calculate the end address correctly.
Fixes: 5e5ccff60a29 ("x86/sev: Add helper for validating pages in early enc attribute changes")
Fixes: dc3f3d2474b8 ("x86/mm: Validate memory when changing the C-bit")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/6a6e4eea0e1414402bac747744984fa4e9c01bb6.1686063086.git.thomas.lendacky@amd.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2997d94b5dd0e8b10076f5e0b6f18410c73e28bd ]
When writing a task id to the "tasks" file in an rdtgroup,
rdtgroup_tasks_write() treats the pid as a number in the current pid
namespace. But when reading the "tasks" file, rdtgroup_tasks_show() shows
the list of global pids from the init namespace, which is confusing and
incorrect.
To be more robust, let the "tasks" file only show pids in the current pid
namespace.
Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Reinette Chatre <reinette.chatre@intel.com>
Acked-by: Fenghua Yu <fenghua.yu@intel.com>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lore.kernel.org/all/20230116071246.97717-1-shawnwang@linux.alibaba.com/
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit c2508ec5a58db67093f4fb8bf89a9a7c53a109e9 upstream.
.. and make x86 use it.
This basically extracts the existing x86 "find and expand faulting vma"
code, but extends it to also take the mmap lock for writing in case we
actually do need to expand the vma.
We've historically short-circuited that case, and have some rather ugly
special logic to serialize the stack segment expansion (since we only
hold the mmap lock for reading) that doesn't match the normal VM
locking.
That slight violation of locking worked well, right up until it didn't:
the maple tree code really does want proper locking even for simple
extension of an existing vma.
So extract the code for "look up the vma of the fault" from x86, fix it
up to do the necessary write locking, and make it available as a helper
function for other architectures that can use the common helper.
Note: I say "common helper", but it really only handles the normal
stack-grows-down case. Which is all architectures except for PA-RISC
and IA64. So some rare architectures can't use the helper, but if they
care they'll just need to open-code this logic.
It's also worth pointing out that this code really would like to have an
optimistic "mmap_upgrade_trylock()" to make it quicker to go from a
read-lock (for the common case) to taking the write lock (for having to
extend the vma) in the normal single-threaded situation where there is
no other locking activity.
But that _is_ all the very uncommon special case, so while it would be
nice to have such an operation, it probably doesn't matter in reality.
I did put in the skeleton code for such a possible future expansion,
even if it only acts as pseudo-documentation for what we're doing.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d7893093a7417527c0d73c9832244e65c9d0114f upstream.
TLDR: It's a mess.
When kexec() is executed on a system with offline CPUs, which are parked in
mwait_play_dead() it can end up in a triple fault during the bootup of the
kexec kernel or cause hard to diagnose data corruption.
The reason is that kexec() eventually overwrites the previous kernel's text,
page tables, data and stack. If it writes to the cache line which is
monitored by a previously offlined CPU, MWAIT resumes execution and ends
up executing the wrong text, dereferencing overwritten page tables or
corrupting the kexec kernels data.
Cure this by bringing the offlined CPUs out of MWAIT into HLT.
Write to the monitored cache line of each offline CPU, which makes MWAIT
resume execution. The written control word tells the offlined CPUs to issue
HLT, which does not have the MWAIT problem.
That does not help, if a stray NMI, MCE or SMI hits the offlined CPUs as
those make it come out of HLT.
A follow up change will put them into INIT, which protects at least against
NMI and SMI.
Fixes: ea53069231f9 ("x86, hotplug: Use mwait to offline a processor, fix the legacy case")
Reported-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.492257119@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f9c9987bf52f4e42e940ae217333ebb5a4c3b506 upstream.
Monitoring idletask::thread_info::flags in mwait_play_dead() has been an
obvious choice as all what is needed is a cache line which is not written
by other CPUs.
But there is a use case where a "dead" CPU needs to be brought out of
MWAIT: kexec().
This is required as kexec() can overwrite text, pagetables, stacks and the
monitored cacheline of the original kernel. The latter causes MWAIT to
resume execution which obviously causes havoc on the kexec kernel which
results usually in triple faults.
Use a dedicated per CPU storage to prepare for that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.434553750@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2affa6d6db28855e6340b060b809c23477aa546e upstream.
The wmb()s before sending the IPIs are not synchronizing anything.
If at all then the apic IPI functions have to provide or act as appropriate
barriers.
Remove these cargo cult barriers which have no explanation of what they are
synchronizing.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230615193330.378358382@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9b040453d4440659f33dc6f0aa26af418ebfe70b upstream.
stop_this_cpu() tests CPUID leaf 0x8000001f::EAX unconditionally. Intel
CPUs return the content of the highest supported leaf when a non-existing
leaf is read, while AMD CPUs return all zeros for unsupported leafs.
So the result of the test on Intel CPUs is lottery.
While harmless it's incorrect and causes the conditional wbinvd() to be
issued where not required.
Check whether the leaf is supported before reading it.
[ tglx: Adjusted changelog ]
Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
Link: https://lore.kernel.org/r/20230615193330.322186388@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1f5e7eb7868e42227ac426c96d437117e6e06e8e upstream.
Tony reported intermittent lockups on poweroff. His analysis identified the
wbinvd() in stop_this_cpu() as the culprit. This was added to ensure that
on SME enabled machines a kexec() does not leave any stale data in the
caches when switching from encrypted to non-encrypted mode or vice versa.
That wbinvd() is conditional on the SME feature bit which is read directly
from CPUID. But that readout does not check whether the CPUID leaf is
available or not. If it's not available the CPU will return the value of
the highest supported leaf instead. Depending on the content the "SME" bit
might be set or not.
That's incorrect but harmless. Making the CPUID readout conditional makes
the observed hangs go away, but it does not fix the underlying problem:
CPU0 CPU1
stop_other_cpus()
send_IPIs(REBOOT); stop_this_cpu()
while (num_online_cpus() > 1); set_online(false);
proceed... -> hang
wbinvd()
WBINVD is an expensive operation and if multiple CPUs issue it at the same
time the resulting delays are even larger.
But CPU0 already observed num_online_cpus() going down to 1 and proceeds
which causes the system to hang.
This issue exists independent of WBINVD, but the delays caused by WBINVD
make it more prominent.
Make this more robust by adding a cpumask which is initialized to the
online CPU mask before sending the IPIs and CPUs clear their bit in
stop_this_cpu() after the WBINVD completed. Check for that cpumask to
become empty in stop_other_cpus() instead of watching num_online_cpus().
The cpumask cannot plug all holes either, but it's better than a raw
counter and allows to restrict the NMI fallback IPI to be sent only the
CPUs which have not reported within the timeout window.
Fixes: 08f253ec3767 ("x86/cpu: Clear SME feature flag when not in use")
Reported-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/3817d810-e0f1-8ef8-0bbd-663b919ca49b@cybernetics.com
Link: https://lore.kernel.org/r/87h6r770bv.ffs@tglx
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a32b0f0db3f396f1c9be2fe621e77c09ec3d8e7d upstream.
Do the same as early loading - load on both threads.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20230605141332.25948-1-bp@alien8.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Drop the __weak attribute from a function prototype as it otherwise
leads to the function getting replaced by a dummy stub
- Fix the umask value setup of the frontend event as former is
different on two Intel cores
* tag 'perf_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix the FRONTEND encoding on GNR and MTL
perf/core: Drop __weak attribute from arch_perf_update_userpage() prototype
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fix from Borislav Petkov:
- Add a ORC format hash to vmlinux and modules in order for other tools
which use it, to detect changes to it and adapt accordingly
* tag 'objtool_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/unwind/orc: Add ELF section with ORC version identifier
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Do not use set_pgd() when updating the KASLR trampoline pgd entry
because that updates the user PGD too on KPTI builds, resulting in
memory corruption
- Prevent a panic in the IO-APIC setup code due to conflicting command
line parameters
* tag 'x86_urgent_for_v6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Fix kernel panic when booting with intremap=off and x2apic_phys
x86/mm: Avoid using set_pgd() outside of real PGD pages
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from ipsec, bpf, mptcp and netfilter.
Current release - regressions:
- netfilter: add NFT_TRANS_PREPARE_ERROR to deal with bound set/chain
- eth: mlx5e:
- fix scheduling of IPsec ASO query while in atomic
- free IRQ rmap and notifier on kernel shutdown
Current release - new code bugs:
- phy: manual remove LEDs to ensure correct ordering
Previous releases - regressions:
- mptcp: fix possible divide by zero in recvmsg()
- dsa: revert "net: phy: dp83867: perform soft reset and retain
established link"
Previous releases - always broken:
- sched: netem: acquire qdisc lock in netem_change()
- bpf:
- fix verifier id tracking of scalars on spill
- fix NULL dereference on exceptions
- accept function names that contain dots
- netfilter: disallow element updates of bound anonymous sets
- mptcp: ensure listener is unhashed before updating the sk status
- xfrm:
- add missed call to delete offloaded policies
- fix inbound ipv4/udp/esp packets to UDPv6 dualstack sockets
- selftests: fixes for FIPS mode
- dsa: mt7530: fix multiple CPU ports, BPDU and LLDP handling
- eth: sfc: use budget for TX completions
Misc:
- wifi: iwlwifi: add support for SO-F device with PCI id 0x7AF0"
* tag 'net-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (74 commits)
revert "net: align SO_RCVMARK required privileges with SO_MARK"
net: wwan: iosm: Convert single instance struct member to flexible array
sch_netem: acquire qdisc lock in netem_change()
selftests: forwarding: Fix race condition in mirror installation
wifi: mac80211: report all unusable beacon frames
mptcp: ensure listener is unhashed before updating the sk status
mptcp: drop legacy code around RX EOF
mptcp: consolidate fallback and non fallback state machine
mptcp: fix possible list corruption on passive MPJ
mptcp: fix possible divide by zero in recvmsg()
mptcp: handle correctly disconnect() failures
bpf: Force kprobe multi expected_attach_type for kprobe_multi link
bpf/btf: Accept function names that contain dots
Revert "net: phy: dp83867: perform soft reset and retain established link"
net: mdio: fix the wrong parameters
netfilter: nf_tables: Fix for deleting base chains with payload
netfilter: nfnetlink_osf: fix module autoload
netfilter: nf_tables: drop module reference after updating chain
netfilter: nf_tables: disallow timeout for anonymous sets
netfilter: nf_tables: disallow updates of anonymous sets
...
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:
====================
pull-request: bpf 2023-06-21
We've added 7 non-merge commits during the last 14 day(s) which contain
a total of 7 files changed, 181 insertions(+), 15 deletions(-).
The main changes are:
1) Fix a verifier id tracking issue with scalars upon spill,
from Maxim Mikityanskiy.
2) Fix NULL dereference if an exception is generated while a BPF
subprogram is running, from Krister Johansen.
3) Fix a BTF verification failure when compiling kernel with LLVM_IAS=0,
from Florent Revest.
4) Fix expected_attach_type enforcement for kprobe_multi link,
from Jiri Olsa.
5) Fix a bpf_jit_dump issue for x86_64 to pick the correct JITed image,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Force kprobe multi expected_attach_type for kprobe_multi link
bpf/btf: Accept function names that contain dots
selftests/bpf: add a test for subprogram extables
bpf: ensure main program has an extable
bpf: Fix a bpf_jit_dump issue for x86_64 with sysctl bpf_jit_enable.
selftests/bpf: Add test cases to assert proper ID tracking on spill
bpf: Fix verifier id tracking of scalars on spill
====================
Link: https://lore.kernel.org/r/20230621101116.16122-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- Fix races in Hyper-V PCI controller (Dexuan Cui)
- Fix handling of hyperv_pcpu_input_arg (Michael Kelley)
- Fix vmbus_wait_for_unload to scan present CPUs (Michael Kelley)
- Call hv_synic_free in the failure path of hv_synic_alloc (Dexuan Cui)
- Add noop for real mode handlers for virtual trust level code (Saurabh
Sengar)
* tag 'hyperv-fixes-signed-20230619' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
PCI: hv: Add a per-bus mutex state_lock
Revert "PCI: hv: Fix a timing issue which causes kdump to fail occasionally"
PCI: hv: Remove the useless hv_pcichild_state from struct hv_pci_dev
PCI: hv: Fix a race condition in hv_irq_unmask() that can cause panic
PCI: hv: Fix a race condition bug in hv_pci_query_relations()
arm64/hyperv: Use CPUHP_AP_HYPERV_ONLINE state to fix CPU online sequencing
x86/hyperv: Fix hyperv_pcpu_input_arg handling when CPUs go online/offline
Drivers: hv: vmbus: Fix vmbus_wait_for_unload() to scan present CPUs
Drivers: hv: vmbus: Call hv_synic_free() if hv_synic_alloc() fails
x86/hyperv/vtl: Add noop for realmode pointers
|
|
When booting with "intremap=off" and "x2apic_phys" on the kernel command
line, the physical x2APIC driver ends up being used even when x2APIC
mode is disabled ("intremap=off" disables x2APIC mode). This happens
because the first compound condition check in x2apic_phys_probe() is
false due to x2apic_mode == 0 and so the following one returns true
after default_acpi_madt_oem_check() having already selected the physical
x2APIC driver.
This results in the following panic:
kernel BUG at arch/x86/kernel/apic/io_apic.c:2409!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.4.0-rc2-ver4.1rc2 #2
Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.3.6 07/06/2021
RIP: 0010:setup_IO_APIC+0x9c/0xaf0
Call Trace:
<TASK>
? native_read_msr
apic_intr_mode_init
x86_late_time_init
start_kernel
x86_64_start_reservations
x86_64_start_kernel
secondary_startup_64_no_verify
</TASK>
which is:
setup_IO_APIC:
apic_printk(APIC_VERBOSE, "ENABLING IO-APIC IRQs\n");
for_each_ioapic(ioapic)
BUG_ON(mp_irqdomain_create(ioapic));
Return 0 to denote that x2APIC has not been enabled when probing the
physical x2APIC driver.
[ bp: Massage commit message heavily. ]
Fixes: 9ebd680bd029 ("x86, apic: Use probe routines to simplify apic selection")
Signed-off-by: Dheeraj Kumar Srivastava <dheerajkumar.srivastava@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kishon Vijay Abraham I <kvijayab@amd.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20230616212236.1389-1-dheerajkumar.srivastava@amd.com
|
|
These commits
a494aef23dfc ("PCI: hv: Replace retarget_msi_interrupt_params with hyperv_pcpu_input_arg")
2c6ba4216844 ("PCI: hv: Enable PCI pass-thru devices in Confidential VMs")
update the Hyper-V virtual PCI driver to use the hyperv_pcpu_input_arg
because that memory will be correctly marked as decrypted or encrypted
for all VM types (CoCo or normal). But problems ensue when CPUs in the
VM go online or offline after virtual PCI devices have been configured.
When a CPU is brought online, the hyperv_pcpu_input_arg for that CPU is
initialized by hv_cpu_init() running under state CPUHP_AP_ONLINE_DYN.
But this state occurs after state CPUHP_AP_IRQ_AFFINITY_ONLINE, which
may call the virtual PCI driver and fault trying to use the as yet
uninitialized hyperv_pcpu_input_arg. A similar problem occurs in a CoCo
VM if the MMIO read and write hypercalls are used from state
CPUHP_AP_IRQ_AFFINITY_ONLINE.
When a CPU is taken offline, IRQs may be reassigned in state
CPUHP_TEARDOWN_CPU. Again, the virtual PCI driver may fault trying to
use the hyperv_pcpu_input_arg that has already been freed by a
higher state.
Fix the onlining problem by adding state CPUHP_AP_HYPERV_ONLINE
immediately after CPUHP_AP_ONLINE_IDLE (similar to CPUHP_AP_KVM_ONLINE)
and before CPUHP_AP_IRQ_AFFINITY_ONLINE. Use this new state for
Hyper-V initialization so that hyperv_pcpu_input_arg is allocated
early enough.
Fix the offlining problem by not freeing hyperv_pcpu_input_arg when
a CPU goes offline. Retain the allocated memory, and reuse it if
the CPU comes back online later.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Link: https://lore.kernel.org/r/1684862062-51576-1-git-send-email-mikelley@microsoft.com
Signed-off-by: Wei Liu <wei.liu@kernel.org>
|
|
KPTI keeps around two PGDs: one for userspace and another for the
kernel. Among other things, set_pgd() contains infrastructure to
ensure that updates to the kernel PGD are reflected in the user PGD
as well.
One side-effect of this is that set_pgd() expects to be passed whole
pages. Unfortunately, init_trampoline_kaslr() passes in a single entry:
'trampoline_pgd_entry'.
When KPTI is on, set_pgd() will update 'trampoline_pgd_entry' (an
8-Byte globally stored [.bss] variable) and will then proceed to
replicate that value into the non-existent neighboring user page
(located +4k away), leading to the corruption of other global [.bss]
stored variables.
Fix it by directly assigning 'trampoline_pgd_entry' and avoiding
set_pgd().
[ dhansen: tweak subject and changelog ]
Fixes: 0925dda5962e ("x86/mm/KASLR: Use only one PUD entry for real mode trampoline")
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Lee Jones <lee@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/all/20230614163859.924309-1-lee@kernel.org/g
|
|
Commits ffb1b4a41016 ("x86/unwind/orc: Add 'signal' field to ORC
metadata") and fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in
two") changed the ORC format. Although ORC is internal to the kernel,
it's the only way for external tools to get reliable kernel stack traces
on x86-64. In particular, the drgn debugger [1] uses ORC for stack
unwinding, and these format changes broke it [2]. As the drgn
maintainer, I don't care how often or how much the kernel changes the
ORC format as long as I have a way to detect the change.
It suffices to store a version identifier in the vmlinux and kernel
module ELF files (to use when parsing ORC sections from ELF), and in
kernel memory (to use when parsing ORC from a core dump+symbol table).
Rather than hard-coding a version number that needs to be manually
bumped, Peterz suggested hashing the definitions from orc_types.h. If
there is a format change that isn't caught by this, the hashing script
can be updated.
This patch adds an .orc_header allocated ELF section containing the
20-byte hash to vmlinux and kernel modules, along with the corresponding
__start_orc_header and __stop_orc_header symbols in vmlinux.
1: https://github.com/osandov/drgn
2: https://github.com/osandov/drgn/issues/303
Fixes: ffb1b4a41016 ("x86/unwind/orc: Add 'signal' field to ORC metadata")
Fixes: fb799447ae29 ("x86,objtool: Split UNWIND_HINT_EMPTY in two")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@kernel.org>
Link: https://lkml.kernel.org/r/aef9c8dc43915b886a8c48509a12ec1b006ca1ca.1686690801.git.osandov@osandov.com
|
|
When counting a FRONTEND event, the MSR_PEBS_FRONTEND is not correctly
set on GNR and MTL p-core.
The umask value for the FRONTEND events is changed on GNR and MTL. The
new umask is missing in the extra_regs[] table.
Add a dedicated intel_gnr_extra_regs[] for GNR and MTL p-core.
Fixes: bc4000fdb009 ("perf/x86/intel: Add Granite Rapids")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20230615173242.3726364-1-kan.liang@linux.intel.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"19 hotfixes. 14 are cc:stable and the remainder address issues which
were introduced during this development cycle or which were considered
inappropriate for a backport"
* tag 'mm-hotfixes-stable-2023-06-12-12-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
zswap: do not shrink if cgroup may not zswap
page cache: fix page_cache_next/prev_miss off by one
ocfs2: check new file size on fallocate call
mailmap: add entry for John Keeping
mm/damon/core: fix divide error in damon_nr_accesses_to_accesses_bp()
epoll: ep_autoremove_wake_function should use list_del_init_careful
mm/gup_test: fix ioctl fail for compat task
nilfs2: reject devices with insufficient block count
ocfs2: fix use-after-free when unmounting read-only filesystem
lib/test_vmalloc.c: avoid garbage in page array
nilfs2: fix possible out-of-bounds segment allocation in resize ioctl
riscv/purgatory: remove PGO flags
powerpc/purgatory: remove PGO flags
x86/purgatory: remove PGO flags
kexec: support purgatories with .text.hot sections
mm/uffd: allow vma to merge as much as possible
mm/uffd: fix vma operation where start addr cuts part of vma
radix-tree: move declarations to header
nilfs2: fix incomplete buffer cleanup in nilfs_btnode_abort_change_key()
|
|
If profile-guided optimization is enabled, the purgatory ends up with
multiple .text sections. This is not supported by kexec and crashes the
system.
Link: https://lkml.kernel.org/r/20230321-kexec_clang16-v7-2-b05c520b7296@chromium.org
Fixes: 930457057abe ("kernel/kexec_file.c: split up __kexec_load_puragory")
Signed-off-by: Ricardo Ribalda <ribalda@chromium.org>
Cc: <stable@vger.kernel.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov (AMD) <bp@alien8.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmer@rivosinc.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Philipp Rudo <prudo@redhat.com>
Cc: Ross Zwisler <zwisler@google.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Rix <trix@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The sysctl net/core/bpf_jit_enable does not work now due to commit
1022a5498f6f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc"). The
commit saved the jitted insns into 'rw_image' instead of 'image'
which caused bpf_jit_dump not dumping proper content.
With 'echo 2 > /proc/sys/net/core/bpf_jit_enable', run
'./test_progs -t fentry_test'. Without this patch, one of jitted
image for one particular prog is:
flen=17 proglen=92 pass=4 image=0000000014c64883 from=test_progs pid=1807
00000000: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
00000010: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
00000020: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
00000030: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
00000040: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
00000050: cc cc cc cc cc cc cc cc cc cc cc cc
With this patch, the jitte image for the same prog is:
flen=17 proglen=92 pass=4 image=00000000b90254b7 from=test_progs pid=1809
00000000: f3 0f 1e fa 0f 1f 44 00 00 66 90 55 48 89 e5 f3
00000010: 0f 1e fa 31 f6 48 8b 57 00 48 83 fa 07 75 2b 48
00000020: 8b 57 10 83 fa 09 75 22 48 8b 57 08 48 81 e2 ff
00000030: 00 00 00 48 83 fa 08 75 11 48 8b 7f 18 be 01 00
00000040: 00 00 48 83 ff 0a 74 02 31 f6 48 bf 18 d0 14 00
00000050: 00 c9 ff ff 48 89 77 00 31 c0 c9 c3
Fixes: 1022a5498f6f ("bpf, x86_64: Use bpf_jit_binary_pack_alloc")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/bpf/20230609005439.3173569-1-yhs@fb.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
- Set up the kernel CS earlier in the boot process in case EFI boots
the kernel after bypassing the decompressor and the CS descriptor
used ends up being the EFI one which is not mapped in the identity
page table, leading to early SEV/SNP guest communication exceptions
resulting in the guest crashing
* tag 'x86_urgent_for_v6.4_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/head/64: Switch to KERNEL_CS as soon as new GDT is installed
|
|
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Address some fallout of the locking rework, this time affecting the
way the vgic is configured
- Fix an issue where the page table walker frees a subtree and then
proceeds with walking what it has just freed...
- Check that a given PA donated to the guest is actually memory (only
affecting pKVM)
- Correctly handle MTE CMOs by Set/Way
- Fix the reported address of a watchpoint forwarded to userspace
- Fix the freeing of the root of stage-2 page tables
- Stop creating spurious PMU events to perform detection of the
default PMU and use the existing PMU list instead
x86:
- Fix a memslot lookup bug in the NX recovery thread that could
theoretically let userspace bypass the NX hugepage mitigation
- Fix a s/BLOCKING/PENDING bug in SVM's vNMI support
- Account exit stats for fastpath VM-Exits that never leave the super
tight run-loop
- Fix an out-of-bounds bug in the optimized APIC map code, and add a
regression test for the race"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: selftests: Add test for race in kvm_recalculate_apic_map()
KVM: x86: Bail from kvm_recalculate_phys_map() if x2APIC ID is out-of-bounds
KVM: x86: Account fastpath-only VM-Exits in vCPU stats
KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASK
KVM: x86/mmu: Grab memslot for correct address space in NX recovery worker
KVM: arm64: Document default vPMU behavior on heterogeneous systems
KVM: arm64: Iterate arm_pmus list to probe for default PMU
KVM: arm64: Drop last page ref in kvm_pgtable_stage2_free_removed()
KVM: arm64: Populate fault info for watchpoint
KVM: arm64: Reload PTE after invoking walker callback on preorder traversal
KVM: arm64: Handle trap of tagged Set/Way CMOs
arm64: Add missing Set/Way CMO encodings
KVM: arm64: Prevent unconditional donation of unmapped regions from the host
KVM: arm64: vgic: Fix a comment
KVM: arm64: vgic: Fix locking comment
KVM: arm64: vgic: Wrap vgic_its_create() with config_lock
KVM: arm64: vgic: Fix a circular locking issue
|
|
Bail from kvm_recalculate_phys_map() and disable the optimized map if the
target vCPU's x2APIC ID is out-of-bounds, i.e. if the vCPU was added
and/or enabled its local APIC after the map was allocated. This fixes an
out-of-bounds access bug in the !x2apic_format path where KVM would write
beyond the end of phys_map.
Check the x2APIC ID regardless of whether or not x2APIC is enabled,
as KVM's hardcodes x2APIC ID to be the vCPU ID, i.e. it can't change, and
the map allocation in kvm_recalculate_apic_map() doesn't check for x2APIC
being enabled, i.e. the check won't get false postivies.
Note, this also affects the x2apic_format path, which previously just
ignored the "x2apic_id > new->max_apic_id" case. That too is arguably a
bug fix, as ignoring the vCPU meant that KVM would not send interrupts to
the vCPU until the next map recalculation. In practice, that "bug" is
likely benign as a newly present vCPU/APIC would immediately trigger a
recalc. But, there's no functional downside to disabling the map, and
a future patch will gracefully handle the -E2BIG case by retrying instead
of simply disabling the optimized map.
Opportunistically add a sanity check on the xAPIC ID size, along with a
comment explaining why the xAPIC ID is guaranteed to be "good".
Reported-by: Michal Luczaj <mhal@rbox.co>
Fixes: 5b84b0291702 ("KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDs")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230602233250.1014316-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
The call to startup_64_setup_env() will install a new GDT but does not
actually switch to using the KERNEL_CS entry until returning from the
function call.
Commit bcce82908333 ("x86/sev: Detect/setup SEV/SME features earlier in
boot") moved the call to sme_enable() earlier in the boot process and in
between the call to startup_64_setup_env() and the switch to KERNEL_CS.
An SEV-ES or an SEV-SNP guest will trigger #VC exceptions during the call
to sme_enable() and if the CS pushed on the stack as part of the exception
and used by IRETQ is not mapped by the new GDT, then problems occur.
Today, the current CS when entering startup_64 is the kernel CS value
because it was set up by the decompressor code, so no issue is seen.
However, a recent patchset that looked to avoid using the legacy
decompressor during an EFI boot exposed this bug. At entry to startup_64,
the CS value is that of EFI and is not mapped in the new kernel GDT. So
when a #VC exception occurs, the CS value used by IRETQ is not valid and
the guest boot crashes.
Fix this issue by moving the block that switches to the KERNEL_CS value to
be done immediately after returning from startup_64_setup_env().
Fixes: bcce82908333 ("x86/sev: Detect/setup SEV/SME features earlier in boot")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/all/6ff1f28af2829cc9aea357ebee285825f90a431f.1684340801.git.thomas.lendacky%40amd.com
|
|
Increment vcpu->stat.exits when handling a fastpath VM-Exit without
going through any part of the "slow" path. Not bumping the exits stat
can result in wildly misleading exit counts, e.g. if the primary reason
the guest is exiting is to program the TSC deadline timer.
Fixes: 404d5d7bff0d ("KVM: X86: Introduce more exit_fastpath_completion enum values")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230602011920.787844-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
While testing Hyper-V enabled Windows Server 2019 guests on Zen4 hardware
I noticed that with vCPU count large enough (> 16) they sometimes froze at
boot.
With vCPU count of 64 they never booted successfully - suggesting some kind
of a race condition.
Since adding "vnmi=0" module parameter made these guests boot successfully
it was clear that the problem is most likely (v)NMI-related.
Running kvm-unit-tests quickly showed failing NMI-related tests cases, like
"multiple nmi" and "pending nmi" from apic-split, x2apic and xapic tests
and the NMI parts of eventinj test.
The issue was that once one NMI was being serviced no other NMI was allowed
to be set pending (NMI limit = 0), which was traced to
svm_is_vnmi_pending() wrongly testing for the "NMI blocked" flag rather
than for the "NMI pending" flag.
Fix this by testing for the right flag in svm_is_vnmi_pending().
Once this is done, the NMI-related kvm-unit-tests pass successfully and
the Windows guest no longer freezes at boot.
Fixes: fa4c027a7956 ("KVM: x86: Add support for SVM's Virtual NMI")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/be4ca192eb0c1e69a210db3009ca984e6a54ae69.1684495380.git.maciej.szmigiero@oracle.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Factor in the address space (non-SMM vs. SMM) of the target shadow page
when recovering potential NX huge pages, otherwise KVM will retrieve the
wrong memslot when zapping shadow pages that were created for SMM. The
bug most visibly manifests as a WARN on the memslot being non-NULL, but
the worst case scenario is that KVM could unaccount the shadow page
without ensuring KVM won't install a huge page, i.e. if the non-SMM slot
is being dirty logged, but the SMM slot is not.
------------[ cut here ]------------
WARNING: CPU: 1 PID: 3911 at arch/x86/kvm/mmu/mmu.c:7015
kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
CPU: 1 PID: 3911 Comm: kvm-nx-lpage-re
RIP: 0010:kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
RSP: 0018:ffff99b284f0be68 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff99b284edd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff9271397024e0 R08: 0000000000000000 R09: ffff927139702450
R10: 0000000000000000 R11: 0000000000000001 R12: ffff99b284f0be98
R13: 0000000000000000 R14: ffff9270991fcd80 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff927f9f640000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f0aacad3ae0 CR3: 000000088fc2c005 CR4: 00000000003726e0
Call Trace:
<TASK>
__pfx_kvm_nx_huge_page_recovery_worker+0x10/0x10 [kvm]
kvm_vm_worker_thread+0x106/0x1c0 [kvm]
kthread+0xd9/0x100
ret_from_fork+0x2c/0x50
</TASK>
---[ end trace 0000000000000000 ]---
This bug was exposed by commit edbdb43fc96b ("KVM: x86: Preserve TDP MMU
roots until they are explicitly invalidated"), which allowed KVM to retain
SMM TDP MMU roots effectively indefinitely. Before commit edbdb43fc96b,
KVM would zap all SMM TDP MMU roots and thus all SMM TDP MMU shadow pages
once all vCPUs exited SMM, which made the window where this bug (recovering
an SMM NX huge page) could be encountered quite tiny. To hit the bug, the
NX recovery thread would have to run while at least one vCPU was in SMM.
Most VMs typically only use SMM during boot, and so the problematic shadow
pages were gone by the time the NX recovery thread ran.
Now that KVM preserves TDP MMU roots until they are explicitly invalidated
(e.g. by a memslot deletion), the window to trigger the bug is effectively
never closed because most VMMs don't delete memslots after boot (except
for a handful of special scenarios).
Fixes: eb298605705a ("KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages")
Reported-by: Fabio Coatti <fabio.coatti@gmail.com>
Closes: https://lore.kernel.org/all/CADpTngX9LESCdHVu_2mQkNGena_Ng2CphWNwsRGSMxzDsTjU2A@mail.gmail.com
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230602010137.784664-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks's now show up as processes so scripts doing
ps or ps a would not incorrectly detect the vhost task as another
process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but
vhost tasks's didn't disable or add support for them.
To fix both bugs, this switches the vhost task to be thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND which requires those 2 signals to be supported.
This is a modified version of the patch written by Mike Christie
<michael.christie@oracle.com> which was a modified version of patch
originally written by Linus.
Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
Including ignoring signals, setting up the register state, and having
get_signal return instead of calling do_group_exit.
Tidied up the vhost_task abstraction so that the definition of
vhost_task only needs to be visible inside of vhost_task.c. Making
it easier to review the code and tell what needs to be done where.
As part of this the main loop has been moved from vhost_worker into
vhost_task_fn. vhost_worker now returns true if work was done.
The main loop has been updated to call get_signal which handles
SIGSTOP, freezing, and collects the message that tells the thread to
exit as part of process exit. This collection clears
__fatal_signal_pending. This collection is not guaranteed to
clear signal_pending() so clear that explicitly so the schedule()
sleeps.
For now the vhost thread continues to exist and run work until the
last file descriptor is closed and the release function is called as
part of freeing struct file. To avoid hangs in the coredump
rendezvous and when killing threads in a multi-threaded exec. The
coredump code and de_thread have been modified to ignore vhost threads.
Remvoing the special case for exec appears to require teaching
vhost_dev_flush how to directly complete transactions in case
the vhost thread is no longer running.
Removing the special case for coredump rendezvous requires either the
above fix needed for exec or moving the coredump rendezvous into
get_signal.
Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Co-developed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Mike Christie <michael.christie@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"Fix an alignment crash in x86/aria"
* tag 'v6.4-p3' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: x86/aria - Use 16 byte alignment for GFNI constant vectors
|