Age | Commit message (Collapse) | Author | Files | Lines |
|
Ensure that SME traps are disabled for (h)VHE when setting up
EL2, as they are for nVHE.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230724123829.2929609-5-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Use the architectural feature trap/control register that
corresponds to the current KVM mode, i.e., CPTR_EL2 or CPACR_EL1,
when setting up SVE feature traps.
Signed-off-by: Fuad Tabba <tabba@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230724123829.2929609-4-tabba@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The code for checking whether the kernel is in (h)VHE mode is
repeated, and will be needed again in future patches. Factor it
out in a macro.
No functional change intended.
No change in emitted assembly code intended.
Signed-off-by: Fuad Tabba <tabba@google.com>
Link: https://lore.kernel.org/kvmarm/20230724123829.2929609-3-tabba@google.com/
Reviewed-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
kvm_arm_hardware_enabled is rather misleading, since it doesn't track
the state of all hardware resources needed for running a VM. What it
actually tracks is whether or not the hyp cpu context has been
initialized.
Since we're now at the point where vgic + timer irq management has
been separated from kvm_arm_hardware_enabled, rephrase it (and the
associated helpers) to make it clear what state is being tracked.
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230719231855.262973-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
When running in protected mode, the hyp stub is disabled after pKVM is
initialized, meaning the host cannot enable/disable the hyp at
runtime. As such, kvm_arm_hardware_enabled is always 1 after
initialization, and kvm_arch_hardware_enable() never enables the vgic
maintenance irq or timer irqs.
Unconditionally enable/disable the vgic + timer irqs in the respective
calls, instead relying on the percpu bookkeeping in the generic code
to keep track of which cpus have the interrupts unmasked.
Fixes: 466d27e48d7c ("KVM: arm64: Simplify the CPUHP logic")
Reported-by: Oliver Upton <oliver.upton@linux.dev>
Suggested-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Link: https://lore.kernel.org/r/20230719175400.647154-1-rananta@google.com
Acked-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
pKVM initialization fails on systems with v1.1+ FF-A implementations, as
the hyp does a strict match on the returned version from FFA_VERSION.
This is a stronger assertion than required by the specification, which
requires minor revisions be backwards compatible with earlier revisions
of the same major version.
Relax the check in hyp_ffa_init() to only test the returned major
version. Even though v1.1 broke ABI, the expectation is that firmware
incapable of using the v1.0 ABI return NOT_SUPPORTED instead of a valid
version.
Acked-by: Marc Zyngier <maz@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20230718184537.3220867-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
For those PMU system registers defined in sys_reg_descs[], use macro
PMU_SYS_REG() / PMU_PMEVCNTR_EL0 / PMU_PMEVTYPER_EL0 to define them, and
later two macros call macro PMU_SYS_REG() actually.
Currently the input parameter of PMU_SYS_REG() is another macro which is
calculation formula of the value of system registers, so for example, if
we want to "SYS_PMINTENSET_EL1" as the name of sys register, actually
the name we get is as following:
(((3) << 19) | ((0) << 16) | ((9) << 12) | ((14) << 8) | ((1) << 5))
The name of system register is used in some tracepoints such as
trace_kvm_sys_access(), if not set correctly, we need to analyze the
inaccurate name to get the exact name (which also is inconsistent with
other system registers), and also the inaccurate name occupies more space.
To fix the issue, use the name as a input parameter of PMU_SYS_REG like
MTE_REG or EL2_REG.
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/1689305920-170523-1-git-send-email-chenxiang66@hisilicon.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
The PMU event ID varies from 10 to 16 bits, depending on the PMU
version. If the PMU only supports 10 bits of event ID, bits [15:10] of
the evtCount field behave as RES0.
While the actual PMU emulation code gets this right (i.e. RES0 bits are
masked out when programming the perf event), the sysreg emulation writes
an unmasked value to the in-memory cpu context. The net effect is that
guest reads and writes of PMEVTYPER<n>_EL0 will see non-RES0 behavior in
the reserved bits of the field.
As it so happens, kvm_pmu_set_counter_event_type() already writes a
masked value to the in-memory context that gets overwritten by
access_pmu_evtyper(). Fix the issue by removing the unnecessary (and
incorrect) register write in access_pmu_evtyper().
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Reiji Watanabe <reijiw@google.com>
Link: https://lore.kernel.org/r/20230713221649.3889210-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Xiang reports that VMs occasionally fail to boot on GICv4.1 systems when
running a preemptible kernel, as it is possible that a vCPU is blocked
without requesting a doorbell interrupt.
The issue is that any preemption that occurs between vgic_v4_put() and
schedule() on the block path will mark the vPE as nonresident and *not*
request a doorbell irq. This occurs because when the vcpu thread is
resumed on its way to block, vcpu_load() will make the vPE resident
again. Once the vcpu actually blocks, we don't request a doorbell
anymore, and the vcpu won't be woken up on interrupt delivery.
Fix it by tracking that we're entering WFI, and key the doorbell
request on that flag. This allows us not to make the vPE resident
when going through a preempt/schedule cycle, meaning we don't lose
any state.
Cc: stable@vger.kernel.org
Fixes: 8e01d9a396e6 ("KVM: arm64: vgic-v4: Move the GICv4 residency flow to be driven by vcpu_load/put")
Reported-by: Xiang Chen <chenxiang66@hisilicon.com>
Suggested-by: Zenghui Yu <yuzenghui@huawei.com>
Tested-by: Xiang Chen <chenxiang66@hisilicon.com>
Co-developed-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Acked-by: Zenghui Yu <yuzenghui@huawei.com>
Link: https://lore.kernel.org/r/20230713070657.3873244-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Some bti instructions were missing from
commit b53d4a272349 ("KVM: arm64: Use BTI for nvhe")
1) kvm_host_psci_cpu_entry
kvm_host_psci_cpu_entry is called from __kvm_hyp_init_cpu through "br"
instruction as __kvm_hyp_init_cpu resides in idmap section while
kvm_host_psci_cpu_entry is in hyp .text so the offset is larger than
128MB range covered by "b".
Which means that this function should start with "bti j" instruction.
LLVM which is the only compiler supporting BTI for Linux, adds "bti j"
for jump tables or by when taking the address of the block [1].
Same behaviour is observed with GCC.
As kvm_host_psci_cpu_entry is a C function, this must be done in
assembly.
Another solution is to use X16/X17 with "br", as according to ARM
ARM DDI0487I.a RLJHCL/IGMGRS, PACIASP has an implicit branch
target identification instruction that is compatible with
PSTATE.BTYPE 0b01 which includes "br X16/X17"
And the kvm_host_psci_cpu_entry has PACIASP as it is an external
function.
Although, using explicit "bti" makes it more clear than relying on
which register is used.
A third solution is to clear SCTLR_EL2.BT, which would make PACIASP
compatible PSTATE.BTYPE 0b11 ("br" to other registers).
However this deviates from the kernel behaviour (in bti_enable()).
2) Spectre vector table
"br" instructions are generated at runtime for the vector table
(__bp_harden_hyp_vecs).
These branches would land on vectors in __kvm_hyp_vector at offset 8.
As all the macros are defined with valid_vect/invalid_vect, it is
sufficient to add "bti j" at the correct offset.
[1] https://reviews.llvm.org/D52867
Fixes: b53d4a272349 ("KVM: arm64: Use BTI for nvhe")
Signed-off-by: Mostafa Saleh <smostafa@google.com>
Reported-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Link: https://lore.kernel.org/r/20230706152240.685684-1-smostafa@google.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Userspace is allowed to select any PAGE_SIZE aligned hva to back guest
memory. This is even the case with hugepages, although it is a rather
suboptimal configuration as PTE level mappings are used at stage-2.
The arm64 page aging handlers have an assumption that the specified
range is exactly one page/block of memory, which in the aforementioned
case is not necessarily true. All together this leads to the WARN() in
kvm_age_gfn() firing.
However, the WARN is only part of the issue as the table walkers visit
at most a single leaf PTE. For hugepage-backed memory in a memslot that
isn't hugepage-aligned, page aging entirely misses accesses to the
hugepage beyond the first page in the memslot.
Add a new walker dedicated to handling page aging MMU notifiers capable
of walking a range of PTEs. Convert kvm(_test)_age_gfn() over to the new
walker and drop the WARN that caught the issue in the first place. The
implementation of this walker was inspired by the test_clear_young()
implementation by Yu Zhao [*], but repurposed to address a bug in the
existing aging implementation.
Cc: stable@vger.kernel.org # v5.15
Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers")
Link: https://lore.kernel.org/kvmarm/20230526234435.662652-6-yuzhao@google.com/
Co-developed-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reported-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Shaoqin Huang <shahuang@redhat.com>
Link: https://lore.kernel.org/r/20230627235405.4069823-1-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Since 0bf50497f03b ("KVM: Drop kvm_count_lock and instead protect
kvm_usage_count with kvm_lock"), hotplugging back a CPU whilst
a guest is running results in a number of ugly splats as most
of this code expects to run with preemption disabled, which isn't
the case anymore.
While the context is preemptable, it isn't migratable, which should
be enough. But we have plenty of preemptible() checks all over
the place, and our per-CPU accessors also disable preemption.
Since this affects released versions, let's do the easy fix first,
disabling preemption in kvm_arch_hardware_enable(). We can always
revisit this with a more invasive fix in the future.
Fixes: 0bf50497f03b ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
Reported-by: Kristina Martsenko <kristina.martsenko@arm.com>
Tested-by: Kristina Martsenko <kristina.martsenko@arm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/aeab7562-2d39-e78e-93b1-4711f8cc3fa5@arm.com
Cc: stable@vger.kernel.org # v6.3, v6.4
Link: https://lore.kernel.org/r/20230703163548.1498943-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
Currently there is no synchronisation between finalize_pkvm() and
kvm_arm_init() initcalls. The finalize_pkvm() proceeds happily even if
kvm_arm_init() fails resulting in the following warning on all the CPUs
and eventually a HYP panic:
| kvm [1]: IPA Size Limit: 48 bits
| kvm [1]: Failed to init hyp memory protection
| kvm [1]: error initializing Hyp mode: -22
|
| <snip>
|
| WARNING: CPU: 0 PID: 0 at arch/arm64/kvm/pkvm.c:226 _kvm_host_prot_finalize+0x30/0x50
| Modules linked in:
| CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.4.0 #237
| Hardware name: FVP Base RevC (DT)
| pstate: 634020c5 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
| pc : _kvm_host_prot_finalize+0x30/0x50
| lr : __flush_smp_call_function_queue+0xd8/0x230
|
| Call trace:
| _kvm_host_prot_finalize+0x3c/0x50
| on_each_cpu_cond_mask+0x3c/0x6c
| pkvm_drop_host_privileges+0x4c/0x78
| finalize_pkvm+0x3c/0x5c
| do_one_initcall+0xcc/0x240
| do_initcall_level+0x8c/0xac
| do_initcalls+0x54/0x94
| do_basic_setup+0x1c/0x28
| kernel_init_freeable+0x100/0x16c
| kernel_init+0x20/0x1a0
| ret_from_fork+0x10/0x20
| Failed to finalize Hyp protection: -22
| dtb=fvp-base-revc.dtb
| kvm [95]: nVHE hyp BUG at: arch/arm64/kvm/hyp/nvhe/mem_protect.c:540!
| kvm [95]: nVHE call trace:
| kvm [95]: [<ffff800081052984>] __kvm_nvhe_hyp_panic+0xac/0xf8
| kvm [95]: [<ffff800081059644>] __kvm_nvhe_handle_host_mem_abort+0x1a0/0x2ac
| kvm [95]: [<ffff80008105511c>] __kvm_nvhe_handle_trap+0x4c/0x160
| kvm [95]: [<ffff8000810540fc>] __kvm_nvhe___skip_pauth_save+0x4/0x4
| kvm [95]: ---[ end nVHE call trace ]---
| kvm [95]: Hyp Offset: 0xfffe8db00ffa0000
| Kernel panic - not syncing: HYP panic:
| PS:a34023c9 PC:0000f250710b973c ESR:00000000f2000800
| FAR:ffff000800cb00d0 HPFAR:000000000880cb00 PAR:0000000000000000
| VCPU:0000000000000000
| CPU: 3 PID: 95 Comm: kworker/u16:2 Tainted: G W 6.4.0 #237
| Hardware name: FVP Base RevC (DT)
| Workqueue: rpciod rpc_async_schedule
| Call trace:
| dump_backtrace+0xec/0x108
| show_stack+0x18/0x2c
| dump_stack_lvl+0x50/0x68
| dump_stack+0x18/0x24
| panic+0x138/0x33c
| nvhe_hyp_panic_handler+0x100/0x184
| new_slab+0x23c/0x54c
| ___slab_alloc+0x3e4/0x770
| kmem_cache_alloc_node+0x1f0/0x278
| __alloc_skb+0xdc/0x294
| tcp_stream_alloc_skb+0x2c/0xf0
| tcp_sendmsg_locked+0x3d0/0xda4
| tcp_sendmsg+0x38/0x5c
| inet_sendmsg+0x44/0x60
| sock_sendmsg+0x1c/0x34
| xprt_sock_sendmsg+0xdc/0x274
| xs_tcp_send_request+0x1ac/0x28c
| xprt_transmit+0xcc/0x300
| call_transmit+0x78/0x90
| __rpc_execute+0x114/0x3d8
| rpc_async_schedule+0x28/0x48
| process_one_work+0x1d8/0x314
| worker_thread+0x248/0x474
| kthread+0xfc/0x184
| ret_from_fork+0x10/0x20
| SMP: stopping secondary CPUs
| Kernel Offset: 0x57c5cb460000 from 0xffff800080000000
| PHYS_OFFSET: 0x80000000
| CPU features: 0x00000000,1035b7a3,ccfe773f
| Memory Limit: none
| ---[ end Kernel panic - not syncing: HYP panic:
| PS:a34023c9 PC:0000f250710b973c ESR:00000000f2000800
| FAR:ffff000800cb00d0 HPFAR:000000000880cb00 PAR:0000000000000000
| VCPU:0000000000000000 ]---
Fix it by checking for the successfull initialisation of kvm_arm_init()
in finalize_pkvm() before proceeding any futher.
Fixes: 87727ba2bb05 ("KVM: arm64: Ensure CPU PMU probes before pKVM host de-privilege")
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Zenghui Yu <yuzenghui@huawei.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20230704193243.3300506-1-sudeep.holla@arm.com
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
It recently appeared that, when running VHE, there is a notable
difference between using CNTKCTL_EL1 and CNTHCTL_EL2, despite what
the architecture documents:
- When accessed from EL2, bits [19:18] and [16:10] of CNTKCTL_EL1 have
the same assignment as CNTHCTL_EL2
- When accessed from EL1, bits [19:18] and [16:10] are RES0
It is all OK, until you factor in NV, where the EL2 guest runs at EL1.
In this configuration, CNTKCTL_EL11 doesn't trap, nor ends up in
the VNCR page. This means that any write from the guest affecting
CNTHCTL_EL2 using CNTKCTL_EL1 ends up losing some state. Not good.
The fix it obvious: don't use CNTKCTL_EL1 if you want to change bits
that are not part of the EL1 definition of CNTKCTL_EL1, and use
CNTHCTL_EL2 instead. This doesn't change anything for a bare-metal OS,
and fixes it when running under NV. The NV hypervisor will itself
have to work harder to merge the two accessors.
Note that there is a pending update to the architecture to address
this issue by making the affected bits UNKNOWN when CNTKCTL_EL1 is
used from EL2 with VHE enabled.
Fixes: c605ee245097 ("KVM: arm64: timers: Allow physical offset without CNTPOFF_EL2")
Signed-off-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org # v6.4
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Link: https://lore.kernel.org/r/20230627140557.544885-1-maz@kernel.org
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
|
|
|
|
We just sorted the entries and fields last release, so just out of a
perverse sense of curiosity, I decided to see if we can keep things
ordered for even just one release.
The answer is "No. No we cannot".
I suggest that all kernel developers will need weekly training sessions,
involving a lot of Big Bird and Sesame Street. And at the yearly
maintainer summit, we will all sing the alphabet song together.
I doubt I will keep doing this. At some point "perverse sense of
curiosity" turns into just a cold dark place filled with sadness and
despair.
Repeats: 80e62bc8487b ("MAINTAINERS: re-sort all entries and fields")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:
- swiotlb area sizing fixes (Petr Tesarik)
* tag 'dma-mapping-6.5-2023-07-09' of git://git.infradead.org/users/hch/dma-mapping:
swiotlb: reduce the number of areas to match actual memory pool size
swiotlb: always set the number of areas before allocating the pool
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq update from Borislav Petkov:
- Optimize IRQ domain's name assignment
* tag 'irq_urgent_for_v6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqdomain: Use return value of strreplace()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fpu fix from Borislav Petkov:
- Do FPU AP initialization on Xen PV too which got missed by the recent
boot reordering work
* tag 'x86_urgent_for_v6.5_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/xen: Fix secondary processors' FPU initialization
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
"A single fix for the mechanism to park CPUs with an INIT IPI.
On shutdown or kexec, the kernel tries to park the non-boot CPUs with
an INIT IPI. But the same code path is also used by the crash utility.
If the CPU which panics is not the boot CPU then it sends an INIT IPI
to the boot CPU which resets the machine.
Prevent this by validating that the CPU which runs the stop mechanism
is the boot CPU. If not, leave the other CPUs in HLT"
* tag 'x86-core-2023-07-09' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/smp: Don't send INIT to boot CPU
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Thomas Bogendoerfer:
- fixes for KVM
- fix for loongson build and cpu probing
- DT fixes
* tag 'mips_6.5_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: kvm: Fix build error with KVM_MIPS_DEBUG_COP0_COUNTERS enabled
MIPS: dts: add missing space before {
MIPS: Loongson: Fix build error when make modules_install
MIPS: KVM: Fix NULL pointer dereference
MIPS: Loongson: Fix cpu_probe_loongson() again
|
|
Pull xfs fix from Darrick Wong:
"Nothing exciting here, just getting rid of a gcc warning that I got
tired of seeing when I turn on gcov"
* tag 'xfs-6.5-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix uninit warning in xfs_growfs_data
|
|
git://git.samba.org/sfrench/cifs-2.6
Pull more smb client updates from Steve French:
- fix potential use after free in unmount
- minor cleanup
- add worker to cleanup stale directory leases
* tag '6.5-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Add a laundromat thread for cached directories
smb: client: remove redundant pointer 'server'
cifs: fix session state transition to avoid use-after-free issue
|
|
Pull NTB updates from Jon Mason:
"Fixes for pci_clean_master, error handling in driver inits, and
various other issues/bugs"
* tag 'ntb-6.5' of https://github.com/jonmason/ntb:
ntb: hw: amd: Fix debugfs_create_dir error checking
ntb.rst: Fix copy and paste error
ntb_netdev: Fix module_init problem
ntb: intel: Remove redundant pci_clear_master
ntb: epf: Remove redundant pci_clear_master
ntb_hw_amd: Remove redundant pci_clear_master
ntb: idt: drop redundant pci_enable_pcie_error_reporting()
MAINTAINERS: git://github -> https://github.com for jonmason
NTB: EPF: fix possible memory leak in pci_vntb_probe()
NTB: ntb_tool: Add check for devm_kcalloc
NTB: ntb_transport: fix possible memory leak while device_register() fails
ntb: intel: Fix error handling in intel_ntb_pci_driver_init()
NTB: amd: Fix error handling in amd_ntb_pci_driver_init()
ntb: idt: Fix error handling in idt_pci_driver_init()
|
|
Lockdep is certainly right to complain about
(&vma->vm_lock->lock){++++}-{3:3}, at: vma_start_write+0x2d/0x3f
but task is already holding lock:
(&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db
Invert those to the usual ordering.
Fixes: 33313a747e81 ("mm: lock newly mapped VMA which can be modified after it becomes visible")
Cc: stable@vger.kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull hotfixes from Andrew Morton:
"16 hotfixes. Six are cc:stable and the remainder address post-6.4
issues"
The merge undoes the disabling of the CONFIG_PER_VMA_LOCK feature, since
it was all hopefully fixed in mainline.
* tag 'mm-hotfixes-stable-2023-07-08-10-43' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
lib: dhry: fix sleeping allocations inside non-preemptable section
kasan, slub: fix HW_TAGS zeroing with slub_debug
kasan: fix type cast in memory_is_poisoned_n
mailmap: add entries for Heiko Stuebner
mailmap: update manpage link
bootmem: remove the vmemmap pages from kmemleak in free_bootmem_page
MAINTAINERS: add linux-next info
mailmap: add Markus Schneider-Pargmann
writeback: account the number of pages written back
mm: call arch_swap_restore() from do_swap_page()
squashfs: fix cache race with migration
mm/hugetlb.c: fix a bug within a BUG(): inconsistent pte comparison
docs: update ocfs2-devel mailing list address
MAINTAINERS: update ocfs2-devel mailing list address
mm: disable CONFIG_PER_VMA_LOCK until its fixed
fork: lock VMAs of the parent process when forking
|
|
When forking a child process, the parent write-protects anonymous pages
and COW-shares them with the child being forked using copy_present_pte().
We must not take any concurrent page faults on the source vma's as they
are being processed, as we expect both the vma and the pte's behind it
to be stable. For example, the anon_vma_fork() expects the parents
vma->anon_vma to not change during the vma copy.
A concurrent page fault on a page newly marked read-only by the page
copy might trigger wp_page_copy() and a anon_vma_prepare(vma) on the
source vma, defeating the anon_vma_clone() that wasn't done because the
parent vma originally didn't have an anon_vma, but we now might end up
copying a pte entry for a page that has one.
Before the per-vma lock based changes, the mmap_lock guaranteed
exclusion with concurrent page faults. But now we need to do a
vma_start_write() to make sure no concurrent faults happen on this vma
while it is being processed.
This fix can potentially regress some fork-heavy workloads. Kernel
build time did not show noticeable regression on a 56-core machine while
a stress test mapping 10000 VMAs and forking 5000 times in a tight loop
shows ~5% regression. If such fork time regression is unacceptable,
disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
optimizations are possible if this regression proves to be problematic.
Suggested-by: David Hildenbrand <david@redhat.com>
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a34c@applied-asynchrony.com/
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
mmap_region adds a newly created VMA into VMA tree and might modify it
afterwards before dropping the mmap_lock. This poses a problem for page
faults handled under per-VMA locks because they don't take the mmap_lock
and can stumble on this VMA while it's still being modified. Currently
this does not pose a problem since post-addition modifications are done
only for file-backed VMAs, which are not handled under per-VMA lock.
However, once support for handling file-backed page faults with per-VMA
locks is added, this will become a race.
Fix this by write-locking the VMA before inserting it into the VMA tree.
Other places where a new VMA is added into VMA tree do not modify it
after the insertion, so do not need the same locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
With recent changes necessitating mmap_lock to be held for write while
expanding a stack, per-VMA locks should follow the same rules and be
write-locked to prevent page faults into the VMA being expanded. Add
the necessary locking.
Cc: stable@vger.kernel.org
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Pull more SCSI updates from James Bottomley:
"A few late arriving patches that missed the initial pull request. It's
mostly bug fixes (the dt-bindings is a fix for the initial pull)"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Remove unused function declaration
scsi: target: docs: Remove tcm_mod_builder.py
scsi: target: iblock: Quiet bool conversion warning with pr_preempt use
scsi: dt-bindings: ufs: qcom: Fix ICE phandle
scsi: core: Simplify scsi_cdl_check_cmd()
scsi: isci: Fix comment typo
scsi: smartpqi: Replace one-element arrays with flexible-array members
scsi: target: tcmu: Replace strlcpy() with strscpy()
scsi: ncr53c8xx: Replace strlcpy() with strscpy()
scsi: lpfc: Fix lpfc_name struct packing
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull more i2c updates from Wolfram Sang:
- xiic patch should have been in the original pull but slipped through
- mpc patch fixes a build regression
- nomadik cleanup
* tag 'i2c-for-6.5-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: mpc: Drop unused variable
i2c: nomadik: Remove a useless call in the remove function
i2c: xiic: Don't try to handle more interrupt events after error
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- Check for NULL bdev in LoadPin (Matthias Kaehlcke)
- Revert unwanted KUnit FORTIFY build default
- Fix 1-element array causing boot warnings with xhci-hub
* tag 'hardening-v6.5-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
usb: ch9: Replace bmSublinkSpeedAttr 1-element array with flexible array
Revert "fortify: Allow KUnit test to build without FORTIFY"
dm: verity-loadpin: Add NULL pointer check for 'bdev' parameter
|
|
The debugfs_create_dir function returns ERR_PTR in case of error, and the
only correct way to check if an error occurred is 'IS_ERR' inline function.
This patch will replace the null-comparison with IS_ERR.
Signed-off-by: Anup Sharma <anupnewsmail@gmail.com>
Suggested-by: Ivan Orlov <ivan.orlov0322@gmail.com>
Signed-off-by: Jon Mason <jdmason@kudzu.us>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next
Pull more perf tools updates from Namhyung Kim:
"These are remaining changes and fixes for this cycle.
Build:
- Allow generating vmlinux.h from BTF using `make GEN_VMLINUX_H=1`
and skip if the vmlinux has no BTF.
- Replace deprecated clang -target xxx option by --target=xxx.
perf record:
- Print event attributes with well known type and config symbols in
the debug output like below:
# perf record -e cycles,cpu-clock -C0 -vv true
<SNIP>
------------------------------------------------------------
perf_event_attr:
type 0 (PERF_TYPE_HARDWARE)
size 136
config 0 (PERF_COUNT_HW_CPU_CYCLES)
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|CPU|PERIOD|IDENTIFIER
read_format ID
disabled 1
inherit 1
freq 1
sample_id_all 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8 = 5
------------------------------------------------------------
perf_event_attr:
type 1 (PERF_TYPE_SOFTWARE)
size 136
config 0 (PERF_COUNT_SW_CPU_CLOCK)
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|CPU|PERIOD|IDENTIFIER
read_format ID
disabled 1
inherit 1
freq 1
sample_id_all 1
exclude_guest 1
- Update AMD IBS event error message since it now support per-process
profiling but no priviledge filters.
$ sudo perf record -e ibs_op//k -C 0
Error:
AMD IBS doesn't support privilege filtering. Try again without
the privilege modifiers (like 'k') at the end.
perf lock contention:
- Support CSV style output using -x option
$ sudo perf lock con -ab -x, sleep 1
# output: contended, total wait, max wait, avg wait, type, caller
19, 194232, 21415, 10222, spinlock, process_one_work+0x1f0
15, 162748, 23843, 10849, rwsem:R, do_user_addr_fault+0x40e
4, 86740, 23415, 21685, rwlock:R, ep_poll_callback+0x2d
1, 84281, 84281, 84281, mutex, iwl_mvm_async_handlers_wk+0x135
8, 67608, 27404, 8451, spinlock, __queue_work+0x174
3, 58616, 31125, 19538, rwsem:W, do_mprotect_pkey+0xff
3, 52953, 21172, 17651, rwlock:W, do_epoll_wait+0x248
2, 30324, 19704, 15162, rwsem:R, do_madvise+0x3ad
1, 24619, 24619, 24619, spinlock, rcu_core+0xd4
- Add --output option to save the data to a file not to be interfered
by other debug messages.
Test:
- Fix event parsing test on ARM where there's no raw PMU nor supports
PERF_PMU_CAP_EXTENDED_HW_TYPE.
- Update the lock contention test case for CSV output.
- Fix a segfault in the daemon command test.
Vendor events (JSON):
- Add has_event() to check if the given event is available on system
at runtime. On Intel machines, some transaction events may not be
present when TSC extensions are disabled.
- Update Intel event metrics.
Misc:
- Sort symbols by name using an external array of pointers instead of
a rbtree node in the symbol. This will save 16-bytes or 24-bytes
per symbol whether the sorting is actually requested or not.
- Fix unwinding DWARF callstacks using libdw when --symfs option is
used"
* tag 'perf-tools-for-v6.5-2-2023-07-06' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next: (38 commits)
perf test: Fix event parsing test when PERF_PMU_CAP_EXTENDED_HW_TYPE isn't supported.
perf test: Fix event parsing test on Arm
perf evsel amd: Fix IBS error message
perf: unwind: Fix symfs with libdw
perf symbol: Fix uninitialized return value in symbols__find_by_name()
perf test: Test perf lock contention CSV output
perf lock contention: Add --output option
perf lock contention: Add -x option for CSV style output
perf lock: Remove stale comments
perf vendor events intel: Update tigerlake to 1.13
perf vendor events intel: Update skylakex to 1.31
perf vendor events intel: Update skylake to 57
perf vendor events intel: Update sapphirerapids to 1.14
perf vendor events intel: Update icelakex to 1.21
perf vendor events intel: Update icelake to 1.19
perf vendor events intel: Update cascadelakex to 1.19
perf vendor events intel: Update meteorlake to 1.03
perf vendor events intel: Add rocketlake events/metrics
perf vendor metrics intel: Make transaction metrics conditional
perf jevents: Support for has_event function
...
|
|
Pull bitmap updates from Yury Norov:
"Fixes for different bitmap pieces:
- lib/test_bitmap: increment failure counter properly
The tests that don't use expect_eq() macro to determine that a test
is failured must increment failed_tests explicitly.
- lib/bitmap: drop optimization of bitmap_{from,to}_arr64
bitmap_{from,to}_arr64() optimization is overly optimistic
on 32-bit LE architectures when it's wired to
bitmap_copy_clear_tail().
- nodemask: Drop duplicate check in for_each_node_mask()
As the return value type of first_node() became unsigned, the node
>= 0 became unnecessary.
- cpumask: fix function description kernel-doc notation
- MAINTAINERS: Add bits.h and bitfield.h to the BITMAP API record
Add linux/bits.h and linux/bitfield.h for visibility"
* tag 'bitmap-6.5-rc1' of https://github.com/norov/linux:
MAINTAINERS: Add bitfield.h to the BITMAP API record
MAINTAINERS: Add bits.h to the BITMAP API record
cpumask: fix function description kernel-doc notation
nodemask: Drop duplicate check in for_each_node_mask()
lib/bitmap: drop optimization of bitmap_{from,to}_arr64
lib/test_bitmap: increment failure counter properly
|
|
The Smatch static checker reports the following warnings:
lib/dhry_run.c:38 dhry_benchmark() warn: sleeping in atomic context
lib/dhry_run.c:43 dhry_benchmark() warn: sleeping in atomic context
Indeed, dhry() does sleeping allocations inside the non-preemptable
section delimited by get_cpu()/put_cpu().
Fix this by using atomic allocations instead.
Add error handling, as atomic these allocations may fail.
Link: https://lkml.kernel.org/r/bac6d517818a7cd8efe217c1ad649fffab9cc371.1688568764.git.geert+renesas@glider.be
Fixes: 13684e966d46283e ("lib: dhry: fix unstable smp_processor_id(_) usage")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/0469eb3a-02eb-4b41-b189-de20b931fa56@moroto.mountain
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit 946fa0dbf2d8 ("mm/slub: extend redzone check to extra allocated
kmalloc space than requested") added precise kmalloc redzone poisoning to
the slub_debug functionality.
However, this commit didn't account for HW_TAGS KASAN fully initializing
the object via its built-in memory initialization feature. Even though
HW_TAGS KASAN memory initialization contains special memory initialization
handling for when slub_debug is enabled, it does not account for in-object
slub_debug redzones. As a result, HW_TAGS KASAN can overwrite these
redzones and cause false-positive slub_debug reports.
To fix the issue, avoid HW_TAGS KASAN memory initialization when
slub_debug is enabled altogether. Implement this by moving the
__slub_debug_enabled check to slab_post_alloc_hook. Common slab code
seems like a more appropriate place for a slub_debug check anyway.
Link: https://lkml.kernel.org/r/678ac92ab790dba9198f9ca14f405651b97c8502.1688561016.git.andreyknvl@google.com
Fixes: 946fa0dbf2d8 ("mm/slub: extend redzone check to extra allocated kmalloc space than requested")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Will Deacon <will@kernel.org>
Acked-by: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: kasan-dev@googlegroups.com
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit bb6e04a173f0 ("kasan: use internal prototypes matching gcc-13
builtins") introduced a bug into the memory_is_poisoned_n implementation:
it effectively removed the cast to a signed integer type after applying
KASAN_GRANULE_MASK.
As a result, KASAN started failing to properly check memset, memcpy, and
other similar functions.
Fix the bug by adding the cast back (through an additional signed integer
variable to make the code more readable).
Link: https://lkml.kernel.org/r/8c9e0251c2b8b81016255709d4ec42942dcaf018.1688431866.git.andreyknvl@google.com
Fixes: bb6e04a173f0 ("kasan: use internal prototypes matching gcc-13 builtins")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
I am going to lose my vrull.eu address at the end of july, and while
adding it to mailmap I also realised that there are more old addresses
from me dangling, so update .mailmap for all of them.
Link: https://lkml.kernel.org/r/20230704163919.1136784-3-heiko@sntech.de
Signed-off-by: Heiko Stuebner <heiko@sntech.de>
Signed-off-by: Heiko Stuebner <heiko.stuebner@vrull.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Update .mailmap for my work address and fix manpage".
While updating mailmap for the going-away address, I also found that on
current systems the manpage linked from the header comment changed.
And in fact it looks like the git mailmap feature got its own manpage.
This patch (of 2):
On recent systems the git-shortlog manpage only tells people to
See gitmailmap(5)
So instead of sending people on a scavenger hunt, put that info into the
header directly. Though keep the old reference around for older systems.
Link: https://lkml.kernel.org/r/20230704163919.1136784-1-heiko@sntech.de
Link: https://lkml.kernel.org/r/20230704163919.1136784-2-heiko@sntech.de
Signed-off-by: Heiko Stuebner <heiko.stuebner@vrull.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
commit dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak in
put_page_bootmem") fix an overlaps existing problem of kmemleak. But the
problem still existed when HAVE_BOOTMEM_INFO_NODE is disabled, because in
this case, free_bootmem_page() will call free_reserved_page() directly.
Fix the problem by adding kmemleak_free_part() in free_bootmem_page() when
HAVE_BOOTMEM_INFO_NODE is disabled.
Link: https://lkml.kernel.org/r/20230704101942.2819426-1-liushixin2@huawei.com
Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add linux-next info to MAINTAINERS for ease of finding this data.
Link: https://lkml.kernel.org/r/20230704054410.12527-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add my old mail address and update my name.
Link: https://lkml.kernel.org/r/20230628081341.3470229-1-msp@baylibre.com
Signed-off-by: Markus Schneider-Pargmann <msp@baylibre.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
nr_to_write is a count of pages, so we need to decrease it by the number
of pages in the folio we just wrote, not by 1. Most callers specify
either LONG_MAX or 1, so are unaffected, but writeback_sb_inodes() might
end up writing 512x as many pages as it asked for.
Dave added:
: XFS is the only filesystem this would affect, right? AFAIA, nothing
: else enables large folios and uses writeback through
: write_cache_pages() at this point...
:
: In which case, I'd be surprised if much difference, if any, gets
: noticed by anyone.
Link: https://lkml.kernel.org/r/20230628185548.981888-1-willy@infradead.org
Fixes: 793917d997df ("mm/readahead: Add large folio readahead")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Commit c145e0b47c77 ("mm: streamline COW logic in do_swap_page()") moved
the call to swap_free() before the call to set_pte_at(), which meant that
the MTE tags could end up being freed before set_pte_at() had a chance to
restore them. Fix it by adding a call to the arch_swap_restore() hook
before the call to swap_free().
Link: https://lkml.kernel.org/r/20230523004312.1807357-2-pcc@google.com
Link: https://linux-review.googlesource.com/id/I6470efa669e8bd2f841049b8c61020c510678965
Fixes: c145e0b47c77 ("mm: streamline COW logic in do_swap_page()")
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reported-by: Qun-wei Lin <Qun-wei.Lin@mediatek.com>
Closes: https://lore.kernel.org/all/5050805753ac469e8d727c797c2218a9d780d434.camel@mediatek.com/
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org> [6.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Migration replaces the page in the mapping before copying the contents and
the flags over from the old page, so check that the page in the page cache
is really up to date before using it. Without this, stressing squashfs
reads with parallel compaction sometimes results in squashfs reporting
data corruption.
Link: https://lkml.kernel.org/r/20230629-squashfs-cache-migration-v1-1-d50ebe55099d@axis.com
Fixes: e994f5b677ee ("squashfs: cache partial compressed blocks")
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The following crash happens for me when running the -mm selftests (below).
Specifically, it happens while running the uffd-stress subtests:
kernel BUG at mm/hugetlb.c:7249!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 3238 Comm: uffd-stress Not tainted 6.4.0-hubbard-github+ #109
Hardware name: ASUS X299-A/PRIME X299-A, BIOS 1503 08/03/2018
RIP: 0010:huge_pte_alloc+0x12c/0x1a0
...
Call Trace:
<TASK>
? __die_body+0x63/0xb0
? die+0x9f/0xc0
? do_trap+0xab/0x180
? huge_pte_alloc+0x12c/0x1a0
? do_error_trap+0xc6/0x110
? huge_pte_alloc+0x12c/0x1a0
? handle_invalid_op+0x2c/0x40
? huge_pte_alloc+0x12c/0x1a0
? exc_invalid_op+0x33/0x50
? asm_exc_invalid_op+0x16/0x20
? __pfx_put_prev_task_idle+0x10/0x10
? huge_pte_alloc+0x12c/0x1a0
hugetlb_fault+0x1a3/0x1120
? finish_task_switch+0xb3/0x2a0
? lock_is_held_type+0xdb/0x150
handle_mm_fault+0xb8a/0xd40
? find_vma+0x5d/0xa0
do_user_addr_fault+0x257/0x5d0
exc_page_fault+0x7b/0x1f0
asm_exc_page_fault+0x22/0x30
That happens because a BUG() statement in huge_pte_alloc() attempts to
check that a pte, if present, is a hugetlb pte, but it does so in a
non-lockless-safe manner that leads to a false BUG() report.
We got here due to a couple of bugs, each of which by itself was not quite
enough to cause a problem:
First of all, before commit c33c794828f2("mm: ptep_get() conversion"), the
BUG() statement in huge_pte_alloc() was itself fragile: it relied upon
compiler behavior to only read the pte once, despite using it twice in the
same conditional.
Next, commit c33c794828f2 ("mm: ptep_get() conversion") broke that
delicate situation, by causing all direct pte reads to be done via
READ_ONCE(). And so READ_ONCE() got called twice within the same BUG()
conditional, leading to comparing (potentially, occasionally) different
versions of the pte, and thus to false BUG() reports.
Fix this by taking a single snapshot of the pte before using it in the
BUG conditional.
Now, that commit is only partially to blame here but, people doing
bisections will invariably land there, so this will help them find a fix
for a real crash. And also, the previous behavior was unlikely to ever
expose this bug--it was fragile, yet not actually broken.
So that's why I chose this commit for the Fixes tag, rather than the
commit that created the original BUG() statement.
Link: https://lkml.kernel.org/r/20230701010442.2041858-1-jhubbard@nvidia.com
Fixes: c33c794828f2 ("mm: ptep_get() conversion")
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: James Houghton <jthoughton@google.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The ocfs2-devel mailing list has been migrated to the kernel.org
infrastructure, update all related documentation pointers to reflect the
change.
Link: https://lkml.kernel.org/r/20230628013437.47030-3-ailiop@suse.com
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The ocfs2-devel mailing list has been migrated to the kernel.org
infrastructure, update the related entry to reflect the change.
Link: https://lkml.kernel.org/r/20230628013437.47030-2-ailiop@suse.com
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
A memory corruption was reported in [1] with bisection pointing to the
patch [2] enabling per-VMA locks for x86. Disable per-VMA locks config to
prevent this issue until the fix is confirmed. This is expected to be a
temporary measure.
[1] https://bugzilla.kernel.org/show_bug.cgi?id=217624
[2] https://lore.kernel.org/all/20230227173632.3292573-30-surenb@google.com
Link: https://lkml.kernel.org/r/20230706011400.2949242-3-surenb@google.com
Reported-by: Jiri Slaby <jirislaby@kernel.org>
Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf51b@kernel.org/
Reported-by: Jacob Young <jacobly.alt@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624
Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Holger Hoffstätte <holger@applied-asynchrony.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|