kernel/linux.git/arch/arm64/include/asm/memory.h, branch v6.18.22

kasan/hw-tags: introduce kasan.write_only option

2025-09-21T21:22:10+00:00

Patch series "introduce kasan.write_only option in hw-tags", v8. Hardware tag based KASAN is implemented using the Memory Tagging Extension (MTE) feature. MTE is built on top of the ARMv8.0 virtual address tagging TBI (Top Byte Ignore) feature and allows software to access a 4-bit allocation tag for each 16-byte granule in the physical address space. A logical tag is derived from bits 59-56 of the virtual address used for the memory access. A CPU with MTE enabled will compare the logical tag against the allocation tag and potentially raise an tag check fault on mismatch, subject to system registers configuration. Since ARMv8.9, FEAT_MTE_STORE_ONLY can be used to restrict raise of tag check fault on store operation only. Using this feature (FEAT_MTE_STORE_ONLY), introduce KASAN write-only mode which restricts KASAN check write (store) operation only. This mode omits KASAN check for read (fetch/load) operation. Therefore, it might be used not only debugging purpose but also in normal environment. This patch (of 2): Since Armv8.9, FEATURE_MTE_STORE_ONLY feature is introduced to restrict raise of tag check fault on store operation only. Introduce KASAN write only mode based on this feature. KASAN write only mode restricts KASAN checks operation for write only and omits the checks for fetch/read operations when accessing memory. So it might be used not only debugging enviroment but also normal enviroment to check memory safty. This features can be controlled with "kasan.write_only" arguments. When "kasan.write_only=on", KASAN checks write operation only otherwise KASAN checks all operations. This changes the MTE_STORE_ONLY feature as BOOT_CPU_FEATURE like ARM64_MTE_ASYMM so that makes it initialise in kasan_init_hw_tags() with other function together. Link: https://lkml.kernel.org/r/20250916222755.466009-1-yeoreum.yun@arm.com Link: https://lkml.kernel.org/r/20250916222755.466009-2-yeoreum.yun@arm.com Signed-off-by: Yeoreum Yun Reviewed-by: Catalin Marinas Reviewed-by: Andrey Ryabinin Cc: Alexander Potapenko Cc: Andrey Konovalov Cc: Ard Biesheuvel Cc: Breno Leitao Cc: David Hildenbrand Cc: Dmitriy Vyukov Cc: D Scott Phillips Cc: Hardevsinh Palaniya Cc: James Morse Cc: John Hubbard Cc: Jonathan Corbet Cc: Kalesh Singh Cc: levi.yun Cc: Marc Zyngier Cc: Mark Brown Cc: Oliver Upton Cc: Pankaj Gupta Cc: Vincenzo Frascino Cc: Will Deacon Cc: Yang Shi Signed-off-by: Andrew Morton

arm64: Remove CONFIG_VMAP_STACK conditionals from THREAD_SHIFT and THREAD_ALIGN

2025-07-08T12:41:08+00:00

Now that VMAP_STACK is always enabled on arm64, remove the CONFIG_VMAP_STACK conditional logic from the definitions of THREAD_SHIFT and THREAD_ALIGN in arch/arm64/include/asm/memory.h. This simplifies the code by unconditionally setting THREAD_ALIGN to (2 * THREAD_SIZE) and adjusting the THREAD_SHIFT definition to only depend on MIN_THREAD_SHIFT and PAGE_SHIFT. This change reflects the updated arm64 stack model, where all kernel threads use virtually mapped stacks with guard pages, and ensures alignment and stack sizing are consistently handled. Signed-off-by: Breno Leitao Acked-by: Ard Biesheuvel Acked-by: Mark Rutland Link: https://lore.kernel.org/r/20250707-arm64_vmap-v1-3-8de98ca0f91c@debian.org Signed-off-by: Will Deacon

arm64: kvm: Introduce nvhe stack size constants

2025-01-08T11:25:28+00:00

Refactor nvhe stack code to use NVHE_STACK_SIZE/SHIFT constants, instead of directly using PAGE_SIZE/SHIFT. This makes the code a bit easier to read, without introducing any functional changes. Cc: Marc Zyngier Cc: Mark Brown Cc: Mark Rutland Cc: Will Deacon Signed-off-by: Kalesh Singh Link: https://lore.kernel.org/r/20241112003336.1375584-1-kaleshsingh@google.com Signed-off-by: Marc Zyngier

Merge tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

2024-11-23T17:58:07+00:00

Pull MM updates from Andrew Morton: - The series "zram: optimal post-processing target selection" from Sergey Senozhatsky improves zram's post-processing selection algorithm. This leads to improved memory savings. - Wei Yang has gone to town on the mapletree code, contributing several series which clean up the implementation: - "refine mas_mab_cp()" - "Reduce the space to be cleared for maple_big_node" - "maple_tree: simplify mas_push_node()" - "Following cleanup after introduce mas_wr_store_type()" - "refine storing null" - The series "selftests/mm: hugetlb_fault_after_madv improvements" from David Hildenbrand fixes this selftest for s390. - The series "introduce pte_offset_map_{ro|rw}_nolock()" from Qi Zheng implements some rationaizations and cleanups in the page mapping code. - The series "mm: optimize shadow entries removal" from Shakeel Butt optimizes the file truncation code by speeding up the handling of shadow entries. - The series "Remove PageKsm()" from Matthew Wilcox completes the migration of this flag over to being a folio-based flag. - The series "Unify hugetlb into arch_get_unmapped_area functions" from Oscar Salvador implements a bunch of consolidations and cleanups in the hugetlb code. - The series "Do not shatter hugezeropage on wp-fault" from Dev Jain takes away the wp-fault time practice of turning a huge zero page into small pages. Instead we replace the whole thing with a THP. More consistent cleaner and potentiall saves a large number of pagefaults. - The series "percpu: Add a test case and fix for clang" from Andy Shevchenko enhances and fixes the kernel's built in percpu test code. - The series "mm/mremap: Remove extra vma tree walk" from Liam Howlett optimizes mremap() by avoiding doing things which we didn't need to do. - The series "Improve the tmpfs large folio read performance" from Baolin Wang teaches tmpfs to copy data into userspace at the folio size rather than as individual pages. A 20% speedup was observed. - The series "mm/damon/vaddr: Fix issue in damon_va_evenly_split_region()" fro Zheng Yejian fixes DAMON splitting. - The series "memcg-v1: fully deprecate charge moving" from Shakeel Butt removes the long-deprecated memcgv2 charge moving feature. - The series "fix error handling in mmap_region() and refactor" from Lorenzo Stoakes cleanup up some of the mmap() error handling and addresses some potential performance issues. - The series "x86/module: use large ROX pages for text allocations" from Mike Rapoport teaches x86 to use large pages for read-only-execute module text. - The series "page allocation tag compression" from Suren Baghdasaryan is followon maintenance work for the new page allocation profiling feature. - The series "page->index removals in mm" from Matthew Wilcox remove most references to page->index in mm/. A slow march towards shrinking struct page. - The series "damon/{self,kunit}tests: minor fixups for DAMON debugfs interface tests" from Andrew Paniakin performs maintenance work for DAMON's self testing code. - The series "mm: zswap swap-out of large folios" from Kanchana Sridhar improves zswap's batching of compression and decompression. It is a step along the way towards using Intel IAA hardware acceleration for this zswap operation. - The series "kasan: migrate the last module test to kunit" from Sabyrzhan Tasbolatov completes the migration of the KASAN built-in tests over to the KUnit framework. - The series "implement lightweight guard pages" from Lorenzo Stoakes permits userapace to place fault-generating guard pages within a single VMA, rather than requiring that multiple VMAs be created for this. Improved efficiencies for userspace memory allocators are expected. - The series "memcg: tracepoint for flushing stats" from JP Kobryn uses tracepoints to provide increased visibility into memcg stats flushing activity. - The series "zram: IDLE flag handling fixes" from Sergey Senozhatsky fixes a zram buglet which potentially affected performance. - The series "mm: add more kernel parameters to control mTHP" from Maíra Canal enhances our ability to control/configuremultisize THP from the kernel boot command line. - The series "kasan: few improvements on kunit tests" from Sabyrzhan Tasbolatov has a couple of fixups for the KASAN KUnit tests. - The series "mm/list_lru: Split list_lru lock into per-cgroup scope" from Kairui Song optimizes list_lru memory utilization when lockdep is enabled. * tag 'mm-stable-2024-11-18-19-27' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (215 commits) cma: enforce non-zero pageblock_order during cma_init_reserved_mem() mm/kfence: add a new kunit test test_use_after_free_read_nofault() zram: fix NULL pointer in comp_algorithm_show() memcg/hugetlb: add hugeTLB counters to memcg vmstat: call fold_vm_zone_numa_events() before show per zone NUMA event mm: mmap_lock: check trace_mmap_lock_$type_enabled() instead of regcount zram: ZRAM_DEF_COMP should depend on ZRAM MAINTAINERS/MEMORY MANAGEMENT: add document files for mm Docs/mm/damon: recommend academic papers to read and/or cite mm: define general function pXd_init() kmemleak: iommu/iova: fix transient kmemleak false positive mm/list_lru: simplify the list_lru walk callback function mm/list_lru: split the lock to per-cgroup scope mm/list_lru: simplify reparenting and initial allocation mm/list_lru: code clean up for reparenting mm/list_lru: don't export list_lru_add mm/list_lru: don't pass unnecessary key parameters kasan: add kunit tests for kmalloc_track_caller, kmalloc_node_track_caller kasan: change kasan_atomics kunit test as KUNIT_CASE_SLOW kasan: use EXPORT_SYMBOL_IF_KUNIT to export symbols ...

kaslr: rename physmem_end and PHYSMEM_END to direct_map_physmem_end

2024-11-07T04:11:11+00:00

For clarity. It's increasingly hard to reason about the code, when KASLR is moving around the boundaries. In this case where KASLR is randomizing the location of the kernel image within physical memory, the maximum number of address bits for physical memory has not changed. What has changed is the ending address of memory that is allowed to be directly mapped by the kernel. Let's name the variable, and the associated macro accordingly. Also, enhance the comment above the direct_map_physmem_end definition, to further clarify how this all works. Link: https://lkml.kernel.org/r/20241009025024.89813-1-jhubbard@nvidia.com Signed-off-by: John Hubbard Reviewed-by: Pankaj Gupta Acked-by: David Hildenbrand Acked-by: Will Deacon Reviewed-by: Mike Rapoport (Microsoft) Cc: Thomas Gleixner Cc: Alistair Popple Cc: Jordan Niethe Signed-off-by: Andrew Morton

asm-generic: provide generic page_to_phys and phys_to_page implementations

2024-10-28T21:44:28+00:00

page_to_phys is duplicated by all architectures, and from some strange reason placed in where it doesn't fit at all. phys_to_page is only provided by a few architectures despite having a lot of open coded users. Provide generic versions in to make these helpers more easily usable. Note with this patch powerpc loses the CONFIG_DEBUG_VIRTUAL pfn_valid check. It will be added back in a generic version later. Signed-off-by: Christoph Hellwig Signed-off-by: Arnd Bergmann

arm64: Expose the end of the linear map in PHYSMEM_END

2024-09-04T15:39:58+00:00

The memory hot-plug and resource management code needs to know the largest address which can fit in the linear map, so set PHYSMEM_END for that purpose. This fixes a crash at boot when amdgpu tries to create DEVICE_PRIVATE_MEMORY and is given a physical address by the resource management code which is outside the range which can have a `struct page` | Unable to handle kernel paging request at virtual address 000001ffa6000034 | user pgtable: 4k pages, 48-bit VAs, pgdp=000008000287c000 | [000001ffa6000034] pgd=0000000000000000, p4d=0000000000000000 | Call trace: | __init_zone_device_page.constprop.0+0x2c/0xa8 | memmap_init_zone_device+0xf0/0x210 | pagemap_range+0x1e0/0x410 | memremap_pages+0x18c/0x2e0 | devm_memremap_pages+0x30/0x90 | kgd2kfd_init_zone_device+0xf0/0x200 [amdgpu] | amdgpu_device_ip_init+0x674/0x888 [amdgpu] | amdgpu_device_init+0x7a4/0xea0 [amdgpu] | amdgpu_driver_load_kms+0x28/0x1c0 [amdgpu] | amdgpu_pci_probe+0x1a0/0x560 [amdgpu] Signed-off-by: D Scott Phillips Link: https://lore.kernel.org/r/20240903164532.3874988-1-scott@os.amperecomputing.com Signed-off-by: Will Deacon

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

2024-03-15T20:03:13+00:00

Pull kvm updates from Paolo Bonzini: "S390: - Changes to FPU handling came in via the main s390 pull request - Only deliver to the guest the SCLP events that userspace has requested - More virtual vs physical address fixes (only a cleanup since virtual and physical address spaces are currently the same) - Fix selftests undefined behavior x86: - Fix a restriction that the guest can't program a PMU event whose encoding matches an architectural event that isn't included in the guest CPUID. The enumeration of an architectural event only says that if a CPU supports an architectural event, then the event can be programmed *using the architectural encoding*. The enumeration does NOT say anything about the encoding when the CPU doesn't report support the event *in general*. It might support it, and it might support it using the same encoding that made it into the architectural PMU spec - Fix a variety of bugs in KVM's emulation of RDPMC (more details on individual commits) and add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID and therefore are easier to validate with selftests than with custom guests (aka kvm-unit-tests) - Zero out PMU state on AMD if the virtual PMU is disabled, it does not cause any bug but it wastes time in various cases where KVM would check if a PMC event needs to be synthesized - Optimize triggering of emulated events, with a nice ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit - Fix a bug where KVM would report stale/bogus exit qualification information when exiting to userspace with an internal error exit code - Add a VMX flag in /proc/cpuinfo to report 5-level EPT support - Rework TDP MMU root unload, free, and alloc to run with mmu_lock held for read, e.g. to avoid serializing vCPUs when userspace deletes a memslot - Tear down TDP MMU page tables at 4KiB granularity (used to be 1GiB). KVM doesn't support yielding in the middle of processing a zap, and 1GiB granularity resulted in multi-millisecond lags that are quite impolite for CONFIG_PREEMPT kernels - Allocate write-tracking metadata on-demand to avoid the memory overhead when a kernel is built with i915 virtualization support but the workloads use neither shadow paging nor i915 virtualization - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives - Fix the debugregs ABI for 32-bit KVM - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit, which allowed some optimization for both Intel and AMD - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, causing extra unnecessary work - Cleanup the logic for checking if the currently loaded vCPU is in-kernel - Harden against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsehwere in the kernel) are detected earlier and are less likely to hang the kernel x86 Xen emulation: - Overlay pages can now be cached based on host virtual address, instead of guest physical addresses. This removes the need to reconfigure and invalidate the cache if the guest changes the gpa but the underlying host virtual address remains the same - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior) - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs RISC-V: - Support exception and interrupt handling in selftests - New self test for RISC-V architectural timer (Sstc extension) - New extension support (Ztso, Zacas) - Support userspace emulation of random number seed CSRs ARM: - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address serialization some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests LoongArch: - Set reserved bits as zero in CPUCFG - Start SW timer only when vcpu is blocking - Do not restart SW timer when it is expired - Remove unnecessary CSR register saving during enter guest - Misc cleanups and fixes as usual Generic: - Clean up Kconfig by removing CONFIG_HAVE_KVM, which was basically always true on all architectures except MIPS (where Kconfig determines the available depending on CPU capabilities). It is replaced either by an architecture-dependent symbol for MIPS, and IS_ENABLED(CONFIG_KVM) everywhere else - Factor common "select" statements in common code instead of requiring each architecture to specify it - Remove thoroughly obsolete APIs from the uapi headers - Move architecture-dependent stuff to uapi/asm/kvm.h - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a very unlikely use-after-free if kvm.ko is unloaded - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, so that there's no need to remember to *conditionally* clean up after the worker Selftests: - Reduce boilerplate especially when utilize selftest TAP infrastructure - Add basic smoke tests for SEV and SEV-ES, along with a pile of library support for handling private/encrypted/protected memory - Fix benign bugs where tests neglect to close() guest_memfd files" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (246 commits) selftests: kvm: remove meaningless assignments in Makefiles KVM: riscv: selftests: Add Zacas extension to get-reg-list test RISC-V: KVM: Allow Zacas extension for Guest/VM KVM: riscv: selftests: Add Ztso extension to get-reg-list test RISC-V: KVM: Allow Ztso extension for Guest/VM RISC-V: KVM: Forward SEED CSR access to user space KVM: riscv: selftests: Add sstc timer test KVM: riscv: selftests: Change vcpu_has_ext to a common function KVM: riscv: selftests: Add guest helper to get vcpu id KVM: riscv: selftests: Add exception handling support LoongArch: KVM: Remove unnecessary CSR register saving during enter guest LoongArch: KVM: Do not restart SW timer when it is expired LoongArch: KVM: Start SW timer only when vcpu is blocking LoongArch: KVM: Set reserved bits as zero in CPUCFG KVM: selftests: Explicitly close guest_memfd files in some gmem tests KVM: x86/xen: fix recursive deadlock in timer injection KVM: pfncache: simplify locking and make more self-contained KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled KVM: x86/xen: improve accuracy of Xen timers ...

KVM: arm64: Introduce new flag for non-cacheable IO memory

2024-02-24T17:57:39+00:00

Currently, KVM for ARM64 maps at stage 2 memory that is considered device (i.e. it is not RAM) with DEVICE_nGnRE memory attributes; this setting overrides (as per the ARM architecture [1]) any device MMIO mapping present at stage 1, resulting in a set-up whereby a guest operating system cannot determine device MMIO mapping memory attributes on its own but it is always overridden by the KVM stage 2 default. This set-up does not allow guest operating systems to select device memory attributes independently from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage 2 memory type attributes"), which turns out to be an issue in that guest operating systems (e.g. Linux) may request to map devices MMIO regions with memory attributes that guarantee better performance (e.g. gathering attribute - that for some devices can generate larger PCIe memory writes TLPs) and specific operations (e.g. unaligned transactions) such as the NormalNC memory type. The default device stage 2 mapping was chosen in KVM for ARM64 since it was considered safer (i.e. it would not allow guests to trigger uncontained failures ultimately crashing the machine) but this turned out to be asynchronous (SError) defeating the purpose. Failures containability is a property of the platform and is independent from the memory type used for MMIO device memory mappings. Actually, DEVICE_nGnRE memory type is even more problematic than Normal-NC memory type in terms of faults containability in that e.g. aborts triggered on DEVICE_nGnRE loads cannot be made, architecturally, synchronous (i.e. that would imply that the processor should issue at most 1 load transaction at a time - it cannot pipeline them - otherwise the synchronous abort semantics would break the no-speculation attribute attached to DEVICE_XXX memory). This means that regardless of the combined stage1+stage2 mappings a platform is safe if and only if device transactions cannot trigger uncontained failures and that in turn relies on platform capabilities and the device type being assigned (i.e. PCIe AER/DPC error containment and RAS architecture[3]); therefore the default KVM device stage 2 memory attributes play no role in making device assignment safer for a given platform (if the platform design adheres to design guidelines outlined in [3]) and therefore can be relaxed. For all these reasons, relax the KVM stage 2 device memory attributes from DEVICE_nGnRE to Normal-NC. The NormalNC was chosen over a different Normal memory type default at stage-2 (e.g. Normal Write-through) to avoid cache allocation/snooping. Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to trigger any issue on guest device reclaim use cases either (i.e. device MMIO unmap followed by a device reset) at least for PCIe devices, in that in PCIe a device reset is architected and carried out through PCI config space transactions that are naturally ordered with respect to MMIO transactions according to the PCI ordering rules. Having Normal-NC S2 default puts guests in control (thanks to stage1+stage2 combined memory attributes rules [1]) of device MMIO regions memory mappings, according to the rules described in [1] and summarized here ([(S1) - stage1], [(S2) - stage 2]): S1 | S2 | Result NORMAL-WB | NORMAL-NC | NORMAL-NC NORMAL-WT | NORMAL-NC | NORMAL-NC NORMAL-NC | NORMAL-NC | NORMAL-NC DEVICE | NORMAL-NC | DEVICE It is worth noting that currently, to map devices MMIO space to user space in a device pass-through use case the VFIO framework applies memory attributes derived from pgprot_noncached() settings applied to VMAs, which result in device-nGnRnE memory attributes for the stage-1 VMM mappings. This means that a userspace mapping for device MMIO space carried out with the current VFIO framework and a guest OS mapping for the same MMIO space may result in a mismatched alias as described in [2]. Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change anything in this respect, in that the mismatched aliases would only affect (refer to [2] for a detailed explanation) ordering between the userspace and GuestOS mappings resulting stream of transactions (i.e. it does not cause loss of property for either stream of transactions on its own), which is harmless given that the userspace and GuestOS access to the device is carried out through independent transactions streams. A Normal-NC flag is not present today. So add a new kvm_pgtable_prot (KVM_PGTABLE_PROT_NORMAL_NC) flag for it, along with its corresponding PTE value 0x5 (0b101) determined from [1]. Lastly, adapt the stage2 PTE property setter function (stage2_set_prot_attr) to handle the NormalNC attribute. The entire discussion leading to this patch series may be followed through the following links. Link: https://lore.kernel.org/all/20230907181459.18145-3-ankita@nvidia.com Link: https://lore.kernel.org/r/20231205033015.10044-1-ankita@nvidia.com [1] section D8.5.5 - DDI0487J_a_a-profile_architecture_reference_manual.pdf [2] section B2.8 - DDI0487J_a_a-profile_architecture_reference_manual.pdf [3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf Suggested-by: Jason Gunthorpe Acked-by: Catalin Marinas Acked-by: Will Deacon Reviewed-by: Marc Zyngier Signed-off-by: Ankit Agrawal Link: https://lore.kernel.org/r/20240224150546.368-2-ankita@nvidia.com Signed-off-by: Oliver Upton

arm64: Enable LPA2 at boot if supported by the system

2024-02-16T12:42:40+00:00

Update the early kernel mapping code to take 52-bit virtual addressing into account based on the LPA2 feature. This is a bit more involved than LVA (which is supported with 64k pages only), given that some page table descriptor bits change meaning in this case. To keep the handling in asm to a minimum, the initial ID map is still created with 48-bit virtual addressing, which implies that the kernel image must be loaded into 48-bit addressable physical memory. This is currently required by the boot protocol, even though we happen to support placement outside of that for LVA/64k based configurations. Enabling LPA2 involves more than setting TCR.T1SZ to a lower value, there is also a DS bit in TCR that needs to be set, and which changes the meaning of bits [9:8] in all page table descriptors. Since we cannot enable DS and every live page table descriptor at the same time, let's pivot through another temporary mapping. This avoids the need to reintroduce manipulations of the page tables with the MMU and caches disabled. To permit the LPA2 feature to be overridden on the kernel command line, which may be necessary to work around silicon errata, or to deal with mismatched features on heterogeneous SoC designs, test for CPU feature overrides first, and only then enable LPA2. Signed-off-by: Ard Biesheuvel Link: https://lore.kernel.org/r/20240214122845.2033971-78-ardb+git@google.com Signed-off-by: Catalin Marinas