summaryrefslogtreecommitdiff
path: root/arch/powerpc
AgeCommit message (Collapse)AuthorFilesLines
2024-06-28powerpc/4xx: Remove CONFIG_BOOKE_OR_40xMichael Ellerman18-26/+21
Now that 40x is gone, replace CONFIG_BOOKE_OR_40x by CONFIG_BOOKE. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240628121201.130802-5-mpe@ellerman.id.au
2024-06-28powerpc: Remove core support for 40xChristophe Leroy28-1727/+14
Now that 40x platforms have gone, remove support for 40x in the core of powerpc arch. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240628121201.130802-4-mpe@ellerman.id.au
2024-06-28powerpc: Remove 40x from Kconfig and defconfigMichael Ellerman5-43/+10
Remove 40x from Kconfig, making the code unreachable. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240628121201.130802-3-mpe@ellerman.id.au
2024-06-28powerpc/boot: Remove all 40x platforms from bootChristophe Leroy17-2875/+1
Remove 40x platforms from the boot directory. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240628121201.130802-2-mpe@ellerman.id.au
2024-06-28powerpc/40x: Remove 40x platforms.Christophe Leroy12-586/+0
40x platforms have been orphaned for many years. Remove them. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240628121201.130802-1-mpe@ellerman.id.au
2024-06-28powerpc/iommu: Reimplement the iommu_table_group_ops for pSeriesShivaprasad G Bhat4-98/+543
PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows for PEs, their ownership transfer, create/set/unset the TCE tables for the Dynamic DMA wundows(DDW). VFIOS uses these APIs for support on POWER. The commit 9d67c9433509 ("powerpc/iommu: Add "borrowing" iommu_table_group_ops") implemented partial support for this API with "borrow" mechanism wherein the DMA windows if created already by the host driver, they would be available for VFIO to use. Also, it didn't have the support to control/modify the window size or the IO page size. The current patch implements all the necessary iommu_table_group_ops APIs there by avoiding the "borrrowing". So, just the way it is on the PowerNV platform, with this patch the iommu table group ownership is transferred to the VFIO PPC subdriver, the iommu table, DMA windows creation/deletion all driven through the APIs. The pSeries uses the query-pe-dma-window, create-pe-dma-window and reset-pe-dma-window RTAS calls for DMA window creation, deletion and reset to defaul. The RTAs calls do show some minor differences to the way things are to be handled on the pSeries which are listed below. * On pSeries, the default DMA window size is "fixed" cannot be custom sized as requested by the user. For non-SRIOV VFs, It is fixed at 2GB and for SRIOV VFs, its variable sized based on the capacity assigned to it during the VF assignment to the LPAR. So, for the default DMA window alone the size if requested less than tce32_size, the smaller size is enforced using the iommu table->it_size. * The DMA start address for 32-bit window is 0, and for the 64-bit window in case of PowerNV is hardcoded to TVE select (bit 59) at 512PiB offset. This address is returned at the time of create_table() API call (even before the window is created), the subsequent set_window() call actually opens the DMA window. On pSeries, the DMA start address for 32-bit window is known from the 'ibm,dma-window' DT property. However, the 64-bit window start address is not known until the create-pe-dma RTAS call is made. So, the create_table() which returns the DMA window start address actually opens the DMA window and returns the DMA start address as returned by the Hypervisor for the create-pe-dma RTAS call. * The reset-pe-dma RTAS call resets the DMA windows and restores the default DMA window, however it does not clear the TCE table entries if there are any. In case of ownership transfer from platform domain which used direct mapping, the patch chooses remove-pe-dma instead of reset-pe for the 64-bit window intentionally so that the clear_dma_window() is called. Other than the DMA window management changes mentioned above, the patch also brings back the userspace view for the single level TCE as it existed before commit 090bad39b237a ("powerpc/powernv: Add indirect levels to it_userspace") along with the relavent refactoring. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923275958.1397.907964437142542242.stgit@linux.ibm.com
2024-06-28powerpc/iommu: Move dev_has_iommu_table() to iommu.cShivaprasad G Bhat3-16/+23
Move function dev_has_iommu_table() to powerpc/kernel/iommu.c as it is going to be used by machine specific iommu code as well in subsequent patches. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923274748.1397.6274953248403106679.stgit@linux.ibm.com
2024-06-28powerpc/pseries/iommu: Use the iommu table[0] for IOV VF's DDWShivaprasad G Bhat1-5/+7
This patch basically brings consistency with PowerNV approach to use the first freely available iommu table when the default window is removed. The pSeries iommu code convention has been that the table[0] is for the default 32 bit DMA window and the table[1] is for the 64 bit DDW. With VFs having only 1 DMA window, the default has to be removed for creating the larger DMA window. The existing code uses the table[1] for that, while marking the table[0] as NULL. This is fine as long as the host driver itself uses the device. For the VFIO user, on pSeries there is no way to skip table[0] as the VFIO subdriver uses the first freely available table. The window 0, when created as 64-bit DDW in that context would still be on table[0], as the maximum number of windows is 1. This won't have any impact for the host driver as the table is fetched from the device's iommu_table_base. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> [mpe: Rebase and resolve conflicts] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923272328.1397.1817843961216868850.stgit@linux.ibm.com
2024-06-28powerpc/pseries/iommu: Fix the VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctl outputShivaprasad G Bhat1-14/+67
The ioctl VFIO_IOMMU_SPAPR_TCE_GET_INFO is not reporting the actuals on the platform as not all the details are correctly collected during the platform probe/scan into the iommu_table_group. Collect the information during the device setup time as the DMA window property is already looked up on parent nodes anyway. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923271138.1397.7908302630061814623.stgit@linux.ibm.com
2024-06-28powerpc/iommu: Move pSeries specific functions to pseries/iommu.cShivaprasad G Bhat3-148/+149
The PowerNV specific table_group_ops are defined in powernv/pci-ioda.c. The pSeries specific table_group_ops are sitting in the generic powerpc file. Move it to where it actually belong(pseries/iommu.c). The functions are currently defined even for CONFIG_PPC_POWERNV which are unused on PowerNV. Only code movement, no functional changes intended. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/171923269701.1397.15758640002786937132.stgit@linux.ibm.com
2024-06-17powerpc/kexec_file: fix cpus node update to FDTSourabh Jain1-16/+37
While updating the cpus node, commit 40c753993e3a ("powerpc/kexec_file: Use current CPU info while setting up FDT") first deletes all subnodes under the /cpus node. However, while adding sub-nodes back, it missed adding cpus subnodes whose device_type != "cpu", such as l2-cache*, l3-cache*, ibm,powerpc-cpu-features. Fix this by only deleting cpus sub-nodes of device_type == "cpus" and then adding all available nodes with device_type == "cpu". Fixes: 40c753993e3a ("powerpc/kexec_file: Use current CPU info while setting up FDT") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240510102235.2269496-3-sourabhjain@linux.ibm.com
2024-06-17powerpc/kexec_file: fix extra size calculation for kexec FDTSourabh Jain3-38/+33
While setting up the FDT for kexec, CPU nodes that are added after the system boots and reserved memory ranges are incorporated into the initial_boot_params (base FDT). However, they are not taken into account when determining the additional size needed for the kexec FDT. As a result, kexec fails to load, generating the following error: [1116.774451] Error updating memory reserve map: FDT_ERR_NOSPACE kexec_file_load failed: No such process Therefore, consider the extra size for CPU nodes added post-system boot and reserved memory ranges while preparing the kexec FDT. While adding a new parameter to the setup_new_fdt_ppc64 function, it was noticed that there were a couple of unused parameters, so they were removed. Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240510102235.2269496-2-sourabhjain@linux.ibm.com
2024-06-17powerpc/perf: Set cpumode flags using sample addressAnjali K3-28/+23
Currently in some cases, when the sampled instruction address register latches to a specific address during sampling, the privilege bits captured in the sampled event register are incorrect. For example, a snippet from the perf report on a power10 system is: Overhead Address Command Shared Object Symbol ........ .................. ............ ................. ....................... 2.41% 0x7fff9f94a02c null_syscall [unknown] [k] 0x00007fff9f94a02c 2.20% 0x7fff9f94a02c null_syscall libc.so.6 [.] syscall perf_get_misc_flags() function looks at the privilege bits to return the corresponding flags to be used for the address symbol and these privilege bit details are read from the sampled event register. In the above snippet, address "0x00007fff9f94a02c" is shown as "k" (kernel) due to the incorrect privilege bits captured in the sampled event register. To address this case check whether the sampled address is in the kernel area. Since this is specific to the latest platform, a new pmu flag is added called "PPMU_P10" and is used to contain the proposed fix. PPMU_P10_DD1 marked events are also included under PPMU_P10, hence remove the code specific to PPMU_P10_DD1 marked events. Signed-off-by: Anjali K <anjalik@linux.ibm.com> Reviewed-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com <mailto:atrajeev@linux.vnet.ibm.com>> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240528040356.2722275-1-anjalik@linux.ibm.com
2024-06-12powerpc/configs: drop RT_GROUP_SCHED=y from ppc6xx_defconfigCeleste Liu1-1/+0
Commit 6e5f1537833a ("powerpc: Add a 6xx defconfig") said it was copied from fedore's ppc32 defconfig, but at least since 2015-06-10, Fedora has dropped this option.[1] For cgroup v1, if turned on, and there's any cgroup in the "cpu" hierarchy it needs an RT budget assigned, otherwise the processes in it will not be able to get RT at all. The problem with RT group scheduling is that it requires the budget assigned but there's no way we could assign a default budget, since the values to assign are both upper and lower time limits, are absolute, and need to be sum up to < 1 for each individal cgroup. That means we cannot really come up with values that would work by default in the general case.[1] For cgroup v2, it's almost unusable as well. If it turned on, the cpu controller can only be enabled when all RT processes are in the root cgroup. But it will lose the benefits of cgroup v2 if all RT process were placed in the same cgroup. systemd also doesn't support it.[2] [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1229700 [2]: https://github.com/systemd/systemd/issues/13781#issuecomment-549164383 Signed-off-by: Celeste Liu <CoelacanthusHex@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240530111947.549474-12-CoelacanthusHex@gmail.com
2024-06-04powerpc/mm/drmem: Silence drmem_init() early returnNathan Lynch1-3/+1
It's not an error or noteworthy condition if the "ibm,dynamic-reconfiguration-memory" node isn't present. Drop the needless message. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240603-silence-drmem_init-v1-1-e9d71646bc3d@linux.ibm.com
2024-06-04powerpc/pseries/vas: Use usleep_range() to support HCALL delayHaren Myneni1-1/+21
VAS allocate, modify and deallocate HCALLs returns H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC for busy delay and expects OS to reissue HCALL after that delay. But using msleep() will often sleep at least 20 msecs even though the hypervisor suggests OS reissue these HCALLs after 1 or 10msecs. The open and close VAS window functions hold mutex and then issue these HCALLs. So these operations can take longer than the necessary when multiple threads issue open or close window APIs simultaneously, especially might affect the performance in the case of repeat open/close APIs for each compression request. Multiple tasks can open / close VAS windows at the same time which depends on the available VAS credits. For example, 240 cores system provides 4800 VAS credits. It means 4800 tasks can execute open VAS windows HCALLs with the mutex. Since each msleep() will often sleep more than 20 msecs, some tasks are waiting more than 120 secs to acquire mutex. It can cause hung traces for these tasks in dmesg due to mutex contention around open/close HCALLs. Instead of msleep(), use usleep_range() to ensure sleep with the expected value before issuing HCALL again. So since each task sleep 10 msecs maximum, this patch allow more tasks can issue open/close VAS calls without any hung traces in the dmesg. Signed-off-by: Haren Myneni <haren@linux.ibm.com> Suggested-by: Nathan Lynch <nathanl@linux.ibm.com> Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240116055910.421605-1-haren@linux.ibm.com
2024-06-04powerpc/numa: Online a node if PHB is attached.Nilay Shroff2-1/+27
In the current design, a numa-node is made online only if that node is attached to cpu/memory. With this design, if any PCI/IO device is found to be attached to a numa-node which is not online then the numa-node id of the corresponding PCI/IO device is set to NUMA_NO_NODE(-1). This design may negatively impact the performance of PCIe device if the numa-node assigned to PCIe device is -1 because in such case we may not be able to accurately calculate the distance between two nodes. The multi-controller NVMe PCIe disk has an issue with calculating the node distance if the PCIe NVMe controller is attached to a PCI host bridge which has numa-node id value set to NUMA_NO_NODE. This patch helps fix this ensuring that a cpu/memory less numa node is made online if it's attached to PCI host bridge. Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240517142531.3273464-3-nilay@linux.ibm.com
2024-06-04powerpc/pseries/iommu: Split Dynamic DMA Window to be used in Hybrid modeGaurav Batra2-17/+54
Dynamic DMA Window (DDW) supports TCEs that are backed by 2MB page size. In most configurations, DDW is big enough to pre-map all of LPAR memory for IO. Pre-mapping of memory for DMA results in improvements in IO performance. Persistent memory, vPMEM, can be assigned to an LPAR as well. vPMEM is not contiguous with LPAR memory and usually is assigned at high memory addresses. This makes is not possible to pre-map both vPMEM and LPAR memory in the same DDW. For a dedicated adapter this limitation is not an issue. Dedicated adapters can have both Default DMA window, which is backed by 4K page size and a DDW backed by 2MB page size TCEs. In this scenario, LPAR memory is pre-mapped in the DDW. Any DMA going to the vPMEM is routed via dynamically allocated TCEs in the default window. The issue arises with SR-IOV adapters. There is only one DMA window - either Default or DDW. If an LPAR has vPMEM assigned, memory is not pre-mapped in the DDW since TCEs needs to be allocated for vPMEM as well. In this case, DDW is created and TCEs are dynamically allocated for both vPMEM and LPAR memory. Today, DDW is only used in single mode - direct mapped TCEs or dynamically mapped TCEs. This enhancement breaks a single DDW in 2 regions - 1. First region to pre-map LPAR memory 2. Second region to dynamically allocate TCEs for IO to vPMEM The DDW is split only if it is big enough to pre-map complete LPAR memory and still have some space left to dynamically map vPMEM. Maximum size possible DDW is created as permitted by the Hypervisor. Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240514014608.35537-1-gbatra@linux.ibm.com
2024-06-03powerpc/pseries: Remove unused cede related functionsGautam Menghani1-28/+0
Remove extended_cede_processor() and its helpers as extended_cede_processor() has no callers since commit 48f6e7f6d948("powerpc/pseries: remove cede offline state for CPUs") Signed-off-by: Gautam Menghani <gautam@linux.ibm.com> Acked-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240514132457.292865-1-gautam@linux.ibm.com
2024-05-30powerpc: Limit ARCH_HAS_KERNEL_FPU_SUPPORT to PPC64Samuel Holland1-1/+1
When building a 32-bit kernel, some toolchains do not allow mixing soft float and hard float object files: LD vmlinux.o powerpc64le-unknown-linux-musl-ld: lib/test_fpu_impl.o uses hard float, arch/powerpc/kernel/udbg.o uses soft float powerpc64le-unknown-linux-musl-ld: failed to merge target specific data of file lib/test_fpu_impl.o make[2]: *** [scripts/Makefile.vmlinux_o:62: vmlinux.o] Error 1 make[1]: *** [Makefile:1152: vmlinux_o] Error 2 make: *** [Makefile:240: __sub-make] Error 2 This is not an issue when building a 64-bit kernel. To unbreak the build, limit ARCH_HAS_KERNEL_FPU_SUPPORT to 64-bit kernels. This is okay because the only real user of this option, amdgpu, was previously limited to PPC64 anyway; see commit a28e4b672f04 ("drm/amd/display: use ARCH_HAS_KERNEL_FPU_SUPPORT"). Fixes: 01db473e1aa3 ("powerpc: implement ARCH_HAS_KERNEL_FPU_SUPPORT") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202405250851.Z4daYSWG-lkp@intel.com/ Reported-by: Guenter Roeck <linux@roeck-us.net> Closes: https://lore.kernel.org/lkml/eeffaec3-df63-4e55-ab7a-064a65c00efa@roeck-us.net/ Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240529162852.1209-1-samuel.holland@sifive.com
2024-05-30powerpc/uaccess: Use YZ asm constraint for ldMichael Ellerman1-0/+11
The 'ld' instruction requires a 4-byte aligned displacement because it is a DS-form instruction. But the "m" asm constraint doesn't enforce that. Add a special case of __get_user_asm2_goto() so that the "YZ" constraint can be used for "ld". The "Z" constraint is documented in the GCC manual PowerPC machine constraints, and specifies a "memory operand accessed with indexed or indirect addressing". "Y" is not documented in the manual but specifies a "memory operand for a DS-form instruction". Using both allows the compiler to generate a DS-form "ld" or X-form "ldx" as appropriate. The change has to be conditional on CONFIG_PPC_KERNEL_PREFIXED because the "Y" constraint does not guarantee 4-byte alignment when prefixed instructions are enabled. No build errors have been reported due to this, but the possibility is there depending on compiler code generation decisions. Fixes: c20beffeec3c ("powerpc/uaccess: Use flexible addressing with __put_user()/__get_user()") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240529123029.146953-2-mpe@ellerman.id.au
2024-05-30powerpc/uaccess: Fix build errors seen with GCC 13/14Michael Ellerman1-0/+16
Building ppc64le_defconfig with GCC 14 fails with assembler errors: CC fs/readdir.o /tmp/ccdQn0mD.s: Assembler messages: /tmp/ccdQn0mD.s:212: Error: operand out of domain (18 is not a multiple of 4) /tmp/ccdQn0mD.s:226: Error: operand out of domain (18 is not a multiple of 4) ... [6 lines] /tmp/ccdQn0mD.s:1699: Error: operand out of domain (18 is not a multiple of 4) A snippet of the asm shows: # ../fs/readdir.c:210: unsafe_copy_dirent_name(dirent->d_name, name, namlen, efault_end); ld 9,0(29) # MEM[(u64 *)name_38(D) + _88 * 1], MEM[(u64 *)name_38(D) + _88 * 1] # 210 "../fs/readdir.c" 1 1: std 9,18(8) # put_user # *__pus_addr_52, MEM[(u64 *)name_38(D) + _88 * 1] The 'std' instruction requires a 4-byte aligned displacement because it is a DS-form instruction, and as the assembler says, 18 is not a multiple of 4. A similar error is seen with GCC 13 and CONFIG_UBSAN_SIGNED_WRAP=y. The fix is to change the constraint on the memory operand to put_user(), from "m" which is a general memory reference to "YZ". The "Z" constraint is documented in the GCC manual PowerPC machine constraints, and specifies a "memory operand accessed with indexed or indirect addressing". "Y" is not documented in the manual but specifies a "memory operand for a DS-form instruction". Using both allows the compiler to generate a DS-form "std" or X-form "stdx" as appropriate. The change has to be conditional on CONFIG_PPC_KERNEL_PREFIXED because the "Y" constraint does not guarantee 4-byte alignment when prefixed instructions are enabled. Unfortunately clang doesn't support the "Y" constraint so that has to be behind an ifdef. Although the build error is only seen with GCC 13/14, that appears to just be luck. The constraint has been incorrect since it was first added. Fixes: c20beffeec3c ("powerpc/uaccess: Use flexible addressing with __put_user()/__get_user()") Cc: stable@vger.kernel.org # v5.10+ Suggested-by: Kewen Lin <linkw@gcc.gnu.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240529123029.146953-1-mpe@ellerman.id.au
2024-05-30powerpc/pseries/lparcfg: drop error message from guest name lookupNathan Lynch1-2/+2
It's not an error or exceptional situation when the hosting environment does not expose a name for the LP/guest via RTAS or the device tree. This happens with qemu when run without the '-name' option. The message also lacks a newline. Remove it. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Fixes: eddaa9a40275 ("powerpc/pseries: read the lpar name from the firmware") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240524-lparcfg-updates-v2-1-62e2e9d28724@linux.ibm.com
2024-05-29powerpc/bpf: enforce full ordering for ATOMIC operations with BPF_FETCHPuranjay Mohan2-0/+24
The Linux Kernel Memory Model [1][2] requires RMW operations that have a return value to be fully ordered. BPF atomic operations with BPF_FETCH (including BPF_XCHG and BPF_CMPXCHG) return a value back so they need to be JITed to fully ordered operations. POWERPC currently emits relaxed operations for these. We can show this by running the following litmus-test: PPC SB+atomic_add+fetch { 0:r0=x; (* dst reg assuming offset is 0 *) 0:r1=2; (* src reg *) 0:r2=1; 0:r4=y; (* P0 writes to this, P1 reads this *) 0:r5=z; (* P1 writes to this, P0 reads this *) 0:r6=0; 1:r2=1; 1:r4=y; 1:r5=z; } P0 | P1 ; stw r2, 0(r4) | stw r2,0(r5) ; | ; loop:lwarx r3, r6, r0 | ; mr r8, r3 | ; add r3, r3, r1 | sync ; stwcx. r3, r6, r0 | ; bne loop | ; mr r1, r8 | ; | ; lwa r7, 0(r5) | lwa r7,0(r4) ; ~exists(0:r7=0 /\ 1:r7=0) Witnesses Positive: 9 Negative: 3 Condition ~exists (0:r7=0 /\ 1:r7=0) Observation SB+atomic_add+fetch Sometimes 3 9 This test shows that the older store in P0 is reordered with a newer load to a different address. Although there is a RMW operation with fetch between them. Adding a sync before and after RMW fixes the issue: Witnesses Positive: 9 Negative: 0 Condition ~exists (0:r7=0 /\ 1:r7=0) Observation SB+atomic_add+fetch Never 0 9 [1] https://www.kernel.org/doc/Documentation/memory-barriers.txt [2] https://www.kernel.org/doc/Documentation/atomic_t.txt Fixes: aea7ef8a82c0 ("powerpc/bpf/32: add support for BPF_ATOMIC bitwise operations") Fixes: 2d9206b22743 ("powerpc/bpf/32: Add instructions for atomic_[cmp]xchg") Fixes: dbe6e2456fb0 ("powerpc/bpf/64: add support for atomic fetch operations") Fixes: 1e82dfaa7819 ("powerpc/bpf/64: Add instructions for atomic_[cmp]xchg") Cc: stable@vger.kernel.org # v6.0+ Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Naveen N Rao <naveen@kernel.org> Acked-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240513100248.110535-1-puranjay@kernel.org
2024-05-24Merge tag 'mm-stable-2024-05-24-11-49' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more mm updates from Andrew Morton: "Jeff Xu's implementation of the mseal() syscall" * tag 'mm-stable-2024-05-24-11-49' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: selftest mm/mseal read-only elf memory segment mseal: add documentation selftest mm/mseal memory sealing mseal: add mseal syscall mseal: wire up mseal syscall
2024-05-24mseal: wire up mseal syscallJeff Xu1-0/+1
Patch series "Introduce mseal", v10. This patchset proposes a new mseal() syscall for the Linux kernel. In a nutshell, mseal() protects the VMAs of a given virtual memory range against modifications, such as changes to their permission bits. Modern CPUs support memory permissions, such as the read/write (RW) and no-execute (NX) bits. Linux has supported NX since the release of kernel version 2.6.8 in August 2004 [1]. The memory permission feature improves the security stance on memory corruption bugs, as an attacker cannot simply write to arbitrary memory and point the code to it. The memory must be marked with the X bit, or else an exception will occur. Internally, the kernel maintains the memory permissions in a data structure called VMA (vm_area_struct). mseal() additionally protects the VMA itself against modifications of the selected seal type. Memory sealing is useful to mitigate memory corruption issues where a corrupted pointer is passed to a memory management system. For example, such an attacker primitive can break control-flow integrity guarantees since read-only memory that is supposed to be trusted can become writable or .text pages can get remapped. Memory sealing can automatically be applied by the runtime loader to seal .text and .rodata pages and applications can additionally seal security critical data at runtime. A similar feature already exists in the XNU kernel with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the mimmutable syscall [4]. Also, Chrome wants to adopt this feature for their CFI work [2] and this patchset has been designed to be compatible with the Chrome use case. Two system calls are involved in sealing the map: mmap() and mseal(). The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. The idea that inspired this patch comes from Stephen Röttger’s work in V8 CFI [5]. Chrome browser in ChromeOS will be the first user of this API. Indeed, the Chrome browser has very specific requirements for sealing, which are distinct from those of most applications. For example, in the case of libc, sealing is only applied to read-only (RO) or read-execute (RX) memory segments (such as .text and .RELRO) to prevent them from becoming writable, the lifetime of those mappings are tied to the lifetime of the process. Chrome wants to seal two large address space reservations that are managed by different allocators. The memory is mapped RW- and RWX respectively but write access to it is restricted using pkeys (or in the future ARM permission overlay extensions). The lifetime of those mappings are not tied to the lifetime of the process, therefore, while the memory is sealed, the allocators still need to free or discard the unused memory. For example, with madvise(DONTNEED). However, always allowing madvise(DONTNEED) on this range poses a security risk. For example if a jump instruction crosses a page boundary and the second page gets discarded, it will overwrite the target bytes with zeros and change the control flow. Checking write-permission before the discard operation allows us to control when the operation is valid. In this case, the madvise will only succeed if the executing thread has PKEY write permissions and PKRU changes are protected in software by control-flow integrity. Although the initial version of this patch series is targeting the Chrome browser as its first user, it became evident during upstream discussions that we would also want to ensure that the patch set eventually is a complete solution for memory sealing and compatible with other use cases. The specific scenario currently in mind is glibc's use case of loading and sealing ELF executables. To this end, Stephen is working on a change to glibc to add sealing support to the dynamic linker, which will seal all non-writable segments at startup. Once this work is completed, all applications will be able to automatically benefit from these new protections. In closing, I would like to formally acknowledge the valuable contributions received during the RFC process, which were instrumental in shaping this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Liam R. Howlett: perf optimization. Linus Torvalds: assisting in defining system call signature and scope. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. MM perf benchmarks ================== This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to check the VMAs’ sealing flag, so that no partial update can be made, when any segment within the given memory range is sealed. To measure the performance impact of this loop, two tests are developed. [8] The first is measuring the time taken for a particular system call, by using clock_gettime(CLOCK_MONOTONIC). The second is using PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have similar results. The tests have roughly below sequence: for (i = 0; i < 1000, i++) create 1000 mappings (1 page per VMA) start the sampling for (j = 0; j < 1000, j++) mprotect one mapping stop and save the sample delete 1000 mappings calculates all samples. Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz, 4G memory, Chromebook. Based on the latest upstream code: The first test (measuring time) syscall__ vmas t t_mseal delta_ns per_vma % munmap__ 1 909 944 35 35 104% munmap__ 2 1398 1502 104 52 107% munmap__ 4 2444 2594 149 37 106% munmap__ 8 4029 4323 293 37 107% munmap__ 16 6647 6935 288 18 104% munmap__ 32 11811 12398 587 18 105% mprotect 1 439 465 26 26 106% mprotect 2 1659 1745 86 43 105% mprotect 4 3747 3889 142 36 104% mprotect 8 6755 6969 215 27 103% mprotect 16 13748 14144 396 25 103% mprotect 32 27827 28969 1142 36 104% madvise_ 1 240 262 22 22 109% madvise_ 2 366 442 76 38 121% madvise_ 4 623 751 128 32 121% madvise_ 8 1110 1324 215 27 119% madvise_ 16 2127 2451 324 20 115% madvise_ 32 4109 4642 534 17 113% The second test (measuring cpu cycle) syscall__ vmas cpu cmseal delta_cpu per_vma % munmap__ 1 1790 1890 100 100 106% munmap__ 2 2819 3033 214 107 108% munmap__ 4 4959 5271 312 78 106% munmap__ 8 8262 8745 483 60 106% munmap__ 16 13099 14116 1017 64 108% munmap__ 32 23221 24785 1565 49 107% mprotect 1 906 967 62 62 107% mprotect 2 3019 3203 184 92 106% mprotect 4 6149 6569 420 105 107% mprotect 8 9978 10524 545 68 105% mprotect 16 20448 21427 979 61 105% mprotect 32 40972 42935 1963 61 105% madvise_ 1 434 497 63 63 115% madvise_ 2 752 899 147 74 120% madvise_ 4 1313 1513 200 50 115% madvise_ 8 2271 2627 356 44 116% madvise_ 16 4312 4883 571 36 113% madvise_ 32 8376 9319 943 29 111% Based on the result, for 6.8 kernel, sealing check adds 20-40 nano seconds, or around 50-100 CPU cycles, per VMA. In addition, I applied the sealing to 5.10 kernel: The first test (measuring time) syscall__ vmas t tmseal delta_ns per_vma % munmap__ 1 357 390 33 33 109% munmap__ 2 442 463 21 11 105% munmap__ 4 614 634 20 5 103% munmap__ 8 1017 1137 120 15 112% munmap__ 16 1889 2153 263 16 114% munmap__ 32 4109 4088 -21 -1 99% mprotect 1 235 227 -7 -7 97% mprotect 2 495 464 -30 -15 94% mprotect 4 741 764 24 6 103% mprotect 8 1434 1437 2 0 100% mprotect 16 2958 2991 33 2 101% mprotect 32 6431 6608 177 6 103% madvise_ 1 191 208 16 16 109% madvise_ 2 300 324 24 12 108% madvise_ 4 450 473 23 6 105% madvise_ 8 753 806 53 7 107% madvise_ 16 1467 1592 125 8 108% madvise_ 32 2795 3405 610 19 122% The second test (measuring cpu cycle) syscall__ nbr_vma cpu cmseal delta_cpu per_vma % munmap__ 1 684 715 31 31 105% munmap__ 2 861 898 38 19 104% munmap__ 4 1183 1235 51 13 104% munmap__ 8 1999 2045 46 6 102% munmap__ 16 3839 3816 -23 -1 99% munmap__ 32 7672 7887 216 7 103% mprotect 1 397 443 46 46 112% mprotect 2 738 788 50 25 107% mprotect 4 1221 1256 35 9 103% mprotect 8 2356 2429 72 9 103% mprotect 16 4961 4935 -26 -2 99% mprotect 32 9882 10172 291 9 103% madvise_ 1 351 380 29 29 108% madvise_ 2 565 615 49 25 109% madvise_ 4 872 933 61 15 107% madvise_ 8 1508 1640 132 16 109% madvise_ 16 3078 3323 245 15 108% madvise_ 32 5893 6704 811 25 114% For 5.10 kernel, sealing check adds 0-15 ns in time, or 10-30 CPU cycles, there is even decrease in some cases. It might be interesting to compare 5.10 and 6.8 kernel The first test (measuring time) syscall__ vmas t_5_10 t_6_8 delta_ns per_vma % munmap__ 1 357 909 552 552 254% munmap__ 2 442 1398 956 478 316% munmap__ 4 614 2444 1830 458 398% munmap__ 8 1017 4029 3012 377 396% munmap__ 16 1889 6647 4758 297 352% munmap__ 32 4109 11811 7702 241 287% mprotect 1 235 439 204 204 187% mprotect 2 495 1659 1164 582 335% mprotect 4 741 3747 3006 752 506% mprotect 8 1434 6755 5320 665 471% mprotect 16 2958 13748 10790 674 465% mprotect 32 6431 27827 21397 669 433% madvise_ 1 191 240 49 49 125% madvise_ 2 300 366 67 33 122% madvise_ 4 450 623 173 43 138% madvise_ 8 753 1110 357 45 147% madvise_ 16 1467 2127 660 41 145% madvise_ 32 2795 4109 1314 41 147% The second test (measuring cpu cycle) syscall__ vmas cpu_5_10 c_6_8 delta_cpu per_vma % munmap__ 1 684 1790 1106 1106 262% munmap__ 2 861 2819 1958 979 327% munmap__ 4 1183 4959 3776 944 419% munmap__ 8 1999 8262 6263 783 413% munmap__ 16 3839 13099 9260 579 341% munmap__ 32 7672 23221 15549 486 303% mprotect 1 397 906 509 509 228% mprotect 2 738 3019 2281 1140 409% mprotect 4 1221 6149 4929 1232 504% mprotect 8 2356 9978 7622 953 423% mprotect 16 4961 20448 15487 968 412% mprotect 32 9882 40972 31091 972 415% madvise_ 1 351 434 82 82 123% madvise_ 2 565 752 186 93 133% madvise_ 4 872 1313 442 110 151% madvise_ 8 1508 2271 763 95 151% madvise_ 16 3078 4312 1234 77 140% madvise_ 32 5893 8376 2483 78 142% From 5.10 to 6.8 munmap: added 250-550 ns in time, or 500-1100 in cpu cycle, per vma. mprotect: added 200-750 ns in time, or 500-1200 in cpu cycle, per vma. madvise: added 33-50 ns in time, or 70-110 in cpu cycle, per vma. In comparison to mseal, which adds 20-40 ns or 50-100 CPU cycles, the increase from 5.10 to 6.8 is significantly larger, approximately ten times greater for munmap and mprotect. When I discuss the mm performance with Brian Makin, an engineer who worked on performance, it was brought to my attention that such performance benchmarks, which measuring millions of mm syscall in a tight loop, may not accurately reflect real-world scenarios, such as that of a database service. Also this is tested using a single HW and ChromeOS, the data from another HW or distribution might be different. It might be best to take this data with a grain of salt. This patch (of 5): Wire up mseal syscall for all architectures. Link: https://lkml.kernel.org/r/20240415163527.626541-1-jeffxu@chromium.org Link: https://lkml.kernel.org/r/20240415163527.626541-2-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> [Bug #2] Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-23Merge tag 'trace-assign-str-v6.10' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing cleanup from Steven Rostedt: "Remove second argument of __assign_str() The __assign_str() macro logic of the TRACE_EVENT() macro was optimized so that it no longer needs the second argument. The __assign_str() is always matched with __string() field that takes a field name and the source for that field: __string(field, source) The TRACE_EVENT() macro logic will save off the source value and then use that value to copy into the ring buffer via the __assign_str(). Before commit c1fa617caeb0 ("tracing: Rework __assign_str() and __string() to not duplicate getting the string"), the __assign_str() needed the second argument which would perform the same logic as the __string() source parameter did. Not only would this add overhead, but it was error prone as if the __assign_str() source produced something different, it may not have allocated enough for the string in the ring buffer (as the __string() source was used to determine how much to allocate) Now that the __assign_str() just uses the same string that was used in __string() it no longer needs the source parameter. It can now be removed" * tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/treewide: Remove second parameter of __assign_str()
2024-05-23Merge tag 'mm-nonmm-stable-2024-05-22-17-30' of ↵Linus Torvalds3-1/+33
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more non-mm updates from Andrew Morton: - A series ("kbuild: enable more warnings by default") from Arnd Bergmann which enables a number of additional build-time warnings. We fixed all the fallout which we could find, there may still be a few stragglers. - Samuel Holland has developed the series "Unified cross-architecture kernel-mode FPU API". This does a lot of consolidation of per-architecture kernel-mode FPU usage and enables the use of newer AMD GPUs on RISC-V. - Tao Su has fixed some selftests build warnings in the series "Selftests: Fix compilation warnings due to missing _GNU_SOURCE definition". - This pull also includes a nilfs2 fixup from Ryusuke Konishi. * tag 'mm-nonmm-stable-2024-05-22-17-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits) nilfs2: make block erasure safe in nilfs_finish_roll_forward() selftests/harness: use 1024 in place of LINE_MAX Revert "selftests/harness: remove use of LINE_MAX" selftests/fpu: allow building on other architectures selftests/fpu: move FP code to a separate translation unit drm/amd/display: use ARCH_HAS_KERNEL_FPU_SUPPORT drm/amd/display: only use hard-float, not altivec on powerpc riscv: add support for kernel-mode FPU x86: implement ARCH_HAS_KERNEL_FPU_SUPPORT powerpc: implement ARCH_HAS_KERNEL_FPU_SUPPORT LoongArch: implement ARCH_HAS_KERNEL_FPU_SUPPORT lib/raid6: use CC_FLAGS_FPU for NEON CFLAGS arm64: crypto: use CC_FLAGS_FPU for NEON CFLAGS arm64: implement ARCH_HAS_KERNEL_FPU_SUPPORT ARM: crypto: use CC_FLAGS_FPU for NEON CFLAGS ARM: implement ARCH_HAS_KERNEL_FPU_SUPPORT arch: add ARCH_HAS_KERNEL_FPU_SUPPORT x86/fpu: fix asm/fpu/types.h include guard kbuild: enable -Wcast-function-type-strict unconditionally kbuild: enable -Wformat-truncation on clang ...
2024-05-23tracing/treewide: Remove second parameter of __assign_str()Steven Rostedt (Google)1-2/+2
With the rework of how the __string() handles dynamic strings where it saves off the source string in field in the helper structure[1], the assignment of that value to the trace event field is stored in the helper value and does not need to be passed in again. This means that with: __string(field, mystring) Which use to be assigned with __assign_str(field, mystring), no longer needs the second parameter and it is unused. With this, __assign_str() will now only get a single parameter. There's over 700 users of __assign_str() and because coccinelle does not handle the TRACE_EVENT() macro I ended up using the following sed script: git grep -l __assign_str | while read a ; do sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file; mv /tmp/test-file $a; done I then searched for __assign_str() that did not end with ';' as those were multi line assignments that the sed script above would fail to catch. Note, the same updates will need to be done for: __assign_str_len() __assign_rel_str() __assign_rel_str_len() I tested this with both an allmodconfig and an allyesconfig (build only for both). [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240516133454.681ba6a0@rorschach.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Christian König <christian.koenig@amd.com> for the amdgpu parts. Acked-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> #for Acked-by: Rafael J. Wysocki <rafael@kernel.org> # for thermal Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> # xfs Tested-by: Guenter Roeck <linux@roeck-us.net>
2024-05-22Merge tag 'driver-core-6.10-rc1' of ↵Linus Torvalds2-19/+1
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the small set of driver core and kernfs changes for 6.10-rc1. Nothing major here at all, just a small set of changes for some driver core apis, and minor fixups. Included in here are: - sysfs_bin_attr_simple_read() helper added and used - device_show_string() helper added and used All usages of these were acked by the various maintainers. Also in here are: - kernfs minor cleanup - removed unused functions - typo fix in documentation - pay attention to sysfs_create_link() failures in module.c finally All of these have been in linux-next for a very long time with no reported problems" * tag 'driver-core-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: device property: Fix a typo in the description of device_get_child_node_count() kernfs: mount: Remove unnecessary ‘NULL’ values from knparent scsi: Use device_show_string() helper for sysfs attributes platform/x86: Use device_show_string() helper for sysfs attributes perf: Use device_show_string() helper for sysfs attributes IB/qib: Use device_show_string() helper for sysfs attributes hwmon: Use device_show_string() helper for sysfs attributes driver core: Add device_show_string() helper for sysfs attributes treewide: Use sysfs_bin_attr_simple_read() helper sysfs: Add sysfs_bin_attr_simple_read() helper module: don't ignore sysfs_create_link() failures driver core: Remove unused platform_notify, platform_notify_remove
2024-05-21Merge tag 'asm-generic-6.10' of ↵Linus Torvalds2-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull asm-generic cleanups from Arnd Bergmann: "These are a few cross-architecture cleanup patches: - separate out fbdev support from the asm/video.h contents that may be used by either the old fbdev drivers or the newer drm display code (Thomas Zimmermann) - cleanups for the generic bitops code and asm-generic/bug.h (Thorsten Blum) - remove the orphaned include/asm-generic/page.h header that used to be included by long-removed mmu-less architectures (me)" * tag 'asm-generic-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: arch: Fix name collision with ACPI's video.o bug: Improve comment asm-generic: remove unused asm-generic/page.h arch: Rename fbdev header and source files arch: Remove struct fb_info from video helpers arch: Select fbdev helpers with CONFIG_VIDEO bitops: Change function return types from long to int
2024-05-20powerpc: implement ARCH_HAS_KERNEL_FPU_SUPPORTSamuel Holland3-1/+33
PowerPC provides an equivalent to the common kernel-mode FPU API, but in a different header and using different function names. The PowerPC API also requires a non-preemptible context. Add a wrapper header, and export the CFLAGS adjustments. Link: https://lkml.kernel.org/r/20240329072441.591471-9-samuel.holland@sifive.com Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Palmer Dabbelt <palmer@rivosinc.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Xuerui <git@xen0n.name> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-05-19Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds16-111/+52
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
2024-05-18Merge tag 'kbuild-v6.10' of ↵Linus Torvalds5-22/+9
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Avoid 'constexpr', which is a keyword in C23 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of 'dt_binding_check' - Fix weak references to avoid GOT entries in position-independent code generation - Convert the last use of 'optional' property in arch/sh/Kconfig - Remove support for the 'optional' property in Kconfig - Remove support for Clang's ThinLTO caching, which does not work with the .incbin directive - Change the semantics of $(src) so it always points to the source directory, which fixes Makefile inconsistencies between upstream and downstream - Fix 'make tar-pkg' for RISC-V to produce a consistent package - Provide reasonable default coverage for objtool, sanitizers, and profilers - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc. - Remove the last use of tristate choice in drivers/rapidio/Kconfig - Various cleanups and fixes in Kconfig * tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits) kconfig: use sym_get_choice_menu() in sym_check_prop() rapidio: remove choice for enumeration kconfig: lxdialog: remove initialization with A_NORMAL kconfig: m/nconf: merge two item_add_str() calls kconfig: m/nconf: remove dead code to display value of bool choice kconfig: m/nconf: remove dead code to display children of choice members kconfig: gconf: show checkbox for choice correctly kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal Makefile: remove redundant tool coverage variables kbuild: provide reasonable defaults for tool coverage modules: Drop the .export_symbol section from the final modules kconfig: use menu_list_for_each_sym() in sym_check_choice_deps() kconfig: use sym_get_choice_menu() in conf_write_defconfig() kconfig: add sym_get_choice_menu() helper kconfig: turn defaults and additional prompt for choice members into error kconfig: turn missing prompt for choice members into error kconfig: turn conf_choice() into void function kconfig: use linked list in sym_set_changed() kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED kconfig: gconf: remove debug code ...
2024-05-18Merge tag 'probes-v6.10' of ↵Linus Torvalds1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes updates from Masami Hiramatsu: - tracing/probes: Add new pseudo-types %pd and %pD support for dumping dentry name from 'struct dentry *' and file name from 'struct file *' - uprobes performance optimizations: - Speed up the BPF uprobe event by delaying the fetching of the uprobe event arguments that are not used in BPF - Avoid locking by speculatively checking whether uprobe event is valid - Reduce lock contention by using read/write_lock instead of spinlock for uprobe list operation. This improved BPF uprobe benchmark result 43% on average - rethook: Remove non-fatal warning messages when tracing stack from BPF and skip rcu_is_watching() validation in rethook if possible - objpool: Optimize objpool (which is used by kretprobes and fprobe as rethook backend storage) by inlining functions and avoid caching nr_cpu_ids because it is a const value - fprobe: Add entry/exit callbacks types (code cleanup) - kprobes: Check ftrace was killed in kprobes if it uses ftrace * tag 'probes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: kprobe/ftrace: bail out if ftrace was killed selftests/ftrace: Fix required features for VFS type test case objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids objpool: enable inlining objpool_push() and objpool_pop() operations rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get() ftrace: make extra rcu_is_watching() validation check optional uprobes: reduce contention on uprobes_tree access rethook: Remove warning messages printed for finding return address of a frame. fprobe: Add entry/exit callbacks types selftests/ftrace: add fprobe test cases for VFS type "%pd" and "%pD" selftests/ftrace: add kprobe test cases for VFS type "%pd" and "%pD" Documentation: tracing: add new type '%pd' and '%pD' for kprobe tracing/probes: support '%pD' type for print struct file's name tracing/probes: support '%pd' type for print struct dentry's name uprobes: add speculative lockless system-wide uprobe filter check uprobes: prepare uprobe args buffer lazily uprobes: encapsulate preparation of uprobe args buffer
2024-05-17Merge tag 'powerpc-6.10-1' of ↵Linus Torvalds136-1006/+2106
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Enable BPF Kernel Functions (kfuncs) in the powerpc BPF JIT. - Allow per-process DEXCR (Dynamic Execution Control Register) settings via prctl, notably NPHIE which controls hashst/hashchk for ROP protection. - Install powerpc selftests in sub-directories. Note this changes the way run_kselftest.sh needs to be invoked for powerpc selftests. - Change fadump (Firmware Assisted Dump) to better handle memory add/remove. - Add support for passing additional parameters to the fadump kernel. - Add support for updating the kdump image on CPU/memory add/remove events. - Other small features, cleanups and fixes. Thanks to Andrew Donnellan, Andy Shevchenko, Aneesh Kumar K.V, Arnd Bergmann, Benjamin Gray, Bjorn Helgaas, Christian Zigotzky, Christophe Jaillet, Christophe Leroy, Colin Ian King, Cédric Le Goater, Dr. David Alan Gilbert, Erhard Furtner, Frank Li, GUO Zihua, Ganesh Goudar, Geoff Levand, Ghanshyam Agrawal, Greg Kurz, Hari Bathini, Joel Stanley, Justin Stitt, Kunwu Chan, Li Yang, Lidong Zhong, Madhavan Srinivasan, Mahesh Salgaonkar, Masahiro Yamada, Matthias Schiffer, Naresh Kamboju, Nathan Chancellor, Nathan Lynch, Naveen N Rao, Nicholas Miehlbradt, Ran Wang, Randy Dunlap, Ritesh Harjani, Sachin Sant, Shirisha Ganta, Shrikanth Hegde, Sourabh Jain, Stephen Rothwell, sundar, Thorsten Blum, Vaibhav Jain, Xiaowei Bao, Yang Li, and Zhao Chenhui. * tag 'powerpc-6.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (85 commits) powerpc/fadump: Fix section mismatch warning powerpc/85xx: fix compile error without CONFIG_CRASH_DUMP powerpc/fadump: update documentation about bootargs_append powerpc/fadump: pass additional parameters when fadump is active powerpc/fadump: setup additional parameters for dump capture kernel powerpc/pseries/fadump: add support for multiple boot memory regions selftests/powerpc/dexcr: Fix spelling mistake "predicition" -> "prediction" KVM: PPC: Book3S HV nestedv2: Fix an error handling path in gs_msg_ops_kvmhv_nestedv2_config_fill_info() KVM: PPC: Fix documentation for ppc mmu caps KVM: PPC: code cleanup for kvmppc_book3s_irqprio_deliver KVM: PPC: Book3S HV nestedv2: Cancel pending DEC exception powerpc/xmon: Check cpu id in commands "c#", "dp#" and "dx#" powerpc/code-patching: Use dedicated memory routines for patching powerpc/code-patching: Test patch_instructions() during boot powerpc64/kasan: Pass virtual addresses to kasan_init_phys_region() powerpc: rename SPRN_HID2 define to SPRN_HID2_750FX powerpc: Fix typos powerpc/eeh: Fix spelling of the word "auxillary" and update comment macintosh/ams: Fix unused variable warning powerpc/Makefile: Remove bits related to the previous use of -mcmodel=large ...
2024-05-16powerpc/fadump: Fix section mismatch warningMichael Ellerman1-1/+1
With some compilers/configs fadump_setup_param_area() isn't inlined into its caller (which is __init), leading to a section mismatch warning: WARNING: modpost: vmlinux: section mismatch in reference: fadump_setup_param_area+0x200 (section: .text.fadump_setup_param_area) -> memblock_phys_alloc_range (section: .init.text) Fix it by adding an __init annotation. Fixes: 683eab94da75 ("powerpc/fadump: setup additional parameters for dump capture kernel") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Closes: https://lore.kernel.org/all/20240515163708.3380c4d1@canb.auug.org.au/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/all/202405140922.oucLOx4Y-lkp@intel.com/ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240516132631.347956-1-mpe@ellerman.id.au
2024-05-16kprobe/ftrace: bail out if ftrace was killedStephen Brennan1-0/+3
If an error happens in ftrace, ftrace_kill() will prevent disarming kprobes. Eventually, the ftrace_ops associated with the kprobes will be freed, yet the kprobes will still be active, and when triggered, they will use the freed memory, likely resulting in a page fault and panic. This behavior can be reproduced quite easily, by creating a kprobe and then triggering a ftrace_kill(). For simplicity, we can simulate an ftrace error with a kernel module like [1]: [1]: https://github.com/brenns10/kernel_stuff/tree/master/ftrace_killer sudo perf probe --add commit_creds sudo perf trace -e probe:commit_creds # In another terminal make sudo insmod ftrace_killer.ko # calls ftrace_kill(), simulating bug # Back to perf terminal # ctrl-c sudo perf probe --del commit_creds After a short period, a page fault and panic would occur as the kprobe continues to execute and uses the freed ftrace_ops. While ftrace_kill() is supposed to be used only in extreme circumstances, it is invoked in FTRACE_WARN_ON() and so there are many places where an unexpected bug could be triggered, yet the system may continue operating, possibly without the administrator noticing. If ftrace_kill() does not panic the system, then we should do everything we can to continue operating, rather than leave a ticking time bomb. Link: https://lore.kernel.org/all/20240501162956.229427-1-stephen.s.brennan@oracle.com/ Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Guo Ren <guoren@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-05-16Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds7-33/+0
Pull KVM updates from Paolo Bonzini: "ARM: - Move a lot of state that was previously stored on a per vcpu basis into a per-CPU area, because it is only pertinent to the host while the vcpu is loaded. This results in better state tracking, and a smaller vcpu structure. - Add full handling of the ERET/ERETAA/ERETAB instructions in nested virtualisation. The last two instructions also require emulating part of the pointer authentication extension. As a result, the trap handling of pointer authentication has been greatly simplified. - Turn the global (and not very scalable) LPI translation cache into a per-ITS, scalable cache, making non directly injected LPIs much cheaper to make visible to the vcpu. - A batch of pKVM patches, mostly fixes and cleanups, as the upstreaming process seems to be resuming. Fingers crossed! - Allocate PPIs and SGIs outside of the vcpu structure, allowing for smaller EL2 mapping and some flexibility in implementing more or less than 32 private IRQs. - Purge stale mpidr_data if a vcpu is created after the MPIDR map has been created. - Preserve vcpu-specific ID registers across a vcpu reset. - Various minor cleanups and improvements. LoongArch: - Add ParaVirt IPI support - Add software breakpoint support - Add mmio trace events support RISC-V: - Support guest breakpoints using ebreak - Introduce per-VCPU mp_state_lock and reset_cntx_lock - Virtualize SBI PMU snapshot and counter overflow interrupts - New selftests for SBI PMU and Guest ebreak - Some preparatory work for both TDX and SNP page fault handling. This also cleans up the page fault path, so that the priorities of various kinds of fauls (private page, no memory, write to read-only slot, etc.) are easier to follow. x86: - Minimize amount of time that shadow PTEs remain in the special REMOVED_SPTE state. This is a state where the mmu_lock is held for reading but concurrent accesses to the PTE have to spin; shortening its use allows other vCPUs to repopulate the zapped region while the zapper finishes tearing down the old, defunct page tables. - Advertise the max mappable GPA in the "guest MAXPHYADDR" CPUID field, which is defined by hardware but left for software use. This lets KVM communicate its inability to map GPAs that set bits 51:48 on hosts without 5-level nested page tables. Guest firmware is expected to use the information when mapping BARs; this avoids that they end up at a legal, but unmappable, GPA. - Fixed a bug where KVM would not reject accesses to MSR that aren't supposed to exist given the vCPU model and/or KVM configuration. - As usual, a bunch of code cleanups. x86 (AMD): - Implement a new and improved API to initialize SEV and SEV-ES VMs, which will also be extendable to SEV-SNP. The new API specifies the desired encryption in KVM_CREATE_VM and then separately initializes the VM. The new API also allows customizing the desired set of VMSA features; the features affect the measurement of the VM's initial state, and therefore enabling them cannot be done tout court by the hypervisor. While at it, the new API includes two bugfixes that couldn't be applied to the old one without a flag day in userspace or without affecting the initial measurement. When a SEV-ES VM is created with the new VM type, KVM_GET_REGS/KVM_SET_REGS and friends are rejected once the VMSA has been encrypted. Also, the FPU and AVX state will be synchronized and encrypted too. - Support for GHCB version 2 as applicable to SEV-ES guests. This, once more, is only accessible when using the new KVM_SEV_INIT2 flow for initialization of SEV-ES VMs. x86 (Intel): - An initial bunch of prerequisite patches for Intel TDX were merged. They generally don't do anything interesting. The only somewhat user visible change is a new debugging mode that checks that KVM's MMU never triggers a #VE virtualization exception in the guest. - Clear vmcs.EXIT_QUALIFICATION when synthesizing an EPT Misconfig VM-Exit to L1, as per the SDM. Generic: - Use vfree() instead of kvfree() for allocations that always use vcalloc() or __vcalloc(). - Remove .change_pte() MMU notifier - the changes to non-KVM code are small and Andrew Morton asked that I also take those through the KVM tree. The callback was only ever implemented by KVM (which was also the original user of MMU notifiers) but it had been nonfunctional ever since calls to set_pte_at_notify were wrapped with invalidate_range_start and invalidate_range_end... in 2012. Selftests: - Enhance the demand paging test to allow for better reporting and stressing of UFFD performance. - Convert the steal time test to generate TAP-friendly output. - Fix a flaky false positive in the xen_shinfo_test due to comparing elapsed time across two different clock domains. - Skip the MONITOR/MWAIT test if the host doesn't actually support MWAIT. - Avoid unnecessary use of "sudo" in the NX hugepage test wrapper shell script, to play nice with running in a minimal userspace environment. - Allow skipping the RSEQ test's sanity check that the vCPU was able to complete a reasonable number of KVM_RUNs, as the assert can fail on a completely valid setup. If the test is run on a large-ish system that is otherwise idle, and the test isn't affined to a low-ish number of CPUs, the vCPU task can be repeatedly migrated to CPUs that are in deep sleep states, which results in the vCPU having very little net runtime before the next migration due to high wakeup latencies. - Define _GNU_SOURCE for all selftests to fix a warning that was introduced by a change to kselftest_harness.h late in the 6.9 cycle, and because forcing every test to #define _GNU_SOURCE is painful. - Provide a global pseudo-RNG instance for all tests, so that library code can generate random, but determinstic numbers. - Use the global pRNG to randomly force emulation of select writes from guest code on x86, e.g. to help validate KVM's emulation of locked accesses. - Allocate and initialize x86's GDT, IDT, TSS, segments, and default exception handlers at VM creation, instead of forcing tests to manually trigger the related setup. Documentation: - Fix a goof in the KVM_CREATE_GUEST_MEMFD documentation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (225 commits) selftests/kvm: remove dead file KVM: selftests: arm64: Test vCPU-scoped feature ID registers KVM: selftests: arm64: Test that feature ID regs survive a reset KVM: selftests: arm64: Store expected register value in set_id_regs KVM: selftests: arm64: Rename helper in set_id_regs to imply VM scope KVM: arm64: Only reset vCPU-scoped feature ID regs once KVM: arm64: Reset VM feature ID regs from kvm_reset_sys_regs() KVM: arm64: Rename is_id_reg() to imply VM scope KVM: arm64: Destroy mpidr_data for 'late' vCPU creation KVM: arm64: Use hVHE in pKVM by default on CPUs with VHE support KVM: arm64: Fix hvhe/nvhe early alias parsing KVM: SEV: Allow per-guest configuration of GHCB protocol version KVM: SEV: Add GHCB handling for termination requests KVM: SEV: Add GHCB handling for Hypervisor Feature Support requests KVM: SEV: Add support to handle AP reset MSR protocol KVM: x86: Explicitly zero kvm_caps during vendor module load KVM: x86: Fully re-initialize supported_mce_cap on vendor module load KVM: x86: Fully re-initialize supported_vm_types on vendor module load KVM: x86/mmu: Sanity check that __kvm_faultin_pfn() doesn't create noslot pfns KVM: x86/mmu: Initialize kvm_page_fault's pfn and hva to error values ...
2024-05-16Merge tag 'libnvdimm-for-6.10' of ↵Linus Torvalds2-206/+2
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull nvdimm updates from Ira Weiny: "The changes include removing duplicate code and updating the nvdimm tree to the current kernel interfaces such as using const for struct device_type and changing the platform remove callback signature" * tag 'libnvdimm-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: remove redundant assignment to variable rc ndtest: Convert to platform remove callback returning void nvdimm/btt: always set max_integrity_segments nvdimm: remove nd_integrity_init dax: constify the struct device_type usage powerpc/papr_scm: Move duplicate definitions to common header files
2024-05-16Merge tag 'modules-6.10-rc1' of ↵Linus Torvalds9-68/+74
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull modules updates from Luis Chamberlain: "Finally something fun. Mike Rapoport does some cleanup to allow us to take out module_alloc() out of modules into a new paint shedded execmem_alloc() and execmem_free() so to make emphasis these helpers are actually used outside of modules. It starts with a non-functional changes API rename / placeholders to then allow architectures to define their requirements into a new shiny struct execmem_info with ranges, and requirements for those ranges. Archs now can intitialize this execmem_info as the last part of mm_core_init() if they have to diverge from the norm. Each range is a known type clearly articulated and spelled out in enum execmem_type. Although a lot of this is major cleanup and prep work for future enhancements an immediate clear gain is we get to enable KPROBES without MODULES now. That is ultimately what motiviated to pick this work up again, now with smaller goal as concrete stepping stone" * tag 'modules-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of kprobes: remove dependency on CONFIG_MODULES powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate x86/ftrace: enable dynamic ftrace without CONFIG_MODULES arch: make execmem setup available regardless of CONFIG_MODULES powerpc: extend execmem_params for kprobes allocations arm64: extend execmem_info for generated code allocations riscv: extend execmem_params for generated code allocations mm/execmem, arch: convert remaining overrides of module_alloc to execmem mm/execmem, arch: convert simple overrides of module_alloc to execmem mm: introduce execmem_alloc() and execmem_free() module: make module_memory_{alloc,free} more self-contained sparc: simplify module_alloc() nios2: define virtual address space for modules mips: module: rename MODULE_START to MODULES_VADDR arm64: module: remove unneeded call to kasan_alloc_module_shadow() kallsyms: replace deprecated strncpy with strscpy module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree.
2024-05-15Merge tag 'printk-for-6.10' of ↵Linus Torvalds5-5/+5
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk updates from Petr Mladek: - Use no_printk() instead of "if (0) printk()" constructs to avoid generating printk index for messages disabled at compile time - Remove deprecated strncpy/strcpy from printk.c - Remove redundant CONFIG_BASE_FULL in favor of CONFIG_BASE_SMALL * tag 'printk-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk: cleanup deprecated uses of strncpy/strcpy printk: Remove redundant CONFIG_BASE_FULL printk: Change type of CONFIG_BASE_SMALL to bool printk: Fix LOG_CPU_MAX_BUF_SHIFT when BASE_SMALL is enabled ceph: Use no_printk() helper dyndbg: Use *no_printk() helpers dev_printk: Add and use dev_no_printk() printk: Let no_printk() use _printk()
2024-05-14Merge tag 'irq-core-2024-05-12' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull interrupt subsystem updates from Thomas Gleixner: "Core code: - Interrupt storm detection for the lockup watchdog: Lockups which are caused by interrupt storms are not easy to debug because there is no information about the events which make the lockup detector trigger. To make this more user friendly, provide an extenstion to interrupt statistics which allows to take snapshots and an interface to retrieve the delta to the snapshot. Use this new mechanism in the watchdog code to do a two stage lockup analysis by taking the snapshot and printing the deltas for the topmost active interrupts on the second trigger. Note: This contains both the interrupt and the watchdog changes as the latter depend on the former obviously. - Avoid summation loops in the /proc/interrupts output and use the global counter when possible - Skip suspended interrupts on CPU hotplug operations to ensure that they are not delivered before the system resumes the device drivers when coming out of suspend. - On CPU hot-unplug interrupts which are affine to the outgoing CPU are migrated to a different CPU in the affinity mask. This can fail when the CPUs have no vectors left. Instead of giving up try to migrate it to any online CPU and thereby breaking the affinity setting in order to prevent a stale device interrupt which targets an offline CPU - The usual small cleanups Driver code: - Support for the RISCV AIA MSI controller - Make the interrupt allocation for the Loongson PCH controller more flexible to prevent vector exhaustion - The usual set of cleanups and fixes all over the place" * tag 'irq-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits) irqchip/gic-v3-its: Remove BUG_ON in its_vpe_irq_domain_alloc cpuidle: Avoid explicit cpumask allocation on stack irqchip/sifive-plic: Avoid explicit cpumask allocation on stack irqchip/riscv-aplic-direct: Avoid explicit cpumask allocation on stack irqchip/loongson-eiointc: Avoid explicit cpumask allocation on stack irqchip/gic-v3-its: Avoid explicit cpumask allocation on stack irqchip/irq-bcm6345-l1: Avoid explicit cpumask allocation on stack cpumask: Introduce cpumask_first_and_and() irqchip/irq-brcmstb-l2: Avoid saving mask on shutdown genirq: Reuse irq_is_nmi() genirq/cpuhotplug: Retry with cpu_online_mask when migration fails genirq/cpuhotplug: Skip suspended interrupts when restoring affinity arm64: dts: st: Add interrupt parent to pinctrl on stm32mp251 arm64: dts: st: Add exti1 and exti2 nodes on stm32mp251 ARM: dts: stm32: List exti parent interrupts on stm32mp131 ARM: dts: stm32: List exti parent interrupts on stm32mp151 arm64: Kconfig.platforms: Enable STM32_EXTI for ARCH_STM32 irqchip/stm32-exti: Mark events reserved with RIF configuration check irqchip/stm32-exti: Skip secure events irqchip/stm32-exti: Convert driver to standard PM ...
2024-05-14Merge tag 'timers-core-2024-05-12' of ↵Linus Torvalds1-15/+11
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timers and timekeeping updates from Thomas Gleixner: "Core code: - Make timekeeping and VDSO time readouts resilent against math overflow: In guest context the kernel is prone to math overflow when the host defers the timer interrupt due to overload, malfunction or malice. This can be mitigated by checking the clocksource delta for the maximum deferrement which is readily available. If that value is exceeded then the code uses a slowpath function which can handle the multiplication overflow. This functionality is enabled unconditionally in the kernel, but made conditional in the VDSO code. The latter is conditional because it allows architectures to optimize the check so it is not causing performance regressions. On X86 this is achieved by reworking the existing check for negative TSC deltas as a negative delta obviously exceeds the maximum deferrement when it is evaluated as an unsigned value. That avoids two conditionals in the hotpath and allows to hide both the negative delta and the large delta handling in the same slow path. - Add an initial minimal ktime_t abstraction for Rust - The usual boring cleanups and enhancements Drivers: - Boring updates to device trees and trivial enhancements in various drivers" * tag 'timers-core-2024-05-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) clocksource/drivers/arm_arch_timer: Mark hisi_161010101_oem_info const clocksource/drivers/timer-ti-dm: Remove an unused field in struct dmtimer clocksource/drivers/renesas-ostm: Avoid reprobe after successful early probe clocksource/drivers/renesas-ostm: Allow OSTM driver to reprobe for RZ/V2H(P) SoC dt-bindings: timer: renesas: ostm: Document Renesas RZ/V2H(P) SoC rust: time: doc: Add missing C header links clocksource: Make the int help prompt unit readable in ncurses hrtimer: Rename __hrtimer_hres_active() to hrtimer_hres_active() timerqueue: Remove never used function timerqueue_node_expires() rust: time: Add Ktime vdso: Fix powerpc build U64_MAX undeclared error clockevents: Convert s[n]printf() to sysfs_emit() clocksource: Convert s[n]printf() to sysfs_emit() clocksource: Make watchdog and suspend-timing multiplication overflow safe timekeeping: Let timekeeping_cycles_to_ns() handle both under and overflow timekeeping: Make delta calculation overflow safe timekeeping: Prepare timekeeping_cycles_to_ns() for overflow safety timekeeping: Fold in timekeeping_delta_to_ns() timekeeping: Consolidate timekeeping helpers timekeeping: Refactor timekeeping helpers ...
2024-05-14Makefile: remove redundant tool coverage variablesMasahiro Yamada2-11/+0
Now Kbuild provides reasonable defaults for objtool, sanitizers, and profilers. Remove redundant variables. Note: This commit changes the coverage for some objects: - include arch/mips/vdso/vdso-image.o into UBSAN, GCOV, KCOV - include arch/sparc/vdso/vdso-image-*.o into UBSAN - include arch/sparc/vdso/vma.o into UBSAN - include arch/x86/entry/vdso/extable.o into KASAN, KCSAN, UBSAN, GCOV, KCOV - include arch/x86/entry/vdso/vdso-image-*.o into KASAN, KCSAN, UBSAN, GCOV, KCOV - include arch/x86/entry/vdso/vdso32-setup.o into KASAN, KCSAN, UBSAN, GCOV, KCOV - include arch/x86/entry/vdso/vma.o into GCOV, KCOV - include arch/x86/um/vdso/vma.o into KASAN, GCOV, KCOV I believe these are positive effects because all of them are kernel space objects. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Roberto Sassu <roberto.sassu@huawei.com>
2024-05-14powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriateMike Rapoport (IBM)6-9/+9
There are places where CONFIG_MODULES guards the code that depends on memory allocation being done with module_alloc(). Replace CONFIG_MODULES with CONFIG_EXECMEM in such places. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14arch: make execmem setup available regardless of CONFIG_MODULESMike Rapoport (IBM)2-63/+64
execmem does not depend on modules, on the contrary modules use execmem. To make execmem available when CONFIG_MODULES=n, for instance for kprobes, split execmem_params initialization out from arch/*/kernel/module.c and compile it when CONFIG_EXECMEM=y Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14powerpc: extend execmem_params for kprobes allocationsMike Rapoport (IBM)2-20/+7
powerpc overrides kprobes::alloc_insn_page() to remove writable permissions when STRICT_MODULE_RWX is on. Add definition of EXECMEM_KRPOBES to execmem_params to allow using the generic kprobes::alloc_insn_page() with the desired permissions. As powerpc uses breakpoint instructions to inject kprobes, it does not need to constrain kprobe allocations to the modules area and can use the entire vmalloc address space. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14mm/execmem, arch: convert remaining overrides of module_alloc to execmemMike Rapoport (IBM)1-21/+39
Extend execmem parameters to accommodate more complex overrides of module_alloc() by architectures. This includes specification of a fallback range required by arm, arm64 and powerpc, EXECMEM_MODULE_DATA type required by powerpc, support for allocation of KASAN shadow required by s390 and x86 and support for late initialization of execmem required by arm64. The core implementation of execmem_alloc() takes care of suppressing warnings when the initial allocation fails but there is a fallback range defined. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Will Deacon <will@kernel.org> Acked-by: Song Liu <song@kernel.org> Tested-by: Liviu Dudau <liviu@dudau.co.uk> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2024-05-14mm: introduce execmem_alloc() and execmem_free()Mike Rapoport (IBM)1-3/+3
module_alloc() is used everywhere as a mean to allocate memory for code. Beside being semantically wrong, this unnecessarily ties all subsystems that need to allocate code, such as ftrace, kprobes and BPF to modules and puts the burden of code allocation to the modules code. Several architectures override module_alloc() because of various constraints where the executable memory can be located and this causes additional obstacles for improvements of code allocation. Start splitting code allocation from modules by introducing execmem_alloc() and execmem_free() APIs. Initially, execmem_alloc() is a wrapper for module_alloc() and execmem_free() is a replacement of module_memfree() to allow updating all call sites to use the new APIs. Since architectures define different restrictions on placement, permissions, alignment and other parameters for memory that can be used by different subsystems that allocate executable memory, execmem_alloc() takes a type argument, that will be used to identify the calling subsystem and to allow architectures define parameters for ranges suitable for that subsystem. No functional changes. Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Acked-by: Song Liu <song@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>