path: root/arch/x86/lib
9 days ago  x86/insn: Fix CTEST instruction decoding  [Kirill A. Shutemov, 1 file, +2/-2]
insn_decoder_test found a problem with decoding APX CTEST instructions:

  Found an x86 instruction decoder bug, please report this.
  ffffffff810021df 62 54 94 05 85 ff      ctestneq
  objdump says 6 bytes, but insn_get_length() says 5

It happens because x86-opcode-map.txt doesn't specify arguments for the instruction, so the decoder doesn't expect to see a ModRM byte.

Fixes: 690ca3a3067f ("x86/insn: Add support for APX EVEX instructions to the opcode map")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org # v6.10+
Link: https://lore.kernel.org/r/20250423065815.2003231-1-kirill.shutemov@linux.intel.com
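For context, this is roughly how a byte sequence is run through the kernel's instruction decoder (a minimal sketch using the in-kernel API from <asm/insn.h>; the wrapper function and its error handling are illustrative, not code from the patch):

	#include <asm/insn.h>	/* struct insn, insn_init(), insn_get_length() */

	/* Decode one 64-bit instruction and report its length, as insn_decoder_test does. */
	static int decoded_length(const void *kaddr, int buf_len)
	{
		struct insn insn;
		int ret;

		insn_init(&insn, kaddr, buf_len, 1 /* x86_64 */);
		ret = insn_get_length(&insn);	/* walks prefixes, opcode, ModRM, SIB, ... */
		if (ret < 0)
			return ret;		/* decoder rejected the bytes */
		return insn.length;		/* should match what objdump reports */
	}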
2025-03-29  x86/uaccess: Improve performance by aligning writes to 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs  [Herton R. Krzesinski, 1 file, +18/-0]
History of the performance regression:
======================================

Since the following series of user copy updates was merged upstream ~2 years ago via:

  a5624566431d ("Merge branch 'x86-rep-insns': x86 user copy clarifications")

... copy_user_generic() on x86_64 stopped aligning writes to the destination to an 8-byte boundary for the non-FSRM case. Previously, this was done through the ALIGN_DESTINATION macro that was used in the now removed copy_user_generic_unrolled function.

It turns out this change causes some loss of performance/throughput on some use cases and specific CPUs/platforms without FSRM and ERMS.

Lately I got two reports of performance/throughput issues after a RHEL 9 kernel pulled the same upstream series with updates to user copy functions. Both reports consisted of running specific networking/TCP related testing using iperf3.

Partial upstream fix
====================

The first report was related to Linux Bridge testing using VMs on a specific machine with an AMD CPU (EPYC 7402), and after a brief investigation it turned out that the later change via:

  ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")

... helped/fixed the performance issue.

However, after that later commit/fix was applied, I got another regression reported in a multistream TCP test on a 100Gbit mlx5 NIC, also running on an AMD based platform (AMD EPYC 7302 CPU), again using iperf3 to run the test. That regression was present even with the later fix applied, so that fix alone didn't tell the whole story.

Testing performed to pinpoint the residual regression
======================================================

So I narrowed down the second regression use case by running it without traffic through a NIC, on localhost, to narrow down CPU usage and avoid being limited by other factors like network bandwidth. I used another system also with an AMD CPU (AMD EPYC 7742).

Basically, I ran iperf3 in server and client mode on the same system, for example:

 - Start the server binding it to CPU core/thread 19:

     $ taskset -c 19 iperf3 -D -s -B 127.0.0.1 -p 12000

 - Start the client always binding/running on CPU core/thread 17, using perf to get statistics:

     $ perf stat -o stat.txt taskset -c 17 iperf3 -c 127.0.0.1 -b 0/1000 -V \
         -n 50G --repeating-payload -l 16384 -p 12000 --cport 12001 2>&1 \
         > stat-19.txt

The client was always running/pinned to CPU 17. For iperf3 in server mode, I did test runs using CPUs 19, 21, 23, or not pinned to any specific CPU. So it basically consisted of four runs of the same commands, only changing the CPU to which the server is pinned, or removing the taskset call before the server command for the unpinned run. The CPUs were chosen based on the NUMA node they were on; this is the relevant output of lscpu on the system:

  $ lscpu
  ...
  Model name:          AMD EPYC 7742 64-Core Processor
  ...
  Caches (sum of all):
    L1d:               2 MiB (64 instances)
    L1i:               2 MiB (64 instances)
    L2:                32 MiB (64 instances)
    L3:                256 MiB (16 instances)
  NUMA:
    NUMA node(s):      4
    NUMA node0 CPU(s): 0,1,8,9,16,17,24,25,32,33,40,41,48,49,56,57,64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
    NUMA node1 CPU(s): 2,3,10,11,18,19,26,27,34,35,42,43,50,51,58,59,66,67,74,75,82,83,90,91,98,99,106,107,114,115,122,123
    NUMA node2 CPU(s): 4,5,12,13,20,21,28,29,36,37,44,45,52,53,60,61,68,69,76,77,84,85,92,93,100,101,108,109,116,117,124,125
    NUMA node3 CPU(s): 6,7,14,15,22,23,30,31,38,39,46,47,54,55,62,63,70,71,78,79,86,87,94,95,102,103,110,111,118,119,126,127
  ...

So for the server run, when picking a CPU, I chose CPUs not on the same node as the client. The reason is that with that placement I was able to measure relevant performance differences when changing the alignment of the writes to the destination in copy_user_generic.

Testing shows up to +81% performance improvement under iperf3
=============================================================

Here's a summary of the iperf3 runs:

  # Vanilla upstream alignment:

  CPU                RATE            SYS            TIME           sender-receiver
  Server bind 19:    13.0Gbits/sec   28.371851000   33.233499566   86.9%-70.8%
  Server bind 21:    12.9Gbits/sec   28.283381000   33.586486621   85.8%-69.9%
  Server bind 23:    11.1Gbits/sec   33.660190000   39.012243176   87.7%-64.5%
  Server bind none:  18.9Gbits/sec   19.215339000   22.875117865   86.0%-80.5%

  # With the attached patch (aligning writes in the non-ERMS/FSRM case):

  CPU                RATE            SYS            TIME           sender-receiver
  Server bind 19:    20.8Gbits/sec   14.897284000   20.811101382   75.7%-89.0%
  Server bind 21:    20.4Gbits/sec   15.205055000   21.263165909   75.4%-89.7%
  Server bind 23:    20.2Gbits/sec   15.433801000   21.456175000   75.5%-89.8%
  Server bind none:  26.1Gbits/sec   12.534022000   16.632447315   79.8%-89.6%

So I consistently got better results when aligning the write. The results above were run on 6.14.0-rc6/rc7 based kernels. SYS is sys time, and TIME is the total time to run/transfer 50G of data. The last field is the CPU usage of the sender/receiver iperf3 processes. It's also worth noting that each pair of iperf3 runs may give slightly different results on each run, but I always got consistently higher results with the write alignment for this specific test of running the processes on CPUs in different NUMA nodes.

Linus Torvalds helped with/provided this version of the patch. Initially I proposed a version which aligned writes for all cases in rep_movs_alternative, however it used two extra registers, and thus Linus provided an enhanced version that only aligns the write in the large_movsq case. That is sufficient, since the problem happens only on AMD CPUs like the ones mentioned above without ERMS/FSRM, and it also doesn't require extra registers. I also validated that aligning only the large_movsq case is really enough to get the performance back.

I also tested this patch on an old Intel based non-ERMS/FSRM system (with a Xeon E5-2667 - Sandy Bridge based) and didn't see any problems: no performance enhancement, but no regression either, using the same iperf3 based benchmark. Newer Intel processors after Sandy Bridge usually have ERMS and should not be affected by this change.

[ mingo: Updated the changelog. ]

Fixes: ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")
Fixes: 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function")
Reported-by: Ondrej Lichtner <olichtne@redhat.com>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250320142213.2623518-1-herton@redhat.com
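A small shell wrapper for the four-run matrix described above (CPU numbers, ports and iperf3 flags are taken from the commit message; the loop itself is just a convenience sketch, not part of the patch):

	#!/bin/sh
	# Client pinned to CPU 17; server pinned to CPU 19, 21, 23, or left unpinned.
	for cpu in 19 21 23 none; do
		if [ "$cpu" = none ]; then
			iperf3 -D -s -B 127.0.0.1 -p 12000
		else
			taskset -c "$cpu" iperf3 -D -s -B 127.0.0.1 -p 12000
		fi
		sleep 1
		perf stat -o "stat-$cpu.txt" taskset -c 17 iperf3 -c 127.0.0.1 -b 0/1000 -V \
			-n 50G --repeating-payload -l 16384 -p 12000 --cport 12001 \
			> "iperf-$cpu.txt" 2>&1
		pkill iperf3	# stop the daemonized server before the next run
	done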
2025-03-26  Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux  [Linus Torvalds, 11 files, +955/-603]
Pull CRC updates from Eric Biggers:
 "Another set of improvements to the kernel's CRC (cyclic redundancy check) code:

   - Rework the CRC64 library functions to be directly optimized, like what I did last cycle for the CRC32 and CRC-T10DIF library functions

   - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ support and acceleration for crc64_be and crc64_nvme

   - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for crc_t10dif, crc64_be, and crc64_nvme

   - Remove crc_t10dif and crc64_rocksoft from the crypto API, since they are no longer needed there

   - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect

   - Add kunit test cases for crc64_nvme and crc7

   - Eliminate redundant functions for calculating the Castagnoli CRC32, settling on just crc32c()

   - Remove unnecessary prompts from some of the CRC kconfig options

   - Further optimize the x86 crc32c code"

* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits)
  x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512
  lib/crc: remove unnecessary prompt for CONFIG_CRC64
  lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C
  lib/crc: remove unnecessary prompt for CONFIG_CRC8
  lib/crc: remove unnecessary prompt for CONFIG_CRC7
  lib/crc: remove unnecessary prompt for CONFIG_CRC4
  lib/crc7: unexport crc7_be_syndrome_table
  lib/crc_kunit.c: update comment in crc_benchmark()
  lib/crc_kunit.c: add test and benchmark for crc7_be()
  x86/crc32: optimize tail handling for crc32c short inputs
  riscv/crc64: add Zbc optimized CRC64 functions
  riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function
  riscv/crc32: reimplement the CRC32 functions using new template
  riscv/crc: add "template" for Zbc optimized CRC functions
  x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings
  x86/crc32: improve crc32c_arch() code generation with clang
  x86/crc64: implement crc64_be and crc64_nvme using new template
  x86/crc-t10dif: implement crc_t10dif using new template
  x86/crc32: implement crc32_le using new template
  x86/crc: add "template" for [V]PCLMULQDQ based CRC functions
  ...
2025-03-25  Merge tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds, 1 file, +2/-0]
Pull x86 speculation mitigation updates from Borislav Petkov:

 - Some preparatory work to convert the mitigations machinery to mitigating attack vectors instead of single vulnerabilities

 - Untangle and remove a now unneeded X86_FEATURE_USE_IBPB flag

 - Add support for a Zen5-specific SRSO mitigation

 - Cleanups and minor improvements

* tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2
  x86/bugs: Use the cpu_smt_possible() helper instead of open-coded code
  x86/bugs: Add AUTO mitigations for mds/taa/mmio/rfds
  x86/bugs: Relocate mds/taa/mmio/rfds defines
  x86/bugs: Add X86_BUG_SPECTRE_V2_USER
  x86/bugs: Remove X86_FEATURE_USE_IBPB
  KVM: nVMX: Always use IBPB to properly virtualize IBRS
  x86/bugs: Use a static branch to guard IBPB on vCPU switch
  x86/bugs: Remove the X86_FEATURE_USE_IBPB check in ib_prctl_set()
  x86/mm: Remove X86_FEATURE_USE_IBPB checks in cond_mitigation()
  x86/bugs: Move the X86_FEATURE_USE_IBPB check into callers
  x86/bugs: KVM: Add support for SRSO_MSR_FIX
2025-03-25  Merge tag 'x86-cleanups-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds, 2 files, +2/-2]
Pull x86 cleanups from Ingo Molnar:
 "Miscellaneous x86 cleanups by Arnd Bergmann, Charles Han, Mirsad Todorovac, Randy Dunlap, Thorsten Blum and Zhang Kunbo"

* tag 'x86-cleanups-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/coco: Replace 'static const cc_mask' with the newly introduced cc_get_mask() function
  x86/delay: Fix inconsistent whitespace
  selftests/x86/syscall: Fix coccinelle WARNING recommending the use of ARRAY_SIZE()
  x86/platform: Fix missing declaration of 'x86_apple_machine'
  x86/irq: Fix missing declaration of 'io_apic_irqs'
  x86/usercopy: Fix kernel-doc func param name in clean_cache_range()'s description
  x86/apic: Use str_disabled_enabled() helper in print_ipi_mode()
2025-03-19  x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512  [Eric Biggers, 2 files, +13/-25]
Intel made a late change to the AVX10 specification that removes support for a 256-bit maximum vector length and enumeration of the maximum vector length. AVX10 will imply a maximum vector length of 512 bits. I.e. there won't be any such thing as AVX10/256 or AVX10/512; there will just be AVX10, and it will essentially just consolidate AVX512 features. As a result of this new development, my strategy of providing both *_avx10_256 and *_avx10_512 functions didn't turn out to be that useful. The only remaining motivation for the 256-bit AVX512 / AVX10 functions is to avoid downclocking on older Intel CPUs. But I already wrote *_avx2 code too (primarily to support CPUs without AVX512), which performs almost as well as *_avx10_256. So we should just use that. Therefore, remove the *_avx10_256 CRC functions, and rename the *_avx10_512 CRC functions to *_avx512. Make Ice Lake and Tiger Lake use the *_avx2 functions instead of *_avx10_256 which they previously used. Link: https://lore.kernel.org/r/20250319181316.91271-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-03-19  x86/runtime-const: Add the RUNTIME_CONST_PTR assembly macro  [Kirill A. Shutemov, 1 file, +2/-5]
Add an assembly macro to refer to a runtime constant. It hides the linker magic and makes assembly more readable.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250304153342.2016569-1-kirill.shutemov@linux.intel.com
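To illustrate the intent (a hedged sketch, not necessarily the exact upstream macro): the idea is that assembly such as getuser.S can load a boot-patched constant like USER_PTR_MAX in one readable line, instead of open-coding the placeholder immediate plus the runtime_ptr_* section entry every time:

	/* Sketch of a runtime-constant loader macro. */
	.macro RUNTIME_CONST_PTR sym reg
	movq	$0x0123456789abcdef, %\reg	/* placeholder, patched at boot */
	1:
	.pushsection runtime_ptr_\sym, "a"
	.long	1b - 8 - .			/* record where the 8-byte immediate sits */
	.popsection
	.endm

	/* usage: load the runtime value of USER_PTR_MAX into %rdx */
	RUNTIME_CONST_PTR USER_PTR_MAX, rdx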
2025-03-10  x86/crc32: optimize tail handling for crc32c short inputs  [Eric Biggers, 1 file, +9/-1]
For handling the 0 <= len < sizeof(unsigned long) bytes left at the end, do a 4-2-1 step-down instead of a byte-at-a-time loop. This allows taking advantage of wider CRC instructions. Note that crc32c-3way.S already uses this same optimization too. crc_kunit shows an improvement of about 25% for len=127. Suggested-by: "H. Peter Anvin" <hpa@zytor.com> Acked-by: Uros Bizjak <ubizjak@gmail.com> Link: https://lore.kernel.org/r/20250304213216.108925-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
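In C terms, the 4-2-1 step-down replaces a byte loop with at most three fixed-width CRC steps; a minimal sketch (the crc32c_u32/u16/u8 helpers stand in for the SSE4.2 crc32 instruction at each width and are named here only for illustration):

	/* Fold the 0..7 trailing bytes with at most one 4-, 2- and 1-byte step. */
	static u32 crc32c_tail(u32 crc, const u8 *p, size_t len)
	{
		if (len & 4) {
			crc = crc32c_u32(crc, get_unaligned_le32(p));	/* crc32l */
			p += 4;
		}
		if (len & 2) {
			crc = crc32c_u16(crc, get_unaligned_le16(p));	/* crc32w */
			p += 2;
		}
		if (len & 1)
			crc = crc32c_u8(crc, *p);			/* crc32b */
		return crc;
	}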
2025-03-05  x86/delay: Fix inconsistent whitespace  [Charles Han, 1 file, +1/-1]
Smatch warns about this whitespace damage: arch/x86/lib/delay.c:134 delay_halt_mwaitx() warn: inconsistent indenting Signed-off-by: Charles Han <hanchunchao@inspur.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250305063515.3951-1-hanchunchao@inspur.com
2025-03-04  x86/retbleed: Move call depth to percpu hot section  [Brian Gerst, 1 file, +1/-1]
No functional change. Signed-off-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Uros Bizjak <ubizjak@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250303165246.2175811-6-brgerst@gmail.com
2025-03-04  Merge branch 'x86/asm' into x86/core, to pick up dependent commits  [Ingo Molnar, 2 files, +2/-2]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-02-28  x86/cpufeatures: Rename X86_CMPXCHG64 to X86_CX8  [H. Peter Anvin (Intel), 2 files, +2/-2]
Replace X86_CMPXCHG64 with X86_CX8, as CX8 is the name of the CPUID flag, thus to make it consistent with X86_FEATURE_CX8 defined in <asm/cpufeatures.h>. No functional change intended. Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com> Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20250228082338.73859-2-xin@zytor.com
2025-02-26  x86/bugs: KVM: Add support for SRSO_MSR_FIX  [Borislav Petkov, 1 file, +2/-0]
Add support for CPUID Fn8000_0021_EAX[31] (SRSO_MSR_FIX). If this bit is 1, it indicates that software may use MSR BP_CFG[BpSpecReduce] to mitigate SRSO.

Enable BpSpecReduce to mitigate SRSO across guest/host boundaries: set the bit when virtualization is enabled and clear it when virtualization is disabled. Using an MSR slot instead would clear the bit when the guest is exited, and any training the guest has done could then potentially influence the host kernel when execution enters the kernel and hasn't VMRUN the guest yet.

More detail on the public thread in the Link below.

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241202120416.6054-1-bp@kernel.org
2025-02-26  x86/bhi: Add BHI stubs  [Peter Zijlstra, 2 files, +149/-1]
Add an array of code thunks, to be called from the FineIBT preamble, clobbering the first 'n' argument registers for speculative execution. Notably the 0th entry will clobber no argument registers and will never be used, it exists so the array can be naturally indexed, while the 7th entry will clobber all the 6 argument registers and also RSP in order to mess up stack based arguments. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Kees Cook <kees@kernel.org> Link: https://lore.kernel.org/r/20250224124200.717378681@infradead.org
2025-02-23  x86/usercopy: Fix kernel-doc func param name in clean_cache_range()'s description  [Randy Dunlap, 1 file, +1/-1]
Use @addr instead of @vaddr in the kernel-doc comment for clean_cache_range() to eliminate warnings:

  arch/x86/lib/usercopy_64.c:29: warning: Function parameter or struct member 'addr' not described in 'clean_cache_range'
  arch/x86/lib/usercopy_64.c:29: warning: Excess function parameter 'vaddr' description in 'clean_cache_range'

Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250111063333.911084-1-rdunlap@infradead.org
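For reference, the fix amounts to making the kernel-doc parameter tag match the actual parameter name, roughly (the descriptive wording below is illustrative; the @addr tag is the point):

	/**
	 * clean_cache_range - write back a cache range with CLWB
	 * @addr:	virtual start address of the range
	 * @size:	number of bytes to write back
	 *
	 * Write back a cache range using the CLWB (cache line write back)
	 * instruction.
	 */
	static void clean_cache_range(void *addr, size_t size)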
2025-02-19  x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings  [Eric Biggers, 1 file, +5/-0]
The assembly functions generated by crc-pclmul-template.S are called only via static_call, so they do not need to begin with an endbr instruction. But objtool still warns about a missing endbr by default. Add ANNOTATE_NOENDBR to suppress these warnings:

  vmlinux.o: warning: objtool: crc32_x86_init+0x1c0: relocation to !ENDBR: crc32_lsb_vpclmul_avx10_256+0x0
  vmlinux.o: warning: objtool: crc64_x86_init+0x183: relocation to !ENDBR: crc64_msb_vpclmul_avx10_256+0x0
  vmlinux.o: warning: objtool: crc_t10dif_x86_init+0x183: relocation to !ENDBR: crc16_msb_vpclmul_avx10_256+0x0
  vmlinux.o: warning: objtool: __SCK__crc32_lsb_pclmul+0x0: data relocation to !ENDBR: crc32_lsb_pclmul_sse+0x0
  vmlinux.o: warning: objtool: __SCK__crc64_lsb_pclmul+0x0: data relocation to !ENDBR: crc64_lsb_pclmul_sse+0x0
  vmlinux.o: warning: objtool: __SCK__crc64_msb_pclmul+0x0: data relocation to !ENDBR: crc64_msb_pclmul_sse+0x0
  vmlinux.o: warning: objtool: __SCK__crc16_msb_pclmul+0x0: data relocation to !ENDBR: crc16_msb_pclmul_sse+0x0

Fixes: 8d2d3e72e35b ("x86/crc: add "template" for [V]PCLMULQDQ based CRC functions")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/r/20250217170555.3d14df62@canb.auug.org.au/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250217193230.100443-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
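The annotation just marks the entry point so objtool knows the missing ENDBR is intentional; a hedged sketch of the placement on a made-up helper (the real functions are generated by crc-pclmul-template.S):

	#include <linux/linkage.h>
	#include <linux/objtool.h>

	/* Reached only via static_call, never by an indirect call, so no ENDBR. */
	SYM_FUNC_START(crc_demo_noendbr)
		ANNOTATE_NOENDBR
		movl	%edi, %eax	/* trivial body, for illustration only */
		RET
	SYM_FUNC_END(crc_demo_noendbr)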
2025-02-14  x86/cfi: Clean up linkage  [Peter Zijlstra, 7 files, +29/-0]
With the introduction of kCFI the addition of ENDBR to SYM_FUNC_START* no longer suffices to make the function indirectly callable. This now requires the use of SYM_TYPED_FUNC_START. As such, remove the implicit ENDBR from SYM_FUNC_START* and add some explicit annotations to fix things up again. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Link: https://lore.kernel.org/r/20250207122546.409116003@infradead.org
2025-02-14  x86,kcfi: Fix EXPORT_SYMBOL vs kCFI  [Peter Zijlstra, 5 files, +12/-7]
The expectation is that all EXPORT'ed symbols are free to have their address taken and called indirectly. The majority of the assembly-defined functions currently violate this expectation. Make them all use SYM_TYPED_FUNC_START() in order to emit the proper kCFI preamble.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250207122546.302679189@infradead.org
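The change pattern, sketched on a made-up exported asm helper rather than any specific function in the series:

	#include <linux/export.h>
	#include <linux/linkage.h>
	#include <linux/cfi_types.h>

	/* SYM_TYPED_FUNC_START emits the kCFI type hash ahead of the entry point,
	 * so indirect calls through the exported symbol pass the kCFI check. */
	SYM_TYPED_FUNC_START(my_exported_helper)
		movq	%rdi, %rax
		RET
	SYM_FUNC_END(my_exported_helper)
	EXPORT_SYMBOL(my_exported_helper)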
2025-02-12  x86/crc32: improve crc32c_arch() code generation with clang  [Eric Biggers, 1 file, +2/-2]
crc32c_arch() is affected by https://github.com/llvm/llvm-project/issues/20571 where clang unnecessarily spills the inputs to "rm"-constrained operands to the stack. Replace "rm" with ASM_INPUT_RM which partially works around this by expanding to "r" when the compiler is clang. This results in better code generation with clang, though still not optimal. Link: https://lore.kernel.org/r/20250210210741.471725-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
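The pattern in question, sketched as a generic crc32 step rather than the exact statement in the x86 crc32 glue code (ASM_INPUT_RM expands to "rm" for gcc and to "r" for clang):

	#include <linux/compiler.h>	/* ASM_INPUT_RM */
	#include <linux/types.h>

	static inline u64 crc32c_step(u64 crc, u64 data)
	{
		/* With a plain "rm" constraint clang tends to spill 'data' to the
		 * stack first; ASM_INPUT_RM asks clang for a register instead. */
		asm("crc32q %1, %0" : "+r" (crc) : ASM_INPUT_RM (data));
		return crc;
	}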
2025-02-10  x86/crc64: implement crc64_be and crc64_nvme using new template  [Eric Biggers, 4 files, +157/-1]
Add x86_64 [V]PCLMULQDQ optimized implementations of crc64_be() and crc64_nvme() by wiring them up to crc-pclmul-template.S.

crc64_be() is used by bcache and bcachefs, and crc64_nvme() is used by blk-integrity. Both features can CRC large amounts of data, and the developers of both features have expressed interest in having these CRCs be optimized. So this optimization should be worthwhile. (See https://lore.kernel.org/r/v36sousjd5ukqlkpdxslvpu7l37zbu7d7slgc2trjjqwty2bny@qgzew34feo2r and https://lore.kernel.org/r/20220222163144.1782447-11-kbusch@kernel.org)

Benchmark results on AMD Ryzen 9 9950X (Zen 5) using crc_kunit:

  crc64_be:

    Length    Before       After
    ------    ------       -----
         1    633 MB/s     477 MB/s
        16    717 MB/s     2517 MB/s
        64    715 MB/s     7525 MB/s
       127    714 MB/s     10002 MB/s
       128    713 MB/s     13344 MB/s
       200    715 MB/s     15752 MB/s
       256    714 MB/s     22933 MB/s
       511    715 MB/s     28025 MB/s
       512    714 MB/s     49772 MB/s
      1024    715 MB/s     65261 MB/s
      3173    714 MB/s     78773 MB/s
      4096    714 MB/s     83315 MB/s
     16384    714 MB/s     89487 MB/s

  crc64_nvme:

    Length    Before       After
    ------    ------       -----
         1    716 MB/s     474 MB/s
        16    717 MB/s     3303 MB/s
        64    713 MB/s     7940 MB/s
       127    715 MB/s     9867 MB/s
       128    714 MB/s     13698 MB/s
       200    715 MB/s     15995 MB/s
       256    714 MB/s     23479 MB/s
       511    714 MB/s     28013 MB/s
       512    715 MB/s     51533 MB/s
      1024    715 MB/s     66788 MB/s
      3173    715 MB/s     79182 MB/s
      4096    715 MB/s     83966 MB/s
     16384    715 MB/s     89739 MB/s

Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250210174540.161705-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-02-10  x86/crc-t10dif: implement crc_t10dif using new template  [Eric Biggers, 5 files, +63/-348]
Instantiate crc-pclmul-template.S for crc_t10dif and delete the original PCLMULQDQ optimized implementation.

This has the following advantages:

 - Less CRC-variant-specific code.
 - VPCLMULQDQ support, greatly improving performance on sufficiently long messages on newer CPUs.
 - A faster reduction from 128 bits to the final CRC.
 - Support for i386.

Benchmark results on AMD Ryzen 9 9950X (Zen 5) using crc_kunit:

    Length    Before        After
    ------    ------        -----
         1    440 MB/s      386 MB/s
        16    1865 MB/s     2008 MB/s
        64    4343 MB/s     6917 MB/s
       127    5440 MB/s     8909 MB/s
       128    5533 MB/s     12150 MB/s
       200    5908 MB/s     14423 MB/s
       256    15870 MB/s    21288 MB/s
       511    14219 MB/s    25840 MB/s
       512    18361 MB/s    37797 MB/s
      1024    19941 MB/s    61374 MB/s
      3173    20461 MB/s    74909 MB/s
      4096    21310 MB/s    78919 MB/s
     16384    21663 MB/s    85012 MB/s

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250210174540.161705-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-02-10  x86/crc32: implement crc32_le using new template  [Eric Biggers, 3 files, +65/-244]
Instantiate crc-pclmul-template.S for crc32_le, and delete the original PCLMULQDQ optimized implementation.

This has the following advantages:

 - Less CRC-variant-specific code.
 - VPCLMULQDQ support, greatly improving performance on sufficiently long messages on newer CPUs.
 - A faster reduction from 128 bits to the final CRC.
 - Support for lengths not a multiple of 16 bytes, improving performance for such lengths.
 - Support for misaligned buffers, improving performance in such cases.

Benchmark results on AMD Ryzen 9 9950X (Zen 5) using crc_kunit:

    Length    Before        After
    ------    ------        -----
         1    427 MB/s      605 MB/s
        16    710 MB/s      3631 MB/s
        64    704 MB/s      7615 MB/s
       127    3610 MB/s     9710 MB/s
       128    8759 MB/s     12702 MB/s
       200    7083 MB/s     15343 MB/s
       256    17284 MB/s    22904 MB/s
       511    10919 MB/s    27309 MB/s
       512    19849 MB/s    48900 MB/s
      1024    21216 MB/s    62630 MB/s
      3173    22150 MB/s    72437 MB/s
      4096    22496 MB/s    79593 MB/s
     16384    22018 MB/s    85106 MB/s

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250210174540.161705-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-02-10  x86/crc: add "template" for [V]PCLMULQDQ based CRC functions  [Eric Biggers, 2 files, +665/-0]
The Linux kernel implements many variants of CRC, such as crc16, crc_t10dif, crc32_le, crc32c, crc32_be, crc64_nvme, and crc64_be. On x86, except for crc32c which has special scalar instructions, the fastest way to compute any of these CRCs on any message of length roughly >= 16 bytes is to use the SIMD carryless multiplication instructions PCLMULQDQ or VPCLMULQDQ. Depending on the available CPU features this can mean PCLMULQDQ+SSE4.1, VPCLMULQDQ+AVX2, VPCLMULQDQ+AVX10/256, or VPCLMULQDQ+AVX10/512 (or the AVX512 equivalents to AVX10/*). This results in a total of 20+ CRC implementations being potentially needed to properly optimize all CRCs that someone cares about for x86. Besides crc32c, currently only crc32_le and crc_t10dif are actually optimized for x86, and they only use PCLMULQDQ, which means they can be 2-4x slower than what is possible with VPCLMULQDQ. Fortunately, at a high level the code that is needed for any [V]PCLMULQDQ based CRC implementation is mostly the same. Therefore, this patch introduces an assembly macro that expands into the body of a [V]PCLMULQDQ based CRC function for a given number of bits (8, 16, 32, or 64), bit order (lsb or msb-first), vector length, and AVX level. The function expects to be passed a constants table, specific to the polynomial desired, that was generated by the script previously added. When two CRC variants share the same number of bits and bit order, the same functions can be reused, with only the constants table differing. A new C header is also added to make it easy to integrate the new assembly code using a static call. The result is that it becomes straightforward to wire up an optimized implementation of any CRC-8, CRC-16, CRC-32, or CRC-64 for x86. Later patches will wire up specific CRC variants. Although this new template allows easily generating many functions, care was taken to still keep the binary size fairly low. Each generated function is only 550 to 850 bytes depending on the CRC variant and target CPU features. And only one function per CRC variant is actually used at runtime (since all functions support all lengths >= 16 bytes). Note that a similar approach should also work for other architectures that have carryless multiplication instructions, such as arm64. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250210174540.161705-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
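The C-side integration described above boils down to one static_call per CRC variant, pointed at the best generated function at boot; a simplified sketch of the idea (the names below are generic placeholders, not the actual crc-pclmul-template.h macros):

	#include <linux/linkage.h>
	#include <linux/static_call.h>
	#include <linux/types.h>
	#include <asm/cpufeature.h>

	/* One assembly body per (bit width, bit order, vector ISA); each takes the
	 * initial CRC, the buffer, the length and the polynomial's constant table. */
	asmlinkage u32 crc32_lsb_pclmul_sse(u32 crc, const u8 *p, size_t len, const void *consts);
	asmlinkage u32 crc32_lsb_vpclmul_avx512(u32 crc, const u8 *p, size_t len, const void *consts);
	extern const u8 crc32_lsb_consts[];	/* generated constants for this polynomial */

	DEFINE_STATIC_CALL(crc32_lsb_impl, crc32_lsb_pclmul_sse);

	static void __init crc32_pclmul_init(void)
	{
		if (boot_cpu_has(X86_FEATURE_VPCLMULQDQ) && boot_cpu_has(X86_FEATURE_AVX512VL))
			static_call_update(crc32_lsb_impl, crc32_lsb_vpclmul_avx512);
	}

	u32 crc32_le_arch(u32 crc, const u8 *p, size_t len)
	{
		if (len < 16)
			return crc32_le_base(crc, p, len);	/* generic fallback for short inputs */
		return static_call(crc32_lsb_impl)(crc, p, len, crc32_lsb_consts);
	}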
2025-02-09  lib/crc-t10dif: remove crc_t10dif_is_optimized()  [Eric Biggers, 1 file, +0/-6]
With the "crct10dif" algorithm having been removed from the crypto API, crc_t10dif_is_optimized() is no longer used. Acked-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250208175647.12333-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-02-09  lib/crc32: remove "_le" from crc32c base and arch functions  [Eric Biggers, 1 file, +3/-3]
Following the standardization on crc32c() as the lib entry point for the Castagnoli CRC32 instead of the previous mix of crc32c(), crc32c_le(), and __crc32c_le(), make the same change to the underlying base and arch functions that implement it. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20250208024911.14936-7-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-01-23  Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux  [Linus Torvalds, 6 files, +1091/-0]
Pull CRC updates from Eric Biggers:

 - Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be directly accessible via the library API, instead of requiring the crypto API. This is much simpler and more efficient.

 - Convert some users such as ext4 to use the CRC32 library API instead of the crypto API. More conversions like this will come later.

 - Add a KUnit test that tests and benchmarks multiple CRC variants. Remove older, less-comprehensive tests that are made redundant by this.

 - Add an entry to MAINTAINERS for the kernel's CRC library code. I'm volunteering to maintain it. I have additional cleanups and optimizations planned for future cycles.

* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (31 commits)
  MAINTAINERS: add entry for CRC library
  powerpc/crc: delete obsolete crc-vpmsum_test.c
  lib/crc32test: delete obsolete crc32test.c
  lib/crc16_kunit: delete obsolete crc16_kunit.c
  lib/crc_kunit.c: add KUnit test suite for CRC library functions
  powerpc/crc-t10dif: expose CRC-T10DIF function through lib
  arm64/crc-t10dif: expose CRC-T10DIF function through lib
  arm/crc-t10dif: expose CRC-T10DIF function through lib
  x86/crc-t10dif: expose CRC-T10DIF function through lib
  crypto: crct10dif - expose arch-optimized lib function
  lib/crc-t10dif: add support for arch overrides
  lib/crc-t10dif: stop wrapping the crypto API
  scsi: target: iscsi: switch to using the crc32c library
  f2fs: switch to using the crc32 library
  jbd2: switch to using the crc32c library
  ext4: switch to using the crc32c library
  lib/crc32: make crc32c() go directly to lib
  bcachefs: Explicitly select CRYPTO from BCACHEFS_FS
  x86/crc32: expose CRC32 functions through lib
  x86/crc32: update prototype for crc32_pclmul_le_16()
  ...
2025-01-20  x86: use cmov for user address masking  [Linus Torvalds, 1 file, +2/-3]
This was a suggestion by David Laight, and while I was slightly worried that some micro-architecture would predict cmov like a conditional branch, there is little reason to actually believe any core would be that broken. Intel documents that their existing cores treat CMOVcc as a data dependency that will constrain speculation in their "Speculative Execution Side Channel Mitigations" whitepaper: "Other instructions such as CMOVcc, AND, ADC, SBB and SETcc can also be used to prevent bounds check bypass by constraining speculative execution on current family 6 processors (Intel® Core™, Intel® Atom™, Intel® Xeon® and Intel® Xeon Phi™ processors)" and while that leaves the future uarch issues open, that's certainly true of our traditional SBB usage too. Any core that predicts CMOV will be unusable for various crypto algorithms that need data-independent timing stability, so let's just treat CMOV as the safe choice that simplifies the address masking by avoiding an extra instruction and doesn't need a temporary register. Suggested-by: David Laight <David.Laight@aculab.com> Link: https://www.intel.com/content/dam/develop/external/us/en/documents/336996-speculative-execution-side-channel-mitigations.pdf Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
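For illustration, the cmov-based masking looks roughly like the following (a sketch of the idea behind mask_user_address(), assuming a USER_PTR_MAX runtime constant; the exact upstream helper may differ in detail):

	/*
	 * Clamp a user pointer to the maximum valid user address using CMOVA,
	 * so an out-of-range pointer cannot steer a speculative kernel access.
	 */
	static inline void __user *mask_user_address(const void __user *ptr)
	{
		void __user *ret;

		asm("cmp %1, %0\n\t"
		    "cmova %1, %0"	/* if (ptr > USER_PTR_MAX) ptr = USER_PTR_MAX */
		    : "=r" (ret)
		    : "r" (runtime_const_ptr(USER_PTR_MAX)), "0" (ptr));
		return ret;
	}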
2024-12-02  x86/crc-t10dif: expose CRC-T10DIF function through lib  [Eric Biggers, 3 files, +386/-0]
Move the x86 CRC-T10DIF assembly code into the lib directory and wire it up to the library interface. This allows it to be used without going through the crypto API. It remains usable via the crypto API too via the shash algorithms that use the library interface. Thus all the arch-specific "shash" code becomes unnecessary and is removed. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241202012056.209768-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-12-02  x86/crc32: expose CRC32 functions through lib  [Eric Biggers, 4 files, +705/-0]
Move the x86 CRC32 assembly code into the lib directory and wire it up to the library interface. This allows it to be used without going through the crypto API. It remains usable via the crypto API too via the shash algorithms that use the library interface. Thus all the arch-specific "shash" code becomes unnecessary and is removed. Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20241202010844.144356-14-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2024-10-25  x86: fix user address masking non-canonical speculation issue  [Linus Torvalds, 1 file, +7/-2]
It turns out that AMD has a "Meltdown Lite(tm)" issue with non-canonical accesses in kernel space. And so using just the high bit to decide whether an access is in user space or kernel space ends up with the good old "leak speculative data" if you have the right gadget using the result: CVE-2020-12965 “Transient Execution of Non-Canonical Accesses“ Now, the kernel surrounds the access with a STAC/CLAC pair, and those instructions end up serializing execution on older Zen architectures, which closes the speculation window. But that was true only up until Zen 5, which renames the AC bit [1]. That improves performance of STAC/CLAC a lot, but also means that the speculation window is now open. Note that this affects not just the new address masking, but also the regular valid_user_address() check used by access_ok(), and the asm version of the sign bit check in the get_user() helpers. It does not affect put_user() or clear_user() variants, since there's no speculative result to be used in a gadget for those operations. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Link: https://lore.kernel.org/all/80d94591-1297-4afb-b510-c665efd37f10@citrix.com/ Link: https://lore.kernel.org/all/20241023094448.GAZxjFkEOOF_DM83TQ@fat_crate.local/ [1] Link: https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1010.html Link: https://arxiv.org/pdf/2108.10771 Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> # LAM case Fixes: 2865baf54077 ("x86: support user address masking instead of non-speculative conditional") Fixes: 6014bc27561f ("x86-64: make access_ok() independent of LAM") Fixes: b19b74bc99b1 ("x86/mm: Rework address range check in get_user() and put_user()") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-03  move asm/unaligned.h to linux/unaligned.h  [Al Viro, 1 file, +1/-1]
asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header.

auto-generated by the following:

  for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
  done
  for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
  done
  git mv include/asm-generic/unaligned.h include/linux/unaligned.h
  git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
  sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
  sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-09-29  Merge branch 'locking/core' into locking/urgent, to pick up pending commits  [Ingo Molnar, 1 file, +7/-2]
Merge all pending locking commits into a single branch. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-08-01  x86/uaccess: Zero the 8-byte get_range case on failure on 32-bit  [David Gow, 1 file, +3/-1]
While zeroing the upper 32 bits of an 8-byte getuser on 32-bit x86 was fixed by commit 8c860ed825cb ("x86/uaccess: Fix missed zeroing of ia32 u64 get_user() range checking") it was broken again in commit 8a2462df1547 ("x86/uaccess: Improve the 8-byte getuser() case"). This is because the register which holds the upper 32 bits (%ecx) is being cleared _after_ the check_range, so if the range check fails, %ecx is never cleared. This can be reproduced with: ./tools/testing/kunit/kunit.py run --arch i386 usercopy Instead, clear %ecx _before_ check_range in the 8-byte case. This reintroduces a bit of the ugliness we were trying to avoid by adding another #ifndef CONFIG_X86_64, but at least keeps check_range from needing a separate bad_get_user_8 jump. Fixes: 8a2462df1547 ("x86/uaccess: Improve the 8-byte getuser() case") Signed-off-by: David Gow <davidgow@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/all/20240731073031.4045579-1-davidgow@google.com
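Conceptually the fix is just instruction ordering in the 32-bit path of getuser.S; a structural sketch (not the verbatim upstream code, and the body of the access is elided):

	SYM_FUNC_START(__get_user_8)
	#ifndef CONFIG_X86_64
		xor %ecx,%ecx		/* zero the high half first, so a failed    */
	#endif				/* range check still returns 0 in %ecx:%edx */
		check_range size=8
		ASM_STAC
		...			/* load the low/high halves into %edx/%ecx  */
	SYM_FUNC_END(__get_user_8)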
2024-07-31  x86/setup: Parse the builtin command line before merging  [Borislav Petkov (AMD), 1 file, +18/-7]
Commit in Fixes was added as a catch-all for cases where the cmdline is parsed before being merged with the builtin one. And promptly one issue appeared, see Link below. The microcode loader really needs to parse it that early, but the merging happens later. Reshuffling the early boot nightmare^W code to handle that properly would be a painful exercise for another day so do the chicken thing and parse the builtin cmdline too before it has been merged. Fixes: 0c40b1c7a897 ("x86/setup: Warn when option parsing is done too early") Reported-by: Mike Lothian <mike@fireburn.co.uk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20240730152108.GAZqkE5Dfi9AuKllRw@fat_crate.local Link: https://lore.kernel.org/r/20240722152330.GCZp55ck8E_FT4kPnC@fat_crate.local
2024-07-17  locking/atomic/x86: Introduce the read64_nonatomic macro to x86_32 with cx8  [Uros Bizjak, 1 file, +7/-2]
As described in commit:

  e73c4e34a0e9 ("locking/atomic/x86: Introduce arch_atomic64_read_nonatomic() to x86_32")

the value preload before the CMPXCHG loop does not need to be atomic. Introduce the read64_nonatomic assembly macro to load the value from an atomic_t location in a faster non-atomic way, and use it in atomic64_cx8_32.S.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240605181424.3228-1-ubizjak@gmail.com
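On 32-bit x86 a non-atomic 64-bit read is simply two 32-bit loads into the edx:eax pair; a sketch of what such a macro looks like (the actual macro in atomic64_cx8_32.S may differ slightly):

	/* Load a 64-bit value non-atomically into %edx:%eax.
	 * Good enough as the preload for a cmpxchg8b retry loop,
	 * which will simply retry if the value was torn. */
	.macro read64_nonatomic v
	movl (\v), %eax
	movl 4(\v), %edx
	.endm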
2024-07-16  Merge tag 'x86_misc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds, 1 file, +4/-1]
Pull misc x86 updates from Borislav Petkov:

 - Make error checking of AMD SMN accesses more robust in the callers as they're the only ones who can interpret the results properly

 - The usual cleanups and fixes, left and right

* tag 'x86_misc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/kmsan: Fix hook for unaligned accesses
  x86/platform/iosf_mbi: Convert PCIBIOS_* return codes to errnos
  x86/pci/xen: Fix PCIBIOS_* return code handling
  x86/pci/intel_mid_pci: Fix PCIBIOS_* return code handling
  x86/of: Return consistent error type from x86_of_pci_irq_enable()
  hwmon: (k10temp) Rename _data variable
  hwmon: (k10temp) Remove unused HAVE_TDIE() macro
  hwmon: (k10temp) Reduce k10temp_get_ccd_support() parameters
  hwmon: (k10temp) Define a helper function to read CCD temperature
  x86/amd_nb: Enhance SMN access error checking
  hwmon: (k10temp) Check return value of amd_smn_read()
  EDAC/amd64: Check return value of amd_smn_read()
  EDAC/amd64: Remove unused register accesses
  tools/x86/kcpuid: Add missing dir via Makefile
  x86, arm: Add missing license tag to syscall tables files
2024-07-16  Merge tag 'x86_core_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds, 1 file, +20/-49]
Pull x86 uaccess update from Borislav Petkov:

 - Cleanup the 8-byte getuser() asm case

* tag 'x86_core_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/uaccess: Improve the 8-byte getuser() case
2024-07-16  Merge tag 'x86_boot_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds, 1 file, +8/-0]
Pull x86 boot updates from Borislav Petkov:

 - Add a check to warn when cmdline parsing happens before the final cmdline string has been built and thus arguments can get lost

 - Code cleanups and simplifications

* tag 'x86_boot_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/setup: Warn when option parsing is done too early
  x86/boot: Clean up the arch/x86/boot/main.c code a bit
  x86/boot: Use current_stack_pointer to avoid asm() in init_heap()
2024-06-25  x86/kmsan: Fix hook for unaligned accesses  [Brian Johannesmeyer, 1 file, +4/-1]
When called with a 'from' that is not 4-byte-aligned, string_memcpy_fromio() calls the movs() macro to copy the first few bytes, so that 'from' becomes 4-byte-aligned before calling rep_movs(). This movs() macro modifies 'to', and the subsequent line modifies 'n'. As a result, on unaligned accesses, kmsan_unpoison_memory() uses the updated (aligned) values of 'to' and 'n'. Hence, it does not unpoison the entire region. Save the original values of 'to' and 'n', and pass those to kmsan_unpoison_memory(), so that the entire region is unpoisoned. Signed-off-by: Brian Johannesmeyer <bjohannesmeyer@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/r/20240523215029.4160518-1-bjohannesmeyer@gmail.com
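The shape of the fix, sketched in C (simplified: align_head_bytes() stands in for the movs()-based head alignment in the real string_memcpy_fromio(), and is named here only for illustration):

	static void string_memcpy_fromio(void *to, const volatile void __iomem *from, size_t n)
	{
		const void *orig_to = to;	/* 'to' and 'n' are adjusted below ...      */
		const size_t orig_n = n;	/* ... so save the caller's original region */

		if (unlikely((unsigned long)from & 3))
			align_head_bytes(&to, &from, &n);	/* byte copies until 'from' is 4-byte aligned */

		rep_movs(to, (const void *)from, n);

		/* unpoison the whole destination, not just the aligned tail */
		kmsan_unpoison_memory(orig_to, orig_n);
	}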
2024-06-19  x86/uaccess: Improve the 8-byte getuser() case  [Linus Torvalds, 1 file, +20/-49]
Streamline the 8-byte case and drop the special handling. Use a macro which hides the exception handling. No functional changes. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/CAHk-=whYb2L_atsRk9pBiFiVLGe5wNZLHhRinA69yu6FiKvDsw@mail.gmail.com
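The macro in question wraps a single user access together with its exception-table entry; roughly (a hedged sketch of the idea, not necessarily the exact upstream definition):

	/* One annotated user access: faults are routed to the common handler. */
	.macro UACCESS op src dst
	1:	\op \src,\dst
		_ASM_EXTABLE_UA(1b, __get_user_handle_exception)
	.endm

	/* ... which lets the 8-byte 32-bit case become two plain loads: */
	UACCESS movl (%_ASM_AX),%edx
	UACCESS movl 4(%_ASM_AX),%ecx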
2024-06-12  x86/uaccess: Fix missed zeroing of ia32 u64 get_user() range checking  [Kees Cook, 1 file, +5/-1]
When reworking the range checking for get_user(), the get_user_8() case on 32-bit wasn't zeroing the high register. (The jump to bad_get_user_8 was accidentally dropped.) Restore the correct error handling destination (and rename the jump to using the expected ".L" prefix). While here, switch to using a named argument ("size") for the call template ("%c4" to "%c[size]") as already used in the other call templates in this file. Found after moving the usercopy selftests to KUnit: # usercopy_test_invalid: EXPECTATION FAILED at lib/usercopy_kunit.c:278 Expected val_u64 == 0, but val_u64 == -60129542144 (0xfffffff200000000) Closes: https://lore.kernel.org/all/CABVgOSn=tb=Lj9SxHuT4_9MTjjKVxsq-ikdXC4kGHO4CfKVmGQ@mail.gmail.com Fixes: b19b74bc99b1 ("x86/mm: Rework address range check in get_user() and put_user()") Reported-by: David Gow <davidgow@google.com> Signed-off-by: Kees Cook <kees@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Tested-by: David Gow <davidgow@google.com> Link: https://lore.kernel.org/all/20240610210213.work.143-kees%40kernel.org
2024-05-27  x86/setup: Warn when option parsing is done too early  [Borislav Petkov (AMD), 1 file, +8/-0]
Commit 4faa0e5d6d79 ("x86/boot: Move kernel cmdline setup earlier in the boot process (again)") fixed an issue where cmdline parsing would happen before the final boot_command_line string has been built from the builtin and boot cmdlines, and thus cmdline arguments would get lost.

Add a check to catch any future wrong use ordering so that such issues can be caught in time.

Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240409152541.GCZhVd9XIPXyTNd9vc@fat_crate.local
2024-05-20  Merge tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  [Linus Torvalds, 1 file, +17/-4]
Pull non-mm updates from Andrew Morton:
 "Mainly singleton patches, documented in their respective changelogs. Notable series include:

   - Some maintenance and performance work for ocfs2 in Heming Zhao's series "improve write IO performance when fragmentation is high".

   - Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes exposed by fstests".

   - kfifo header rework from Andy Shevchenko in the series "kfifo: Clean up kfifo.h".

   - GDB script fixes from Florian Rommel in the series "scripts/gdb: Fixes for $lx_current and $lx_per_cpu".

   - After much discussion, a coding-style update from Barry Song explaining one reason why inline functions are preferred over macros. The series is "codingstyle: avoid unused parameters for a function-like macro""

* tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (62 commits)
  fs/proc: fix softlockup in __read_vmcore
  nilfs2: convert BUG_ON() in nilfs_finish_roll_forward() to WARN_ON()
  scripts: checkpatch: check unused parameters for function-like macro
  Documentation: coding-style: ask function-like macros to evaluate parameters
  nilfs2: use __field_struct() for a bitwise field
  selftests/kcmp: remove unused open mode
  nilfs2: remove calls to folio_set_error() and folio_clear_error()
  kernel/watchdog_perf.c: tidy up kerneldoc
  watchdog: allow nmi watchdog to use raw perf event
  watchdog: handle comma separated nmi_watchdog command line
  nilfs2: make superblock data array index computation sparse friendly
  squashfs: remove calls to set the folio error flag
  squashfs: convert squashfs_symlink_read_folio to use folio APIs
  scripts/gdb: fix detection of current CPU in KGDB
  scripts/gdb: make get_thread_info accept pointers
  scripts/gdb: fix parameter handling in $lx_per_cpu
  scripts/gdb: fix failing KGDB detection during probe
  kfifo: don't use "proxy" headers
  media: stih-cec: add missing io.h
  media: rc: add missing io.h
  ...
2024-05-19  Merge tag 'perf-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  [Linus Torvalds, 2 files, +253/-91]
Pull perf event updates from Ingo Molnar:

 - Extend the x86 instruction decoder with APX and other new instructions

 - Misc cleanups

* tag 'perf-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/cstate: Remove unused 'struct perf_cstate_msr'
  perf/x86/rapl: Rename 'maxdie' to nr_rapl_pmu and 'dieid' to rapl_pmu_idx
  x86/insn: Add support for APX EVEX instructions to the opcode map
  x86/insn: Add support for APX EVEX to the instruction decoder logic
  x86/insn: Add support for REX2 prefix to the instruction decoder opcode map
  x86/insn: Add support for REX2 prefix to the instruction decoder logic
  x86/insn: Add misc new Intel instructions
  x86/insn: Add VEX versions of VPDPBUSD, VPDPBUSDS, VPDPWSSD and VPDPWSSDS
  x86/insn: Fix PUSH instruction in x86 instruction decoder opcode map
  x86/insn: Add Key Locker instructions to the opcode map
2024-05-18  Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma  [Linus Torvalds, 2 files, +0/-16]
Pull rdma updates from Jason Gunthorpe: "Aside from the usual things this has an arch update for __iowrite64_copy() used by the RDMA drivers. This API was intended to generate large 64 byte MemWr TLPs on PCI. These days most processors had done this by just repeating writel() in a loop. S390 and some new ARM64 designs require a special helper to get this to generate. - Small improvements and fixes for erdma, efa, hfi1, bnxt_re - Fix a UAF crash after module unload on leaking restrack entry - Continue adding full RDMA support in mana with support for EQs, GID's and CQs - Improvements to the mkey cache in mlx5 - DSCP traffic class support in hns and several bug fixes - Cap the maximum number of MADs in the receive queue to avoid OOM - Another batch of rxe bug fixes from large scale testing - __iowrite64_copy() optimizations for write combining MMIO memory - Remove NULL checks before dev_put/hold() - EFA support for receive with immediate - Fix a recent memleaking regression in a cma error path" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (70 commits) RDMA/cma: Fix kmemleak in rdma_core observed during blktests nvme/rdma use siw RDMA/IPoIB: Fix format truncation compilation errors bnxt_re: avoid shift undefined behavior in bnxt_qplib_alloc_init_hwq RDMA/efa: Support QP with unsolicited write w/ imm. receive IB/hfi1: Remove generic .ndo_get_stats64 IB/hfi1: Do not use custom stat allocator RDMA/hfi1: Use RMW accessors for changing LNKCTL2 RDMA/mana_ib: implement uapi for creation of rnic cq RDMA/mana_ib: boundary check before installing cq callbacks RDMA/mana_ib: introduce a helper to remove cq callbacks RDMA/mana_ib: create and destroy RNIC cqs RDMA/mana_ib: create EQs for RNIC CQs RDMA/core: Remove NULL check before dev_{put, hold} RDMA/ipoib: Remove NULL check before dev_{put, hold} RDMA/mlx5: Remove NULL check before dev_{put, hold} RDMA/mlx5: Track DCT, DCI and REG_UMR QPs as diver_detail resources. RDMA/core: Add an option to display driver-specific QPs in the rdmatool RDMA/efa: Add shutdown notifier RDMA/mana_ib: Fix missing ret value IB/mlx5: Use __iowrite64_copy() for write combining stores ...
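For context on the __iowrite64_copy() point above: the helper copies in 8-byte units so that a 64-byte doorbell/WQE write can be posted as a single write-combined burst; a minimal usage sketch (the MMIO pointer and source buffer are placeholders):

	#include <linux/io.h>

	/* Post one 64-byte work-queue entry to write-combining MMIO.
	 * The last argument counts 8-byte words, so 64 bytes == 8. */
	static void post_wqe(void __iomem *wc_doorbell, const u64 wqe[8])
	{
		__iowrite64_copy(wc_doorbell, wqe, 8);
	}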
2024-05-02  x86/insn: Add support for APX EVEX instructions to the opcode map  [Adrian Hunter, 1 file, +93/-0]
To support APX functionality, the EVEX prefix is used to: - promote legacy instructions - promote VEX instructions - add new instructions Promoted VEX instructions require no extra annotation because the opcodes do not change and the permissive nature of the instruction decoder already allows them to have an EVEX prefix. Promoted legacy instructions and new instructions are placed in map 4 which has not been used before. Create a new table for map 4 and add APX instructions. Annotate SCALABLE instructions with "(es)" - refer to patch "x86/insn: Add support for APX EVEX to the instruction decoder logic". SCALABLE instructions must be represented in both no-prefix (NP) and 66 prefix forms. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240502105853.5338-9-adrian.hunter@intel.com
2024-05-02  x86/insn: Add support for APX EVEX to the instruction decoder logic  [Adrian Hunter, 1 file, +4/-0]
Intel Advanced Performance Extensions (APX) extends the EVEX prefix to support:

 - extended general purpose registers (EGPRs) i.e. r16 to r31
 - Push-Pop Acceleration (PPX) hints
 - new data destination (NDD) register
 - suppress status flags writes (NF) of common instructions
 - new instructions

Refer to the Intel Advanced Performance Extensions (Intel APX) Architecture Specification for details.

The extended EVEX prefix does not need amended instruction decoder logic, except in one area. Some instructions are defined as SCALABLE, which means the EVEX.W bit and EVEX.pp bits are used to determine operand size. Specifically, if an instruction is SCALABLE and EVEX.W is zero, then EVEX.pp value 0 (representing no prefix, NP) means default operand size, whereas EVEX.pp value 1 (representing the 66 prefix) means operand size override, i.e. 16 bits.

Add an attribute (INAT_EVEX_SCALABLE) to identify such instructions, and amend the logic appropriately. Amend the awk script that generates the attribute tables from the opcode map to recognise "(es)" as attribute INAT_EVEX_SCALABLE.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240502105853.5338-8-adrian.hunter@intel.com
2024-05-02  x86/insn: Add support for REX2 prefix to the instruction decoder opcode map  [Adrian Hunter, 1 file, +76/-72]
Support for REX2 has been added to the instruction decoder logic and the awk script that generates the attribute tables from the opcode map.

Add the REX2 prefix byte (0xD5) to the opcode map. Add annotation (!REX2) for map 0/1 opcodes that are reserved under REX2.

Add JMPABS to the opcode map and add annotation (REX2) to identify that it has a mandatory REX2 prefix. A separate opcode attribute table is not needed at this time because JMPABS has the same attribute encoding as the MOV instruction that it shares an opcode with, i.e. INAT_MOFFSET.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240502105853.5338-7-adrian.hunter@intel.com
2024-05-02  x86/insn: Add support for REX2 prefix to the instruction decoder logic  [Adrian Hunter, 1 file, +25/-0]
Intel Advanced Performance Extensions (APX) uses a new 2-byte prefix named REX2 to select extended general purpose registers (EGPRs) i.e. r16 to r31. The REX2 prefix is effectively an extended version of the REX prefix. REX2 and EVEX are also used with PUSH/POP instructions to provide a Push-Pop Acceleration (PPX) hint. With PPX hints, a CPU will attempt to fast-forward register data between matching PUSH and POP instructions. REX2 is valid only with opcodes in maps 0 and 1. Similar extension for other maps is provided by the EVEX prefix, covered in a separate patch. Some opcodes in maps 0 and 1 are reserved under REX2. One of these is used for a new 64-bit absolute direct jump instruction JMPABS. Refer to the Intel Advanced Performance Extensions (Intel APX) Architecture Specification for details. Define a code value for the REX2 prefix (INAT_PFX_REX2), and add attribute flags for opcodes reserved under REX2 (INAT_NO_REX2) and to identify opcodes (only JMPABS) that require a mandatory REX2 prefix (INAT_REX2_VARIANT). Amend logic to read the REX2 prefix and get the opcode attribute for the map number (0 or 1) encoded in the REX2 prefix. Amend the awk script that generates the attribute tables from the opcode map, to recognise "REX2" as attribute INAT_PFX_REX2, and "(!REX2)" as attribute INAT_NO_REX2, and "(REX2)" as attribute INAT_REX2_VARIANT. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240502105853.5338-6-adrian.hunter@intel.com
2024-05-02  x86/insn: Add misc new Intel instructions  [Adrian Hunter, 1 file, +45/-12]
The x86 instruction decoder is used not only for decoding kernel instructions. It is also used by perf uprobes (user space probes) and by perf tools Intel Processor Trace decoding. Consequently, it needs to support instructions executed by user space also. Add instructions documented in Intel Architecture Instruction Set Extensions and Future Features Programming Reference March 2024 319433-052, that have not been added yet: AADD AAND AOR AXOR CMPccXADD PBNDKB RDMSRLIST URDMSR UWRMSR VBCSTNEBF162PS VBCSTNESH2PS VCVTNEEBF162PS VCVTNEEPH2PS VCVTNEOBF162PS VCVTNEOPH2PS VCVTNEPS2BF16 VPDPB[SU,UU,SS]D[,S] VPDPW[SU,US,UU]D[,S] VPMADD52HUQ VPMADD52LUQ VSHA512MSG1 VSHA512MSG2 VSHA512RNDS2 VSM3MSG1 VSM3MSG2 VSM3RNDS2 VSM4KEY4 VSM4RNDS4 WRMSRLIST TCMMIMFP16PS TCMMRLFP16PS TDPFP16PS PREFETCHIT1 PREFETCHIT0 Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240502105853.5338-5-adrian.hunter@intel.com