|
insn_decoder_test found a problem with decoding APX CTEST instructions:
Found an x86 instruction decoder bug, please report this.
ffffffff810021df 62 54 94 05 85 ff ctestneq
objdump says 6 bytes, but insn_get_length() says 5
It happens because x86-opcode-map.txt doesn't specify arguments for the
instruction, so the decoder doesn't expect to see a ModRM byte.
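For reference, a minimal sketch of how such a report can be reproduced with
the in-kernel decoder (assuming the insn_init()/insn_get_length() API from
arch/x86/lib/insn.c; the function below is illustrative and not part of the
patch):

  #include <linux/types.h>
  #include <linux/printk.h>
  #include <asm/insn.h>

  static void decode_ctest_example(void)
  {
          /* The instruction bytes from the report above: ctestneq */
          const u8 buf[] = { 0x62, 0x54, 0x94, 0x05, 0x85, 0xff };
          struct insn insn;

          insn_init(&insn, buf, sizeof(buf), /* x86_64 */ 1);
          if (insn_get_length(&insn))
                  return;         /* decode error */

          /* objdump reports 6 bytes; the unfixed decoder reported 5 */
          pr_info("decoded length: %d\n", insn.length);
  }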
Fixes: 690ca3a3067f ("x86/insn: Add support for APX EVEX instructions to the opcode map")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: stable@vger.kernel.org # v6.10+
Link: https://lore.kernel.org/r/20250423065815.2003231-1-kirill.shutemov@linux.intel.com
|
|
copy_user_generic(), on non-FSRM/ERMS CPUs
History of the performance regression:
======================================
Since the following series of user copy updates were merged upstream
~2 years ago via:
a5624566431d ("Merge branch 'x86-rep-insns': x86 user copy clarifications")
.. copy_user_generic() on x86_64 stopped aligning the writes to the
destination to an 8-byte boundary for the non-FSRM case.
Previously, this was done through the ALIGN_DESTINATION macro that
was used in the now removed copy_user_generic_unrolled function.
It turns out this change causes a loss of performance/throughput in some
use cases, on specific CPUs/platforms without FSRM and ERMS.
Recently I got two reports of performance/throughput issues after a
RHEL 9 kernel pulled the same upstream series with updates to the user
copy functions. Both reports involved specific networking/TCP related
tests using iperf3.
Partial upstream fix
====================
The first report was related to Linux bridge testing using VMs on a
specific machine with an AMD CPU (EPYC 7402), and after a brief
investigation it turned out that the later change via:
ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")
... helped/fixed the performance issue.
However, after the later commit/fix was applied, I got another regression
reported in a multistream TCP test on a 100Gbit mlx5 NIC, also running on
an AMD based platform (AMD EPYC 7302 CPU), again using iperf3 to run the
test. That regression showed up even with the later fix/commit applied,
so that fix alone didn't tell the whole story.
Testing performed to pinpoint residual regression
=================================================
So I narrowed down the second regression use case, running it without
traffic through a NIC, on localhost, to focus on CPU usage and avoid
being limited by other factors like network bandwidth. I used another
system, also with an AMD CPU (AMD EPYC 7742). Basically, I ran iperf3 in
server and client mode on the same system, for example:
- Start the server binding it to CPU core/thread 19:
$ taskset -c 19 iperf3 -D -s -B 127.0.0.1 -p 12000
- Start the client always binding/running on CPU core/thread 17, using
perf to get statistics:
$ perf stat -o stat.txt taskset -c 17 iperf3 -c 127.0.0.1 -b 0/1000 -V \
-n 50G --repeating-payload -l 16384 -p 12000 --cport 12001 2>&1 \
> stat-19.txt
The client was always run/pinned on CPU 17, while the iperf3 server was
pinned to CPU 19, 21 or 23, or not pinned to any specific CPU. So the
test basically consisted of four runs of the same commands, just changing
the CPU to which the server is pinned, or removing the taskset call
before the server command for the unpinned run. The CPUs were chosen
based on the NUMA node they were on; this is the relevant output of
lscpu on the system:
$ lscpu
...
Model name: AMD EPYC 7742 64-Core Processor
...
Caches (sum of all):
L1d: 2 MiB (64 instances)
L1i: 2 MiB (64 instances)
L2: 32 MiB (64 instances)
L3: 256 MiB (16 instances)
NUMA:
NUMA node(s): 4
NUMA node0 CPU(s): 0,1,8,9,16,17,24,25,32,33,40,41,48,49,56,57,64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
NUMA node1 CPU(s): 2,3,10,11,18,19,26,27,34,35,42,43,50,51,58,59,66,67,74,75,82,83,90,91,98,99,106,107,114,115,122,123
NUMA node2 CPU(s): 4,5,12,13,20,21,28,29,36,37,44,45,52,53,60,61,68,69,76,77,84,85,92,93,100,101,108,109,116,117,124,125
NUMA node3 CPU(s): 6,7,14,15,22,23,30,31,38,39,46,47,54,55,62,63,70,71,78,79,86,87,94,95,102,103,110,111,118,119,126,127
...
So for the server runs, when picking a CPU, I chose CPUs that were not on
the same NUMA node as the client. With that, I was able to measure
relevant performance differences when changing the alignment of the
writes to the destination in copy_user_generic().
Testing shows up to +81% performance improvement under iperf3
=============================================================
Here's a summary of the iperf3 runs:
# Vanilla upstream alignment:
CPU RATE SYS TIME sender-receiver
Server bind 19: 13.0Gbits/sec 28.371851000 33.233499566 86.9%-70.8%
Server bind 21: 12.9Gbits/sec 28.283381000 33.586486621 85.8%-69.9%
Server bind 23: 11.1Gbits/sec 33.660190000 39.012243176 87.7%-64.5%
Server bind none: 18.9Gbits/sec 19.215339000 22.875117865 86.0%-80.5%
# With the attached patch (aligning writes in non ERMS/FSRM case):
CPU RATE SYS TIME sender-receiver
Server bind 19: 20.8Gbits/sec 14.897284000 20.811101382 75.7%-89.0%
Server bind 21: 20.4Gbits/sec 15.205055000 21.263165909 75.4%-89.7%
Server bind 23: 20.2Gbits/sec 15.433801000 21.456175000 75.5%-89.8%
Server bind none: 26.1Gbits/sec 12.534022000 16.632447315 79.8%-89.6%
So I consistently got better results when aligning the writes. The
results above were obtained on 6.14.0-rc6/rc7 based kernels. SYS is the
sys time, TIME is the total time to run/transfer 50G of data, and the
last field is the CPU usage of the sender/receiver iperf3 processes.
It's also worth noting that each pair of iperf3 runs may get slightly
different results on each run, but I always got consistently higher
results with the write alignment for this specific test of running the
processes on CPUs in different NUMA nodes.
Linus Torvalds helped with/provided this version of the patch. Initially
I proposed a version which aligned writes for all cases in
rep_movs_alternative, but it used two extra registers. Linus provided an
enhanced version that only aligns the writes in the large_movsq case,
which is sufficient since the problem happens only on AMD CPUs like the
ones mentioned above without ERMS/FSRM, and it also doesn't require
extra registers. I also validated that aligning only the large_movsq
case is really enough to get the performance back.
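As a C-level illustration of the idea (the actual change is in the
rep_movs_alternative assembly; the helper below is only a sketch of
"align the destination, then bulk-copy", not the patch itself):

  #include <stdint.h>
  #include <string.h>

  /* Copy 'len' bytes so that the bulk of the destination writes are
   * 8-byte aligned: first copy the (at most 7) head bytes needed to
   * align 'dst', then copy the rest with an aligned destination. */
  static void copy_with_aligned_dst(void *dst, const void *src, size_t len)
  {
          size_t head = (-(uintptr_t)dst) & 7;  /* bytes until dst is 8-byte aligned */

          if (head > len)
                  head = len;
          memcpy(dst, src, head);
          memcpy((char *)dst + head, (const char *)src + head, len - head);
  }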
I also tested this patch on an old Intel based non-ERMS/FSRM system
(with a Xeon E5-2667 - Sandy Bridge based) and didn't see any problems:
no performance improvement but also no regression, using the same
iperf3 based benchmark. Newer Intel processors after Sandy Bridge
usually have ERMS and should not be affected by this change.
[ mingo: Updated the changelog. ]
Fixes: ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")
Fixes: 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function")
Reported-by: Ondrej Lichtner <olichtne@redhat.com>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herton R. Krzesinski <herton@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250320142213.2623518-1-herton@redhat.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC updates from Eric Biggers:
"Another set of improvements to the kernel's CRC (cyclic redundancy
check) code:
- Rework the CRC64 library functions to be directly optimized, like
what I did last cycle for the CRC32 and CRC-T10DIF library
functions
- Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
support and acceleration for crc64_be and crc64_nvme
- Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
crc_t10dif, crc64_be, and crc64_nvme
- Remove crc_t10dif and crc64_rocksoft from the crypto API, since
they are no longer needed there
- Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect
- Add kunit test cases for crc64_nvme and crc7
- Eliminate redundant functions for calculating the Castagnoli CRC32,
settling on just crc32c()
- Remove unnecessary prompts from some of the CRC kconfig options
- Further optimize the x86 crc32c code"
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits)
x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512
lib/crc: remove unnecessary prompt for CONFIG_CRC64
lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C
lib/crc: remove unnecessary prompt for CONFIG_CRC8
lib/crc: remove unnecessary prompt for CONFIG_CRC7
lib/crc: remove unnecessary prompt for CONFIG_CRC4
lib/crc7: unexport crc7_be_syndrome_table
lib/crc_kunit.c: update comment in crc_benchmark()
lib/crc_kunit.c: add test and benchmark for crc7_be()
x86/crc32: optimize tail handling for crc32c short inputs
riscv/crc64: add Zbc optimized CRC64 functions
riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function
riscv/crc32: reimplement the CRC32 functions using new template
riscv/crc: add "template" for Zbc optimized CRC functions
x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings
x86/crc32: improve crc32c_arch() code generation with clang
x86/crc64: implement crc64_be and crc64_nvme using new template
x86/crc-t10dif: implement crc_t10dif using new template
x86/crc32: implement crc32_le using new template
x86/crc: add "template" for [V]PCLMULQDQ based CRC functions
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 speculation mitigation updates from Borislav Petkov:
- Some preparatory work to convert the mitigations machinery to
mitigating attack vectors instead of single vulnerabilities
- Untangle and remove a now unneeded X86_FEATURE_USE_IBPB flag
- Add support for a Zen5-specific SRSO mitigation
- Cleanups and minor improvements
* tag 'x86_bugs_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Make spectre user default depend on MITIGATION_SPECTRE_V2
x86/bugs: Use the cpu_smt_possible() helper instead of open-coded code
x86/bugs: Add AUTO mitigations for mds/taa/mmio/rfds
x86/bugs: Relocate mds/taa/mmio/rfds defines
x86/bugs: Add X86_BUG_SPECTRE_V2_USER
x86/bugs: Remove X86_FEATURE_USE_IBPB
KVM: nVMX: Always use IBPB to properly virtualize IBRS
x86/bugs: Use a static branch to guard IBPB on vCPU switch
x86/bugs: Remove the X86_FEATURE_USE_IBPB check in ib_prctl_set()
x86/mm: Remove X86_FEATURE_USE_IBPB checks in cond_mitigation()
x86/bugs: Move the X86_FEATURE_USE_IBPB check into callers
x86/bugs: KVM: Add support for SRSO_MSR_FIX
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Ingo Molnar:
"Miscellaneous x86 cleanups by Arnd Bergmann, Charles Han, Mirsad
Todorovac, Randy Dunlap, Thorsten Blum and Zhang Kunbo"
* tag 'x86-cleanups-2025-03-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/coco: Replace 'static const cc_mask' with the newly introduced cc_get_mask() function
x86/delay: Fix inconsistent whitespace
selftests/x86/syscall: Fix coccinelle WARNING recommending the use of ARRAY_SIZE()
x86/platform: Fix missing declaration of 'x86_apple_machine'
x86/irq: Fix missing declaration of 'io_apic_irqs'
x86/usercopy: Fix kernel-doc func param name in clean_cache_range()'s description
x86/apic: Use str_disabled_enabled() helper in print_ipi_mode()
|
|
Intel made a late change to the AVX10 specification that removes support
for a 256-bit maximum vector length and enumeration of the maximum
vector length. AVX10 will imply a maximum vector length of 512 bits.
I.e. there won't be any such thing as AVX10/256 or AVX10/512; there will
just be AVX10, and it will essentially just consolidate AVX512 features.
As a result of this new development, my strategy of providing both
*_avx10_256 and *_avx10_512 functions didn't turn out to be that useful.
The only remaining motivation for the 256-bit AVX512 / AVX10 functions
is to avoid downclocking on older Intel CPUs. But I already wrote
*_avx2 code too (primarily to support CPUs without AVX512), which
performs almost as well as *_avx10_256. So we should just use that.
Therefore, remove the *_avx10_256 CRC functions, and rename the
*_avx10_512 CRC functions to *_avx512. Make Ice Lake and Tiger Lake use
the *_avx2 functions instead of *_avx10_256 which they previously used.
Link: https://lore.kernel.org/r/20250319181316.91271-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Add an assembly macro to refer to a runtime constant. It hides the
linker magic and makes the assembly more readable.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20250304153342.2016569-1-kirill.shutemov@linux.intel.com
|
|
For handling the 0 <= len < sizeof(unsigned long) bytes left at the end,
do a 4-2-1 step-down instead of a byte-at-a-time loop. This allows
taking advantage of wider CRC instructions. Note that crc32c-3way.S
already uses this same optimization too.
crc_kunit shows an improvement of about 25% for len=127.
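For illustration, a userspace-style sketch of the 4-2-1 step-down using the
SSE4.2 CRC32 intrinsics (the in-kernel code uses inline assembly; the helper
name below is illustrative):

  #include <stdint.h>
  #include <string.h>
  #include <nmmintrin.h>  /* SSE4.2 _mm_crc32_* intrinsics, build with -msse4.2 */

  /* Handle the 0..7 tail bytes with at most three CRC instructions
   * (4, then 2, then 1 bytes) instead of a byte-at-a-time loop. */
  static uint32_t crc32c_tail(uint32_t crc, const uint8_t *p, size_t len)
  {
          if (len & 4) {
                  uint32_t v;

                  memcpy(&v, p, sizeof(v));
                  crc = _mm_crc32_u32(crc, v);
                  p += 4;
          }
          if (len & 2) {
                  uint16_t v;

                  memcpy(&v, p, sizeof(v));
                  crc = _mm_crc32_u16(crc, v);
                  p += 2;
          }
          if (len & 1)
                  crc = _mm_crc32_u8(crc, *p);
          return crc;
  }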
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250304213216.108925-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Smatch warns about this whitespace damage:
arch/x86/lib/delay.c:134 delay_halt_mwaitx() warn: inconsistent indenting
Signed-off-by: Charles Han <hanchunchao@inspur.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250305063515.3951-1-hanchunchao@inspur.com
|
|
No functional change.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250303165246.2175811-6-brgerst@gmail.com
|
|
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
Replace X86_CMPXCHG64 with X86_CX8, as CX8 is the name of the CPUID
flag, thus making it consistent with X86_FEATURE_CX8 defined in
<asm/cpufeatures.h>.
No functional change intended.
Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250228082338.73859-2-xin@zytor.com
|
|
Add support for
CPUID Fn8000_0021_EAX[31] (SRSO_MSR_FIX). If this bit is 1, it
indicates that software may use MSR BP_CFG[BpSpecReduce] to mitigate
SRSO.
Enable BpSpecReduce to mitigate SRSO across guest/host boundaries.
Switch back to enabling the bit when virtualization is enabled and
clearing it when virtualization is disabled, because using an MSR slot
would clear the bit when the guest is exited, and any training the guest
has done could then potentially influence the host kernel when execution
enters the kernel but hasn't VMRUN the guest yet.
More detail is in the public thread in the Link below.
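A rough sketch of the toggling described above, using the kernel's
msr_set_bit()/msr_clear_bit() helpers; the MSR index and bit position below
are placeholders for illustration only, not the real BP_CFG definitions:

  #include <asm/msr.h>

  /* Placeholder values, not the real MSR/bit definitions. */
  #define BP_CFG_MSR_SKETCH             0xc0011000
  #define BP_SPEC_REDUCE_BIT_SKETCH     0

  static void bp_spec_reduce_enable(void)   /* when virtualization is enabled */
  {
          msr_set_bit(BP_CFG_MSR_SKETCH, BP_SPEC_REDUCE_BIT_SKETCH);
  }

  static void bp_spec_reduce_disable(void)  /* when virtualization is disabled */
  {
          msr_clear_bit(BP_CFG_MSR_SKETCH, BP_SPEC_REDUCE_BIT_SKETCH);
  }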
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241202120416.6054-1-bp@kernel.org
|
|
Add an array of code thunks, to be called from the FineIBT preamble,
clobbering the first 'n' argument registers for speculative execution.
Notably, the 0th entry will clobber no argument registers and will never
be used; it exists so the array can be naturally indexed. The 7th entry
will clobber all 6 argument registers and also RSP, in order to mess up
stack-based arguments.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Kees Cook <kees@kernel.org>
Link: https://lore.kernel.org/r/20250224124200.717378681@infradead.org
|
|
description
Use @addr instead of @vaddr in the kernel-doc comment for
clean_cache_range() to eliminate warnings:
arch/x86/lib/usercopy_64.c:29: warning: Function parameter or struct member 'addr' not described in 'clean_cache_range'
arch/x86/lib/usercopy_64.c:29: warning: Excess function parameter 'vaddr' description in 'clean_cache_range'
Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250111063333.911084-1-rdunlap@infradead.org
|
|
The assembly functions generated by crc-pclmul-template.S are called
only via static_call, so they do not need to begin with an endbr
instruction. But objtool still warns about a missing endbr by default.
Add ANNOTATE_NOENDBR to suppress these warnings:
vmlinux.o: warning: objtool: crc32_x86_init+0x1c0: relocation to !ENDBR: crc32_lsb_vpclmul_avx10_256+0x0
vmlinux.o: warning: objtool: crc64_x86_init+0x183: relocation to !ENDBR: crc64_msb_vpclmul_avx10_256+0x0
vmlinux.o: warning: objtool: crc_t10dif_x86_init+0x183: relocation to !ENDBR: crc16_msb_vpclmul_avx10_256+0x0
vmlinux.o: warning: objtool: __SCK__crc32_lsb_pclmul+0x0: data relocation to !ENDBR: crc32_lsb_pclmul_sse+0x0
vmlinux.o: warning: objtool: __SCK__crc64_lsb_pclmul+0x0: data relocation to !ENDBR: crc64_lsb_pclmul_sse+0x0
vmlinux.o: warning: objtool: __SCK__crc64_msb_pclmul+0x0: data relocation to !ENDBR: crc64_msb_pclmul_sse+0x0
vmlinux.o: warning: objtool: __SCK__crc16_msb_pclmul+0x0: data relocation to !ENDBR: crc16_msb_pclmul_sse+0x0
Fixes: 8d2d3e72e35b ("x86/crc: add "template" for [V]PCLMULQDQ based CRC functions")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Closes: https://lore.kernel.org/r/20250217170555.3d14df62@canb.auug.org.au/
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250217193230.100443-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
With the introduction of kCFI the addition of ENDBR to
SYM_FUNC_START* no longer suffices to make the function indirectly
callable. This now requires the use of SYM_TYPED_FUNC_START.
As such, remove the implicit ENDBR from SYM_FUNC_START* and add some
explicit annotations to fix things up again.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250207122546.409116003@infradead.org
|
|
The expectation is that all EXPORT'ed symbols are free to have their
address taken and called indirectly. The majority of the assembly
defined functions currently violate this expectation.
Make then all use SYM_TYPED_FUNC_START() in order to emit the proper
kCFI preamble.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Link: https://lore.kernel.org/r/20250207122546.302679189@infradead.org
|
|
crc32c_arch() is affected by
https://github.com/llvm/llvm-project/issues/20571 where clang
unnecessarily spills the inputs to "rm"-constrained operands to the
stack. Replace "rm" with ASM_INPUT_RM which partially works around this
by expanding to "r" when the compiler is clang. This results in better
code generation with clang, though still not optimal.
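For illustration, the pattern looks roughly like this (ASM_INPUT_RM is the
real macro from the kernel's compiler headers; the wrapper function and its
use of crc32q here are just a sketch, not the patch itself):

  #include <linux/compiler.h>   /* ASM_INPUT_RM: "rm" for GCC, "r" for clang */
  #include <linux/types.h>

  static inline u32 crc32c_u64_step(u32 crc, u64 data)
  {
          asm("crc32q %1, %q0" : "+r" (crc) : ASM_INPUT_RM (data));
          return crc;
  }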
Link: https://lore.kernel.org/r/20250210210741.471725-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Add x86_64 [V]PCLMULQDQ optimized implementations of crc64_be() and
crc64_nvme() by wiring them up to crc-pclmul-template.S.
crc64_be() is used by bcache and bcachefs, and crc64_nvme() is used by
blk-integrity. Both features can CRC large amounts of data, and the
developers of both features have expressed interest in having these CRCs
be optimized. So this optimization should be worthwhile. (See
https://lore.kernel.org/r/v36sousjd5ukqlkpdxslvpu7l37zbu7d7slgc2trjjqwty2bny@qgzew34feo2r
and
https://lore.kernel.org/r/20220222163144.1782447-11-kbusch@kernel.org)
Benchmark results on AMD Ryzen 9 9950X (Zen 5) using crc_kunit:
crc64_be:
Length Before After
------ ------ -----
1 633 MB/s 477 MB/s
16 717 MB/s 2517 MB/s
64 715 MB/s 7525 MB/s
127 714 MB/s 10002 MB/s
128 713 MB/s 13344 MB/s
200 715 MB/s 15752 MB/s
256 714 MB/s 22933 MB/s
511 715 MB/s 28025 MB/s
512 714 MB/s 49772 MB/s
1024 715 MB/s 65261 MB/s
3173 714 MB/s 78773 MB/s
4096 714 MB/s 83315 MB/s
16384 714 MB/s 89487 MB/s
crc64_nvme:
Length Before After
------ ------ -----
1 716 MB/s 474 MB/s
16 717 MB/s 3303 MB/s
64 713 MB/s 7940 MB/s
127 715 MB/s 9867 MB/s
128 714 MB/s 13698 MB/s
200 715 MB/s 15995 MB/s
256 714 MB/s 23479 MB/s
511 714 MB/s 28013 MB/s
512 715 MB/s 51533 MB/s
1024 715 MB/s 66788 MB/s
3173 715 MB/s 79182 MB/s
4096 715 MB/s 83966 MB/s
16384 715 MB/s 89739 MB/s
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250210174540.161705-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Instantiate crc-pclmul-template.S for crc_t10dif and delete the original
PCLMULQDQ optimized implementation. This has the following advantages:
- Less CRC-variant-specific code.
- VPCLMULQDQ support, greatly improving performance on sufficiently long
messages on newer CPUs.
- A faster reduction from 128 bits to the final CRC.
- Support for i386.
Benchmark results on AMD Ryzen 9 9950X (Zen 5) using crc_kunit:
Length Before After
------ ------ -----
1 440 MB/s 386 MB/s
16 1865 MB/s 2008 MB/s
64 4343 MB/s 6917 MB/s
127 5440 MB/s 8909 MB/s
128 5533 MB/s 12150 MB/s
200 5908 MB/s 14423 MB/s
256 15870 MB/s 21288 MB/s
511 14219 MB/s 25840 MB/s
512 18361 MB/s 37797 MB/s
1024 19941 MB/s 61374 MB/s
3173 20461 MB/s 74909 MB/s
4096 21310 MB/s 78919 MB/s
16384 21663 MB/s 85012 MB/s
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250210174540.161705-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Instantiate crc-pclmul-template.S for crc32_le, and delete the original
PCLMULQDQ optimized implementation. This has the following advantages:
- Less CRC-variant-specific code.
- VPCLMULQDQ support, greatly improving performance on sufficiently long
messages on newer CPUs.
- A faster reduction from 128 bits to the final CRC.
- Support for lengths not a multiple of 16 bytes, improving performance
for such lengths.
- Support for misaligned buffers, improving performance in such cases.
Benchmark results on AMD Ryzen 9 9950X (Zen 5) using crc_kunit:
Length Before After
------ ------ -----
1 427 MB/s 605 MB/s
16 710 MB/s 3631 MB/s
64 704 MB/s 7615 MB/s
127 3610 MB/s 9710 MB/s
128 8759 MB/s 12702 MB/s
200 7083 MB/s 15343 MB/s
256 17284 MB/s 22904 MB/s
511 10919 MB/s 27309 MB/s
512 19849 MB/s 48900 MB/s
1024 21216 MB/s 62630 MB/s
3173 22150 MB/s 72437 MB/s
4096 22496 MB/s 79593 MB/s
16384 22018 MB/s 85106 MB/s
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250210174540.161705-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
The Linux kernel implements many variants of CRC, such as crc16,
crc_t10dif, crc32_le, crc32c, crc32_be, crc64_nvme, and crc64_be. On
x86, except for crc32c which has special scalar instructions, the
fastest way to compute any of these CRCs on any message of length
roughly >= 16 bytes is to use the SIMD carryless multiplication
instructions PCLMULQDQ or VPCLMULQDQ. Depending on the available CPU
features this can mean PCLMULQDQ+SSE4.1, VPCLMULQDQ+AVX2,
VPCLMULQDQ+AVX10/256, or VPCLMULQDQ+AVX10/512 (or the AVX512 equivalents
to AVX10/*). This results in a total of 20+ CRC implementations being
potentially needed to properly optimize all CRCs that someone cares
about for x86. Besides crc32c, currently only crc32_le and crc_t10dif
are actually optimized for x86, and they only use PCLMULQDQ, which means
they can be 2-4x slower than what is possible with VPCLMULQDQ.
Fortunately, at a high level the code that is needed for any
[V]PCLMULQDQ based CRC implementation is mostly the same. Therefore,
this patch introduces an assembly macro that expands into the body of a
[V]PCLMULQDQ based CRC function for a given number of bits (8, 16, 32,
or 64), bit order (lsb or msb-first), vector length, and AVX level.
The function expects to be passed a constants table, specific to the
polynomial desired, that was generated by the script previously added.
When two CRC variants share the same number of bits and bit order, the
same functions can be reused, with only the constants table differing.
A new C header is also added to make it easy to integrate the new
assembly code using a static call.
The result is that it becomes straightforward to wire up an optimized
implementation of any CRC-8, CRC-16, CRC-32, or CRC-64 for x86. Later
patches will wire up specific CRC variants.
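As a rough sketch of how such a static call dispatch can look (the function
and key names below are illustrative, not necessarily what the new header
defines):

  #include <linux/static_call.h>
  #include <linux/types.h>
  #include <asm/cpufeature.h>

  /* Two variants generated from the assembly template (names illustrative). */
  u32 crc32_lsb_pclmul_sse(u32 crc, const u8 *p, size_t len);
  u32 crc32_lsb_vpclmul_avx512(u32 crc, const u8 *p, size_t len);

  DEFINE_STATIC_CALL(crc32_lsb_pclmul, crc32_lsb_pclmul_sse);

  static void crc32_choose_impl(void)
  {
          if (boot_cpu_has(X86_FEATURE_VPCLMULQDQ) &&
              boot_cpu_has(X86_FEATURE_AVX512VL))
                  static_call_update(crc32_lsb_pclmul, crc32_lsb_vpclmul_avx512);
  }

  static u32 crc32_le_arch_sketch(u32 crc, const u8 *p, size_t len)
  {
          /* Dispatches to the best variant chosen at boot. */
          return static_call(crc32_lsb_pclmul)(crc, p, len);
  }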
Although this new template allows easily generating many functions, care
was taken to still keep the binary size fairly low. Each generated
function is only 550 to 850 bytes depending on the CRC variant and
target CPU features. And only one function per CRC variant is actually
used at runtime (since all functions support all lengths >= 16 bytes).
Note that a similar approach should also work for other architectures
that have carryless multiplication instructions, such as arm64.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250210174540.161705-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
With the "crct10dif" algorithm having been removed from the crypto API,
crc_t10dif_is_optimized() is no longer used.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250208175647.12333-1-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Following the standardization on crc32c() as the lib entry point for the
Castagnoli CRC32 instead of the previous mix of crc32c(), crc32c_le(),
and __crc32c_le(), make the same change to the underlying base and arch
functions that implement it.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250208024911.14936-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull CRC updates from Eric Biggers:
- Reorganize the architecture-optimized CRC32 and CRC-T10DIF code to be
directly accessible via the library API, instead of requiring the
crypto API. This is much simpler and more efficient.
- Convert some users such as ext4 to use the CRC32 library API instead
of the crypto API. More conversions like this will come later.
- Add a KUnit test that tests and benchmarks multiple CRC variants.
Remove older, less-comprehensive tests that are made redundant by
this.
- Add an entry to MAINTAINERS for the kernel's CRC library code. I'm
volunteering to maintain it. I have additional cleanups and
optimizations planned for future cycles.
* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (31 commits)
MAINTAINERS: add entry for CRC library
powerpc/crc: delete obsolete crc-vpmsum_test.c
lib/crc32test: delete obsolete crc32test.c
lib/crc16_kunit: delete obsolete crc16_kunit.c
lib/crc_kunit.c: add KUnit test suite for CRC library functions
powerpc/crc-t10dif: expose CRC-T10DIF function through lib
arm64/crc-t10dif: expose CRC-T10DIF function through lib
arm/crc-t10dif: expose CRC-T10DIF function through lib
x86/crc-t10dif: expose CRC-T10DIF function through lib
crypto: crct10dif - expose arch-optimized lib function
lib/crc-t10dif: add support for arch overrides
lib/crc-t10dif: stop wrapping the crypto API
scsi: target: iscsi: switch to using the crc32c library
f2fs: switch to using the crc32 library
jbd2: switch to using the crc32c library
ext4: switch to using the crc32c library
lib/crc32: make crc32c() go directly to lib
bcachefs: Explicitly select CRYPTO from BCACHEFS_FS
x86/crc32: expose CRC32 functions through lib
x86/crc32: update prototype for crc32_pclmul_le_16()
...
|
|
This was a suggestion by David Laight, and while I was slightly worried
that some micro-architecture would predict cmov like a conditional
branch, there is little reason to actually believe any core would be
that broken.
Intel documents that their existing cores treat CMOVcc as a data
dependency that will constrain speculation in their "Speculative
Execution Side Channel Mitigations" whitepaper:
"Other instructions such as CMOVcc, AND, ADC, SBB and SETcc can also
be used to prevent bounds check bypass by constraining speculative
execution on current family 6 processors (Intel® Core™, Intel® Atom™,
Intel® Xeon® and Intel® Xeon Phi™ processors)"
and while that leaves the future uarch issues open, that's certainly
true of our traditional SBB usage too.
Any core that predicts CMOV will be unusable for various crypto
algorithms that need data-independent timing stability, so let's just
treat CMOV as the safe choice that simplifies the address masking by
avoiding an extra instruction and doesn't need a temporary register.
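A sketch of what the CMOV-based masking looks like (modeled on the kernel's
mask_user_address(); the 'user_ptr_max' parameter stands in for the real
runtime limit and the function name is illustrative):

  /*
   * If the address is above the user-space limit, CMOVA replaces it with
   * the limit itself, so the result never speculatively points into
   * kernel space.
   */
  static inline unsigned long mask_user_address_sketch(unsigned long addr,
                                                       unsigned long user_ptr_max)
  {
          asm("cmp %1, %0\n\t"
              "cmova %1, %0"
              : "+r" (addr)
              : "r" (user_ptr_max));
          return addr;
  }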
Suggested-by: David Laight <David.Laight@aculab.com>
Link: https://www.intel.com/content/dam/develop/external/us/en/documents/336996-speculative-execution-side-channel-mitigations.pdf
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Move the x86 CRC-T10DIF assembly code into the lib directory and wire it
up to the library interface. This allows it to be used without going
through the crypto API. It remains usable via the crypto API too via
the shash algorithms that use the library interface. Thus all the
arch-specific "shash" code becomes unnecessary and is removed.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20241202012056.209768-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
Move the x86 CRC32 assembly code into the lib directory and wire it up
to the library interface. This allows it to be used without going
through the crypto API. It remains usable via the crypto API too via
the shash algorithms that use the library interface. Thus all the
arch-specific "shash" code becomes unnecessary and is removed.
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20241202010844.144356-14-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
|
|
It turns out that AMD has a "Meltdown Lite(tm)" issue with non-canonical
accesses in kernel space. And so using just the high bit to decide
whether an access is in user space or kernel space ends up with the good
old "leak speculative data" if you have the right gadget using the
result:
CVE-2020-12965 “Transient Execution of Non-Canonical Accesses”
Now, the kernel surrounds the access with a STAC/CLAC pair, and those
instructions end up serializing execution on older Zen architectures,
which closes the speculation window.
But that was true only up until Zen 5, which renames the AC bit [1].
That improves performance of STAC/CLAC a lot, but also means that the
speculation window is now open.
Note that this affects not just the new address masking, but also the
regular valid_user_address() check used by access_ok(), and the asm
version of the sign bit check in the get_user() helpers.
It does not affect put_user() or clear_user() variants, since there's no
speculative result to be used in a gadget for those operations.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Link: https://lore.kernel.org/all/80d94591-1297-4afb-b510-c665efd37f10@citrix.com/
Link: https://lore.kernel.org/all/20241023094448.GAZxjFkEOOF_DM83TQ@fat_crate.local/ [1]
Link: https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1010.html
Link: https://arxiv.org/pdf/2108.10771
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> # LAM case
Fixes: 2865baf54077 ("x86: support user address masking instead of non-speculative conditional")
Fixes: 6014bc27561f ("x86-64: make access_ok() independent of LAM")
Fixes: b19b74bc99b1 ("x86/mm: Rework address range check in get_user() and put_user()")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.
auto-generated by the following:
for i in `git grep -l -w asm/unaligned.h`; do
        sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
        sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
|
|
Merge all pending locking commits into a single branch.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
|
|
While zeroing the upper 32 bits of an 8-byte getuser on 32-bit x86 was
fixed by commit 8c860ed825cb ("x86/uaccess: Fix missed zeroing of ia32 u64
get_user() range checking") it was broken again in commit 8a2462df1547
("x86/uaccess: Improve the 8-byte getuser() case").
This is because the register which holds the upper 32 bits (%ecx) is being
cleared _after_ the check_range, so if the range check fails, %ecx is never
cleared.
This can be reproduced with:
./tools/testing/kunit/kunit.py run --arch i386 usercopy
Instead, clear %ecx _before_ check_range in the 8-byte case. This
reintroduces a bit of the ugliness we were trying to avoid by adding
another #ifndef CONFIG_X86_64, but at least keeps check_range from needing
a separate bad_get_user_8 jump.
Fixes: 8a2462df1547 ("x86/uaccess: Improve the 8-byte getuser() case")
Signed-off-by: David Gow <davidgow@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/all/20240731073031.4045579-1-davidgow@google.com
|
|
Commit in Fixes was added as a catch-all for cases where the cmdline is
parsed before being merged with the builtin one.
And promptly one issue appeared, see Link below. The microcode loader
really needs to parse it that early, but the merging happens later.
Reshuffling the early boot nightmare^W code to handle that properly would
be a painful exercise for another day so do the chicken thing and parse the
builtin cmdline too before it has been merged.
Fixes: 0c40b1c7a897 ("x86/setup: Warn when option parsing is done too early")
Reported-by: Mike Lothian <mike@fireburn.co.uk>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20240730152108.GAZqkE5Dfi9AuKllRw@fat_crate.local
Link: https://lore.kernel.org/r/20240722152330.GCZp55ck8E_FT4kPnC@fat_crate.local
|
|
As described in commit:
e73c4e34a0e9 ("locking/atomic/x86: Introduce arch_atomic64_read_nonatomic() to x86_32")
the value preload before the CMPXCHG loop does not need to be atomic.
Introduce the read64_nonatomic assembly macro to load the value from an
atomic_t location in a faster, non-atomic way and use it in
atomic64_cx8_32.S.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/r/20240605181424.3228-1-ubizjak@gmail.com
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:
- Make error checking of AMD SMN accesses more robust in the callers as
they're the only ones who can interpret the results properly
- The usual cleanups and fixes, left and right
* tag 'x86_misc_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kmsan: Fix hook for unaligned accesses
x86/platform/iosf_mbi: Convert PCIBIOS_* return codes to errnos
x86/pci/xen: Fix PCIBIOS_* return code handling
x86/pci/intel_mid_pci: Fix PCIBIOS_* return code handling
x86/of: Return consistent error type from x86_of_pci_irq_enable()
hwmon: (k10temp) Rename _data variable
hwmon: (k10temp) Remove unused HAVE_TDIE() macro
hwmon: (k10temp) Reduce k10temp_get_ccd_support() parameters
hwmon: (k10temp) Define a helper function to read CCD temperature
x86/amd_nb: Enhance SMN access error checking
hwmon: (k10temp) Check return value of amd_smn_read()
EDAC/amd64: Check return value of amd_smn_read()
EDAC/amd64: Remove unused register accesses
tools/x86/kcpuid: Add missing dir via Makefile
x86, arm: Add missing license tag to syscall tables files
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 uaccess update from Borislav Petkov:
- Cleanup the 8-byte getuser() asm case
* tag 'x86_core_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/uaccess: Improve the 8-byte getuser() case
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot updates from Borislav Petkov:
- Add a check to warn when cmdline parsing happens before the final
cmdline string has been built and thus arguments can get lost
- Code cleanups and simplifications
* tag 'x86_boot_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/setup: Warn when option parsing is done too early
x86/boot: Clean up the arch/x86/boot/main.c code a bit
x86/boot: Use current_stack_pointer to avoid asm() in init_heap()
|
|
When called with a 'from' that is not 4-byte-aligned, string_memcpy_fromio()
calls the movs() macro to copy the first few bytes, so that 'from' becomes
4-byte-aligned before calling rep_movs(). This movs() macro modifies 'to', and
the subsequent line modifies 'n'.
As a result, on unaligned accesses, kmsan_unpoison_memory() uses the updated
(aligned) values of 'to' and 'n'. Hence, it does not unpoison the entire
region.
Save the original values of 'to' and 'n', and pass those to
kmsan_unpoison_memory(), so that the entire region is unpoisoned.
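A sketch of the resulting structure (the function lives in the x86 iomem
helpers; the body below is abbreviated to the relevant parts):

  #include <linux/kmsan-checks.h>
  #include <linux/types.h>

  static void string_memcpy_fromio(void *to, const volatile void __iomem *from,
                                   size_t n)
  {
          void *orig_to = to;       /* save before movs() advances 'to' */
          const size_t orig_n = n;  /* save before the alignment step shrinks 'n' */

          if (!n)
                  return;

          /* ... unaligned head copy via movs() and bulk rep_movs() as before,
           *     which modify 'to', 'from' and 'n' ... */

          kmsan_unpoison_memory(orig_to, orig_n);  /* unpoison the whole region */
  }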
Signed-off-by: Brian Johannesmeyer <bjohannesmeyer@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Alexander Potapenko <glider@google.com>
Link: https://lore.kernel.org/r/20240523215029.4160518-1-bjohannesmeyer@gmail.com
|
|
Streamline the 8-byte case and drop the special handling. Use a macro
which hides the exception handling.
No functional changes.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/CAHk-=whYb2L_atsRk9pBiFiVLGe5wNZLHhRinA69yu6FiKvDsw@mail.gmail.com
|
|
When reworking the range checking for get_user(), the get_user_8() case
on 32-bit wasn't zeroing the high register. (The jump to bad_get_user_8
was accidentally dropped.) Restore the correct error handling
destination (and rename the jump to use the expected ".L" prefix).
While here, switch to using a named argument ("size") for the call
template ("%c4" to "%c[size]") as already used in the other call
templates in this file.
Found after moving the usercopy selftests to KUnit:
# usercopy_test_invalid: EXPECTATION FAILED at
lib/usercopy_kunit.c:278
Expected val_u64 == 0, but
val_u64 == -60129542144 (0xfffffff200000000)
Closes: https://lore.kernel.org/all/CABVgOSn=tb=Lj9SxHuT4_9MTjjKVxsq-ikdXC4kGHO4CfKVmGQ@mail.gmail.com
Fixes: b19b74bc99b1 ("x86/mm: Rework address range check in get_user() and put_user()")
Reported-by: David Gow <davidgow@google.com>
Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Tested-by: David Gow <davidgow@google.com>
Link: https://lore.kernel.org/all/20240610210213.work.143-kees%40kernel.org
|
|
Commit
4faa0e5d6d79 ("x86/boot: Move kernel cmdline setup earlier in the boot process (again)")
fixed an issue where cmdline parsing would happen before the final
boot_command_line string has been built from the builtin and boot
cmdlines and thus cmdline arguments would get lost.
Add a check to catch any future wrong use ordering so that such issues
can be caught in time.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240409152541.GCZhVd9XIPXyTNd9vc@fat_crate.local
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-mm updates from Andrew Morton:
"Mainly singleton patches, documented in their respective changelogs.
Notable series include:
- Some maintenance and performance work for ocfs2 in Heming Zhao's
series "improve write IO performance when fragmentation is high".
- Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes
exposed by fstests".
- kfifo header rework from Andy Shevchenko in the series "kfifo:
Clean up kfifo.h".
- GDB script fixes from Florian Rommel in the series "scripts/gdb:
Fixes for $lx_current and $lx_per_cpu".
- After much discussion, a coding-style update from Barry Song
explaining one reason why inline functions are preferred over
macros. The series is "codingstyle: avoid unused parameters for a
function-like macro""
* tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (62 commits)
fs/proc: fix softlockup in __read_vmcore
nilfs2: convert BUG_ON() in nilfs_finish_roll_forward() to WARN_ON()
scripts: checkpatch: check unused parameters for function-like macro
Documentation: coding-style: ask function-like macros to evaluate parameters
nilfs2: use __field_struct() for a bitwise field
selftests/kcmp: remove unused open mode
nilfs2: remove calls to folio_set_error() and folio_clear_error()
kernel/watchdog_perf.c: tidy up kerneldoc
watchdog: allow nmi watchdog to use raw perf event
watchdog: handle comma separated nmi_watchdog command line
nilfs2: make superblock data array index computation sparse friendly
squashfs: remove calls to set the folio error flag
squashfs: convert squashfs_symlink_read_folio to use folio APIs
scripts/gdb: fix detection of current CPU in KGDB
scripts/gdb: make get_thread_info accept pointers
scripts/gdb: fix parameter handling in $lx_per_cpu
scripts/gdb: fix failing KGDB detection during probe
kfifo: don't use "proxy" headers
media: stih-cec: add missing io.h
media: rc: add missing io.h
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event updates from Ingo Molnar:
- Extend the x86 instruction decoder with APX and
other new instructions
- Misc cleanups
* tag 'perf-urgent-2024-05-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/cstate: Remove unused 'struct perf_cstate_msr'
perf/x86/rapl: Rename 'maxdie' to nr_rapl_pmu and 'dieid' to rapl_pmu_idx
x86/insn: Add support for APX EVEX instructions to the opcode map
x86/insn: Add support for APX EVEX to the instruction decoder logic
x86/insn: x86/insn: Add support for REX2 prefix to the instruction decoder opcode map
x86/insn: Add support for REX2 prefix to the instruction decoder logic
x86/insn: Add misc new Intel instructions
x86/insn: Add VEX versions of VPDPBUSD, VPDPBUSDS, VPDPWSSD and VPDPWSSDS
x86/insn: Fix PUSH instruction in x86 instruction decoder opcode map
x86/insn: Add Key Locker instructions to the opcode map
|
|
Pull rdma updates from Jason Gunthorpe:
"Aside from the usual things this has an arch update for
__iowrite64_copy() used by the RDMA drivers.
This API was intended to generate large 64 byte MemWr TLPs on PCI.
These days most processors have done this by just repeating writel() in
a loop. S390 and some new ARM64 designs require a special helper to
get this to be generated.
- Small improvements and fixes for erdma, efa, hfi1, bnxt_re
- Fix a UAF crash after module unload on leaking restrack entry
- Continue adding full RDMA support in mana with support for EQs,
GID's and CQs
- Improvements to the mkey cache in mlx5
- DSCP traffic class support in hns and several bug fixes
- Cap the maximum number of MADs in the receive queue to avoid OOM
- Another batch of rxe bug fixes from large scale testing
- __iowrite64_copy() optimizations for write combining MMIO memory
- Remove NULL checks before dev_put/hold()
- EFA support for receive with immediate
- Fix a recent memleaking regression in a cma error path"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (70 commits)
RDMA/cma: Fix kmemleak in rdma_core observed during blktests nvme/rdma use siw
RDMA/IPoIB: Fix format truncation compilation errors
bnxt_re: avoid shift undefined behavior in bnxt_qplib_alloc_init_hwq
RDMA/efa: Support QP with unsolicited write w/ imm. receive
IB/hfi1: Remove generic .ndo_get_stats64
IB/hfi1: Do not use custom stat allocator
RDMA/hfi1: Use RMW accessors for changing LNKCTL2
RDMA/mana_ib: implement uapi for creation of rnic cq
RDMA/mana_ib: boundary check before installing cq callbacks
RDMA/mana_ib: introduce a helper to remove cq callbacks
RDMA/mana_ib: create and destroy RNIC cqs
RDMA/mana_ib: create EQs for RNIC CQs
RDMA/core: Remove NULL check before dev_{put, hold}
RDMA/ipoib: Remove NULL check before dev_{put, hold}
RDMA/mlx5: Remove NULL check before dev_{put, hold}
RDMA/mlx5: Track DCT, DCI and REG_UMR QPs as diver_detail resources.
RDMA/core: Add an option to display driver-specific QPs in the rdmatool
RDMA/efa: Add shutdown notifier
RDMA/mana_ib: Fix missing ret value
IB/mlx5: Use __iowrite64_copy() for write combining stores
...
|
|
To support APX functionality, the EVEX prefix is used to:
- promote legacy instructions
- promote VEX instructions
- add new instructions
Promoted VEX instructions require no extra annotation because the opcodes
do not change and the permissive nature of the instruction decoder already
allows them to have an EVEX prefix.
Promoted legacy instructions and new instructions are placed in map 4 which
has not been used before.
Create a new table for map 4 and add APX instructions.
Annotate SCALABLE instructions with "(es)" - refer to patch "x86/insn: Add
support for APX EVEX to the instruction decoder logic". SCALABLE
instructions must be represented in both no-prefix (NP) and 66 prefix
forms.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240502105853.5338-9-adrian.hunter@intel.com
|
|
Intel Advanced Performance Extensions (APX) extends the EVEX prefix to
support:
- extended general purpose registers (EGPRs) i.e. r16 to r31
- Push-Pop Acceleration (PPX) hints
- new data destination (NDD) register
- suppress status flags writes (NF) of common instructions
- new instructions
Refer to the Intel Advanced Performance Extensions (Intel APX) Architecture
Specification for details.
The extended EVEX prefix does not need amended instruction decoder logic,
except in one area. Some instructions are defined as SCALABLE which means
the EVEX.W bit and EVEX.pp bits are used to determine operand size.
Specifically, if an instruction is SCALABLE and EVEX.W is zero, then
EVEX.pp value 0 (representing no prefix NP) means default operand size,
whereas EVEX.pp value 1 (representing 66 prefix) means operand size
override, i.e. 16 bits.
Add an attribute (INAT_EVEX_SCALABLE) to identify such instructions, and
amend the logic appropriately.
Amend the awk script that generates the attribute tables from the opcode
map, to recognise "(es)" as attribute INAT_EVEX_SCALABLE.
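Expressed as a C sketch, the operand-size decision for SCALABLE instructions
is roughly the following (the function and parameter names are illustrative,
not the decoder's actual ones):

  /* Returns the operand size in bits for a SCALABLE EVEX instruction. */
  static int scalable_operand_bits(unsigned int evex_w, unsigned int evex_pp)
  {
          if (evex_w)
                  return 64;      /* EVEX.W = 1: 64-bit operand size */
          if (evex_pp == 1)
                  return 16;      /* 66 prefix encoded: operand size override */
          return 32;              /* no prefix (NP): default operand size */
  }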
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240502105853.5338-8-adrian.hunter@intel.com
|
|
opcode map
Support for REX2 has been added to the instruction decoder logic and the
awk script that generates the attribute tables from the opcode map.
Add REX2 prefix byte (0xD5) to the opcode map.
Add annotation (!REX2) for map 0/1 opcodes that are reserved under REX2.
Add JMPABS to the opcode map and add annotation (REX2) to identify that it
has a mandatory REX2 prefix. A separate opcode attribute table is not
needed at this time because JMPABS has the same attribute encoding as the
MOV instruction that it shares an opcode with i.e. INAT_MOFFSET.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240502105853.5338-7-adrian.hunter@intel.com
|
|
Intel Advanced Performance Extensions (APX) uses a new 2-byte prefix named
REX2 to select extended general purpose registers (EGPRs) i.e. r16 to r31.
The REX2 prefix is effectively an extended version of the REX prefix.
REX2 and EVEX are also used with PUSH/POP instructions to provide a
Push-Pop Acceleration (PPX) hint. With PPX hints, a CPU will attempt to
fast-forward register data between matching PUSH and POP instructions.
REX2 is valid only with opcodes in maps 0 and 1. Similar extension for
other maps is provided by the EVEX prefix, covered in a separate patch.
Some opcodes in maps 0 and 1 are reserved under REX2. One of these is used
for a new 64-bit absolute direct jump instruction JMPABS.
Refer to the Intel Advanced Performance Extensions (Intel APX) Architecture
Specification for details.
Define a code value for the REX2 prefix (INAT_PFX_REX2), and add attribute
flags for opcodes reserved under REX2 (INAT_NO_REX2) and to identify
opcodes (only JMPABS) that require a mandatory REX2 prefix
(INAT_REX2_VARIANT).
Amend logic to read the REX2 prefix and get the opcode attribute for the
map number (0 or 1) encoded in the REX2 prefix.
Amend the awk script that generates the attribute tables from the opcode
map, to recognise "REX2" as attribute INAT_PFX_REX2, and "(!REX2)"
as attribute INAT_NO_REX2, and "(REX2)" as attribute INAT_REX2_VARIANT.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240502105853.5338-6-adrian.hunter@intel.com
|
|
The x86 instruction decoder is used not only for decoding kernel
instructions. It is also used by perf uprobes (user space probes) and by
perf tools Intel Processor Trace decoding. Consequently, it needs to
support instructions executed by user space also.
Add instructions documented in Intel Architecture Instruction Set
Extensions and Future Features Programming Reference March 2024
319433-052, that have not been added yet:
AADD
AAND
AOR
AXOR
CMPccXADD
PBNDKB
RDMSRLIST
URDMSR
UWRMSR
VBCSTNEBF162PS
VBCSTNESH2PS
VCVTNEEBF162PS
VCVTNEEPH2PS
VCVTNEOBF162PS
VCVTNEOPH2PS
VCVTNEPS2BF16
VPDPB[SU,UU,SS]D[,S]
VPDPW[SU,US,UU]D[,S]
VPMADD52HUQ
VPMADD52LUQ
VSHA512MSG1
VSHA512MSG2
VSHA512RNDS2
VSM3MSG1
VSM3MSG2
VSM3RNDS2
VSM4KEY4
VSM4RNDS4
WRMSRLIST
TCMMIMFP16PS
TCMMRLFP16PS
TDPFP16PS
PREFETCHIT1
PREFETCHIT0
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20240502105853.5338-5-adrian.hunter@intel.com
|