diff options
author | Eric Biggers <ebiggers@google.com> | 2025-02-10 20:26:44 +0300 |
---|---|---|
committer | Eric Biggers <ebiggers@google.com> | 2025-02-10 20:49:23 +0300 |
commit | 8d2d3e72e35b7de26746ef18dc8ad5f89a2b2e25 (patch) | |
tree | 13cb3c7ffbfe2fa38c9097683d0298e49613ec10 /scripts/generate_rust_analyzer.py | |
parent | 31c89102cf39fb7a171283f35c41f57bf422a4df (diff) | |
download | linux-8d2d3e72e35b7de26746ef18dc8ad5f89a2b2e25.tar.xz |
x86/crc: add "template" for [V]PCLMULQDQ based CRC functions
The Linux kernel implements many variants of CRC, such as crc16,
crc_t10dif, crc32_le, crc32c, crc32_be, crc64_nvme, and crc64_be. On
x86, except for crc32c which has special scalar instructions, the
fastest way to compute any of these CRCs on any message of length
roughly >= 16 bytes is to use the SIMD carryless multiplication
instructions PCLMULQDQ or VPCLMULQDQ. Depending on the available CPU
features this can mean PCLMULQDQ+SSE4.1, VPCLMULQDQ+AVX2,
VPCLMULQDQ+AVX10/256, or VPCLMULQDQ+AVX10/512 (or the AVX512 equivalents
to AVX10/*). This results in a total of 20+ CRC implementations being
potentially needed to properly optimize all CRCs that someone cares
about for x86. Besides crc32c, currently only crc32_le and crc_t10dif
are actually optimized for x86, and they only use PCLMULQDQ, which means
they can be 2-4x slower than what is possible with VPCLMULQDQ.
Fortunately, at a high level the code that is needed for any
[V]PCLMULQDQ based CRC implementation is mostly the same. Therefore,
this patch introduces an assembly macro that expands into the body of a
[V]PCLMULQDQ based CRC function for a given number of bits (8, 16, 32,
or 64), bit order (lsb or msb-first), vector length, and AVX level.
The function expects to be passed a constants table, specific to the
polynomial desired, that was generated by the script previously added.
When two CRC variants share the same number of bits and bit order, the
same functions can be reused, with only the constants table differing.
A new C header is also added to make it easy to integrate the new
assembly code using a static call.
The result is that it becomes straightforward to wire up an optimized
implementation of any CRC-8, CRC-16, CRC-32, or CRC-64 for x86. Later
patches will wire up specific CRC variants.
Although this new template allows easily generating many functions, care
was taken to still keep the binary size fairly low. Each generated
function is only 550 to 850 bytes depending on the CRC variant and
target CPU features. And only one function per CRC variant is actually
used at runtime (since all functions support all lengths >= 16 bytes).
Note that a similar approach should also work for other architectures
that have carryless multiplication instructions, such as arm64.
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250210174540.161705-4-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Diffstat (limited to 'scripts/generate_rust_analyzer.py')
0 files changed, 0 insertions, 0 deletions