x86/crc: add "template" for [V]PCLMULQDQ based CRC functions - kernel/linux.git

diff options

author	Eric Biggers <ebiggers@google.com>	2025-02-10 20:26:44 +0300
committer	Eric Biggers <ebiggers@google.com>	2025-02-10 20:49:23 +0300
commit	8d2d3e72e35b7de26746ef18dc8ad5f89a2b2e25 (patch)
tree	13cb3c7ffbfe2fa38c9097683d0298e49613ec10 /scripts/generate_rust_analyzer.py
parent	31c89102cf39fb7a171283f35c41f57bf422a4df (diff)
download	linux-8d2d3e72e35b7de26746ef18dc8ad5f89a2b2e25.tar.xz

x86/crc: add "template" for [V]PCLMULQDQ based CRC functions

The Linux kernel implements many variants of CRC, such as crc16, crc_t10dif, crc32_le, crc32c, crc32_be, crc64_nvme, and crc64_be. On x86, except for crc32c which has special scalar instructions, the fastest way to compute any of these CRCs on any message of length roughly >= 16 bytes is to use the SIMD carryless multiplication instructions PCLMULQDQ or VPCLMULQDQ. Depending on the available CPU features this can mean PCLMULQDQ+SSE4.1, VPCLMULQDQ+AVX2, VPCLMULQDQ+AVX10/256, or VPCLMULQDQ+AVX10/512 (or the AVX512 equivalents to AVX10/*). This results in a total of 20+ CRC implementations being potentially needed to properly optimize all CRCs that someone cares about for x86. Besides crc32c, currently only crc32_le and crc_t10dif are actually optimized for x86, and they only use PCLMULQDQ, which means they can be 2-4x slower than what is possible with VPCLMULQDQ. Fortunately, at a high level the code that is needed for any [V]PCLMULQDQ based CRC implementation is mostly the same. Therefore, this patch introduces an assembly macro that expands into the body of a [V]PCLMULQDQ based CRC function for a given number of bits (8, 16, 32, or 64), bit order (lsb or msb-first), vector length, and AVX level. The function expects to be passed a constants table, specific to the polynomial desired, that was generated by the script previously added. When two CRC variants share the same number of bits and bit order, the same functions can be reused, with only the constants table differing. A new C header is also added to make it easy to integrate the new assembly code using a static call. The result is that it becomes straightforward to wire up an optimized implementation of any CRC-8, CRC-16, CRC-32, or CRC-64 for x86. Later patches will wire up specific CRC variants. Although this new template allows easily generating many functions, care was taken to still keep the binary size fairly low. Each generated function is only 550 to 850 bytes depending on the CRC variant and target CPU features. And only one function per CRC variant is actually used at runtime (since all functions support all lengths >= 16 bytes). Note that a similar approach should also work for other architectures that have carryless multiplication instructions, such as arm64. Acked-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20250210174540.161705-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>

Diffstat (limited to 'scripts/generate_rust_analyzer.py')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: