crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply - kernel/linux.git

author	Ard Biesheuvel <ardb@kernel.org>	2024-11-05 19:09:02 +0300
committer	Herbert Xu <herbert@gondor.apana.org.au>	2024-11-15 14:52:51 +0300
commit	67dfb1b73f423622a0096ea43fb1f5b7336f49e0 (patch)
tree	4dfecc1dab5aeb4742415d20f3443d2c3e0c1534 /lib/crypto/mpi/mpi-bit.c
parent	7048c21e6b50e4dec0de1ed48b12db50b94b3f57 (diff)
download	linux-67dfb1b73f423622a0096ea43fb1f5b7336f49e0.tar.xz

crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply

The CRC-T10DIF implementation for arm64 has a version that uses 8x8 polynomial multiplication, for cores that lack the crypto extensions, which cover the 64x64 polynomial multiplication instruction that the algorithm was built around.

This fallback version rather naively adopted the 64x64 polynomial multiplication algorithm that I ported from ARM for the GHASH driver, which needs 8 PMULL8 instructions to implement one PMULL64. This is reasonable, given that each 8-bit vector element needs to be multiplied with each element in the other vector, producing 8 vectors with partial results that need to be combined to yield the correct result.

However, most PMULL64 invocations in the CRC-T10DIF code involve multiplication by a pair of 16-bit folding coefficients, and so all the partial results from higher order bytes will be zero, and there is no need to calculate them to begin with.

Then, the CRC-T10DIF algorithm always XORs the output values of the PMULL64 instructions being issued in pairs, and so there is no need to faithfully implement each individual PMULL64 instruction, as long as XORing the results pairwise produces the expected result.

Implementing these improvements results in a speedup of 3.3x on low-end platforms such as Raspberry Pi 4 (Cortex-A72).

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
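The two observations in the message can be checked with a small scalar model. The sketch below is not the arm64 assembly from this patch: it models individual byte pairs rather than NEON vector lanes, the helper names clmul and clmul64_from_bytes are hypothetical, and the data words and 16-bit coefficients are arbitrary values picked for illustration, not the real CRC-T10DIF folding constants. It only aims to show that skipping the upper six bytes of a 16-bit multiplier leaves the carry-less product unchanged, and therefore leaves the XOR of a pair of products unchanged too.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) product of two non-negative integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def clmul64_from_bytes(a: int, b: int, b_bytes: int = 8) -> int:
    """Build a 64x64 carry-less product from 8x8 partial products.

    Only the lowest `b_bytes` bytes of `b` are visited, so the caller
    must guarantee that the remaining bytes of `b` are zero.
    """
    result = 0
    for i in range(8):                      # bytes of a
        a_byte = (a >> (8 * i)) & 0xFF
        for j in range(b_bytes):            # bytes of b actually visited
            b_byte = (b >> (8 * j)) & 0xFF
            # Each 8x8 partial product lands at byte offset i + j.
            result ^= clmul(a_byte, b_byte) << (8 * (i + j))
    return result


# Arbitrary illustration values (assumptions, not the kernel's constants).
data_lo, data_hi = 0x0123456789ABCDEF, 0xFEDCBA9876543210
coeff_lo, coeff_hi = 0x1234, 0xBEEF        # both fit in 16 bits

# All 64 byte pairs per product versus only the 16 that can be non-zero.
full = clmul64_from_bytes(data_lo, coeff_lo) ^ clmul64_from_bytes(data_hi, coeff_hi)
short = (clmul64_from_bytes(data_lo, coeff_lo, 2)
         ^ clmul64_from_bytes(data_hi, coeff_hi, 2))

# Only the XOR of the pair has to match the reference result.
assert full == short == clmul(data_lo, coeff_lo) ^ clmul(data_hi, coeff_hi)
```

In this model the shortened variant computes 16 of the 64 byte-pair products per multiplication, and only the combined XOR of the two products is compared against the reference. How that saving maps onto actual instruction counts depends on the vector code in the patch, which this sketch does not reproduce.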


