crypto: x86/aes-ctr - rewrite AESNI+AVX optimized CTR and add VAES support - kernel/linux.git

author	Eric Biggers <ebiggers@google.com>	2025-02-10 19:50:20 +0300
committer	Herbert Xu <herbert@gondor.apana.org.au>	2025-02-22 10:56:03 +0300
commit	8c4fc9ce402cda3132273ea8a9ee4e0302296762 (patch)
tree	f286a64136d4f2282f1ee584ae54f98ace3f4ecf /include/linux
parent	77cb2f63ad6c546267b2fcb428cf9fceedef279a (diff)
download	linux-8c4fc9ce402cda3132273ea8a9ee4e0302296762.tar.xz

crypto: x86/aes-ctr - rewrite AESNI+AVX optimized CTR and add VAES support

Delete aes_ctrby8_avx-x86_64.S and add a new assembly file aes-ctr-avx-x86_64.S which follows a similar approach to aes-xts-avx-x86_64.S in that it uses a "template" to provide AESNI+AVX, VAES+AVX2, VAES+AVX10/256, and VAES+AVX10/512 code, instead of just AESNI+AVX. Wire it up to the crypto API accordingly.

This greatly improves the performance of AES-CTR and AES-XCTR on VAES-capable CPUs, with the best case being AMD Zen 5 where an over 230% increase in throughput is seen on long messages. Performance on non-VAES-capable CPUs remains about the same, and the non-AVX AES-CTR code (aesni_ctr_enc) is also kept as-is for now. There are some slight regressions (less than 10%) on some short message lengths on some CPUs; these are difficult to avoid, given how the previous code was so heavily unrolled by message length, and they are not particularly important. Detailed performance results are given in the tables below.

Both CTR and XCTR support is retained. The main loop remains 8-vector-wide, which differs from the 4-vector-wide main loops that are used in the XTS and GCM code. A wider loop is appropriate for CTR and XCTR since they have fewer other instructions (such as vpclmulqdq) to interleave with the AES instructions.

Similar to what was the case for AES-GCM, the new assembly code also has a much smaller binary size, as it fixes the excessive unrolling by data length and key length present in the old code. Specifically, the new assembly file compiles to about 9 KB of text vs. 28 KB for the old file. This is despite 4x as many implementations being included.

The tables below show the detailed performance results. The tables show percentage improvement in single-threaded throughput for repeated encryption of the given message length; an increase from 6000 MB/s to 12000 MB/s would be listed as 100%. They were collected by directly measuring the Linux crypto API performance using a custom kernel module. The tested CPUs were all server processors from Google Compute Engine except for Zen 5 which was a Ryzen 9 9950X desktop processor.

Table 1: AES-256-CTR throughput improvement, CPU microarchitecture vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
AMD Zen 5            |  232% |  203% |  212% |  143% |   71% |   95% |
Intel Emerald Rapids |  116% |  116% |  117% |   91% |   78% |   79% |
Intel Ice Lake       |  109% |  103% |  107% |   81% |   54% |   56% |
AMD Zen 4            |  109% |   91% |  100% |   70% |   43% |   59% |
AMD Zen 3            |   92% |   78% |   87% |   57% |   32% |   43% |
AMD Zen 2            |    9% |    8% |   14% |   12% |    8% |   21% |
Intel Skylake        |    7% |    7% |    8% |    5% |    3% |    8% |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
AMD Zen 5            |   57% |   39% |   -9% |    7% |   -7% |
Intel Emerald Rapids |   37% |   42% |   -0% |   13% |   -8% |
Intel Ice Lake       |   39% |   30% |   -1% |   14% |   -9% |
AMD Zen 4            |   42% |   38% |   -0% |   18% |   -3% |
AMD Zen 3            |   38% |   35% |    6% |   31% |    5% |
AMD Zen 2            |   24% |   23% |    5% |   30% |    3% |
Intel Skylake        |    9% |    1% |   -4% |   10% |   -7% |

Table 2: AES-256-XCTR throughput improvement, CPU microarchitecture vs. message length in bytes:

                     | 16384 |  4096 |  4095 |  1420 |   512 |   500 |
---------------------+-------+-------+-------+-------+-------+-------+
AMD Zen 5            |  240% |  201% |  216% |  151% |   75% |  108% |
Intel Emerald Rapids |  100% |   99% |  102% |   91% |   94% |  104% |
Intel Ice Lake       |   93% |   89% |   92% |   74% |   50% |   64% |
AMD Zen 4            |   86% |   75% |   83% |   60% |   41% |   52% |
AMD Zen 3            |   73% |   63% |   69% |   45% |   21% |   33% |
AMD Zen 2            |   -2% |   -2% |    2% |    3% |   -1% |   11% |
Intel Skylake        |   -1% |   -1% |    1% |    2% |   -1% |    9% |

                     |   300 |   200 |    64 |    63 |    16 |
---------------------+-------+-------+-------+-------+-------+
AMD Zen 5            |   78% |   56% |   -4% |   38% |   -2% |
Intel Emerald Rapids |   61% |   55% |    4% |   32% |   -5% |
Intel Ice Lake       |   57% |   42% |    3% |   44% |   -4% |
AMD Zen 4            |   35% |   28% |   -1% |   17% |   -3% |
AMD Zen 3            |   26% |   23% |   -3% |   11% |   -6% |
AMD Zen 2            |   13% |   24% |   -1% |   14% |   -3% |
Intel Skylake        |   16% |    8% |   -4% |   35% |   -3% |

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
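
As a reading aid for the tables in the commit message, the following is a minimal sketch of the metric they use: percentage improvement in throughput, where 6000 MB/s to 12000 MB/s is listed as 100%. It is not part of the commit. The function names `improvement_percent` and `speedup_factor` are hypothetical and chosen here for illustration only.

```python
def improvement_percent(old_mbps: float, new_mbps: float) -> float:
    """Percentage improvement in throughput, as the tables define it.

    Hypothetical helper: the commit message gives only the definition
    by example (6000 MB/s -> 12000 MB/s is listed as 100%).
    """
    return (new_mbps - old_mbps) / old_mbps * 100.0


def speedup_factor(improvement_pct: float) -> float:
    """Convert a table entry (percent improvement) to a new/old throughput ratio."""
    return 1.0 + improvement_pct / 100.0


if __name__ == "__main__":
    # The commit message's own example.
    print(improvement_percent(6000, 12000))
    # AMD Zen 5, AES-256-CTR, 16384-byte messages (Table 1 entry: 232%).
    print(round(speedup_factor(232), 2))
    # A regression entry such as -9% means the new code runs at about
    # 0.91 times the old throughput.
    print(round(speedup_factor(-9), 2))
```

Read this way, a 232% entry means the new code delivers roughly 3.3 times the old throughput, not 2.3 times, and a negative entry is a slowdown relative to the old code.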

Diffstat (limited to 'include/linux')

0 files changed, 0 insertions, 0 deletions

