<feed xmlns='http://www.w3.org/2005/Atom'>
<title>kernel/linux.git/lib/crypto/x86, branch master</title>
<subtitle>Linux kernel stable tree (mirror)</subtitle>
<id>https://git.radix-linux.su/kernel/linux.git/atom?h=master</id>
<link rel='self' href='https://git.radix-linux.su/kernel/linux.git/atom?h=master'/>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/'/>
<updated>2026-03-24T00:50:59+00:00</updated>
<entry>
<title>lib/crypto: x86/sm3: Migrate optimized code into library</title>
<updated>2026-03-24T00:50:59+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2026-03-21T04:09:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=17ba6108d3e084652807826cc49c851c00976f1a'/>
<id>urn:sha1:17ba6108d3e084652807826cc49c851c00976f1a</id>
<content type='text'>
Instead of exposing the x86-optimized SM3 code via an x86-specific
crypto_shash algorithm, just implement the sm3_blocks() library
function.  This is much simpler, it makes the SM3 library functions
x86-optimized, and it fixes the longstanding issue where the
x86-optimized SM3 code was disabled by default.  SM3 still remains
available through crypto_shash, but individual architectures no longer
need to handle it.

Tweak the prototype of sm3_transform_avx() to match what the library
expects, including changing the block count to size_t.  Note that the
assembly code actually already treated this argument as size_t.

Acked-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20260321040935.410034-10-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crypto: x86/ghash: Migrate optimized code into library</title>
<updated>2026-03-23T23:44:29+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2026-03-19T06:17:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=3e79c8ec49596288c4460029c4971b9c838103b9'/>
<id>urn:sha1:3e79c8ec49596288c4460029c4971b9c838103b9</id>
<content type='text'>
Remove the "ghash-pclmulqdqni" crypto_shash algorithm.  Move the
corresponding assembly code into lib/crypto/, and wire it up to the
GHASH library.

This makes the GHASH library optimized with x86's carryless
multiplication instructions.  It also greatly reduces the amount of
x86-specific glue code needed, and it fixes the issue where this
GHASH optimization was disabled by default.

Rename and adjust the prototypes of the assembly functions to make them
fit better with the library.  Remove the byte-swaps (pshufb
instructions) that are no longer necessary because the library keeps the
accumulator in POLYVAL format rather than GHASH format.

Rename clmul_ghash_mul() to polyval_mul_pclmul() to reflect that it
really does a POLYVAL style multiplication.  Wire it up to both
ghash_mul_arch() and polyval_mul_arch().

Acked-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20260319061723.1140720-15-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crypto: gf128hash: Support GF128HASH_ARCH without all POLYVAL functions</title>
<updated>2026-03-23T20:15:13+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2026-03-19T06:17:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=b3b6e8f9b38911e9b30a5abe845541ade0797327'/>
<id>urn:sha1:b3b6e8f9b38911e9b30a5abe845541ade0797327</id>
<content type='text'>
Currently, some architectures (arm64 and x86) have optimized code for
both GHASH and POLYVAL.  Others (arm, powerpc, riscv, and s390) have
optimized code only for GHASH.  While POLYVAL support could be
implemented on these other architectures, until then we need to support
the case where arch-optimized functions are present only for GHASH.

Therefore, update the support for arch-optimized POLYVAL functions to
allow architectures to opt into supporting these functions individually.

The new meaning of CONFIG_CRYPTO_LIB_GF128HASH_ARCH is that some level
of GHASH and/or POLYVAL acceleration is provided.

Also provide an implementation of polyval_mul() based on
polyval_blocks_arch(), for when polyval_mul_arch() isn't implemented.

Acked-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20260319061723.1140720-3-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crypto: gf128hash: Rename polyval module to gf128hash</title>
<updated>2026-03-23T20:15:13+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2026-03-19T06:17:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=61f66c5216a961784b12307be60a25204525605c'/>
<id>urn:sha1:61f66c5216a961784b12307be60a25204525605c</id>
<content type='text'>
Currently, the standalone GHASH code is coupled with crypto_shash.  This
has resulted in unnecessary complexity and overhead, as well as the code
being unavailable to library code such as the AES-GCM library.  As was
done with POLYVAL, it needs to find a new home in lib/crypto/.

GHASH and POLYVAL are closely related and can each be implemented in
terms of each other.  Optimized code for one can be reused with the
other.  Moreover, since GHASH tends to be difficult to implement directly
due to its unnatural bit order, most modern GHASH implementations
(including the existing arm, arm64, powerpc, and x86 optimized GHASH
code, and the new generic GHASH code I'll be adding) actually
reinterpret the GHASH computation as an equivalent POLYVAL computation,
pre- and post-processing the inputs and outputs to map to/from POLYVAL.

Given this close relationship, it makes sense to group the GHASH and
POLYVAL code together in the same module.  This gives us a wide range of
options for implementing them, reusing code between the two and properly
utilizing whatever instructions each architecture provides.

Thus, GHASH support will be added to the library module that is
currently called "polyval".  Rename it to an appropriate name:
"gf128hash".  Rename files, options, functions, etc. where appropriate
to reflect the upcoming sharing with GHASH.  (Note: polyval_kunit is not
renamed, as ghash_kunit will be added alongside it instead.)

Acked-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20260319061723.1140720-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crypto: x86/sha256: PHE Extensions optimized SHA256 transform function</title>
<updated>2026-03-14T18:44:18+00:00</updated>
<author>
<name>AlanSong-oc</name>
<email>AlanSong-oc@zhaoxin.com</email>
</author>
<published>2026-03-13T08:01:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=44b02a14d993d91ae36409a54941ac5a5ad20b44'/>
<id>urn:sha1:44b02a14d993d91ae36409a54941ac5a5ad20b44</id>
<content type='text'>
Zhaoxin CPUs implement SHA (Secure Hash Algorithm) as CPU instructions
via the PHE (Padlock Hash Engine) extensions, which provide the XSHA1,
XSHA256, XSHA384, and XSHA512 instructions. The instruction
specification is available at the following link.
(https://gitee.com/openzhaoxin/zhaoxin_specifications/blob/20260227/ZX_Padlock_Reference.pdf)

With SHA implemented in hardware instead of software, applications can
achieve higher performance, better security, and more flexibility.

This patch includes the XSHA256 instruction optimized implementation of
SHA-256 transform function.

The table below shows the benchmark results before and after applying
this patch by using CRYPTO_LIB_BENCHMARK on Zhaoxin KX-7000 platform,
highlighting the achieved speedups.

+---------+--------------------------+
|         |          SHA256          |
+---------+--------+-----------------+
|   Len   | Before |      After      |
+---------+--------+-----------------+
|      1* |    2   |    7 (3.50x)    |
|     16  |   35   |  119 (3.40x)    |
|     64  |   74   |  280 (3.78x)    |
|    127  |   99   |  387 (3.91x)    |
|    128  |  103   |  427 (4.15x)    |
|    200  |  123   |  537 (4.37x)    |
|    256  |  128   |  582 (4.55x)    |
|    511  |  144   |  679 (4.72x)    |
|    512  |  146   |  714 (4.89x)    |
|   1024  |  157   |  796 (5.07x)    |
|   3173  |  167   |  883 (5.28x)    |
|   4096  |  166   |  876 (5.28x)    |
|  16384  |  169   |  899 (5.32x)    |
+---------+--------+-----------------+
*: The length of each data block to be processed by one complete SHA
   sequence.
**: The throughput of processing the data blocks, in MB/s.

After applying this patch, the SHA256 KUnit test suite passes on Zhaoxin
platforms. Detailed test logs are shown below.

[    7.767257]     # Subtest: sha256
[    7.770542]     # module: sha256_kunit
[    7.770544]     1..15
[    7.777383]     ok 1 test_hash_test_vectors
[    7.788563]     ok 2 test_hash_all_lens_up_to_4096
[    7.806090]     ok 3 test_hash_incremental_updates
[    7.813553]     ok 4 test_hash_buffer_overruns
[    7.822384]     ok 5 test_hash_overlaps
[    7.829388]     ok 6 test_hash_alignment_consistency
[    7.833843]     ok 7 test_hash_ctx_zeroization
[    7.915191]     ok 8 test_hash_interrupt_context_1
[    8.362312]     ok 9 test_hash_interrupt_context_2
[    8.401607]     ok 10 test_hmac
[    8.415458]     ok 11 test_sha256_finup_2x
[    8.419397]     ok 12 test_sha256_finup_2x_defaultctx
[    8.424107]     ok 13 test_sha256_finup_2x_hugelen
[    8.451289]     # benchmark_hash: len=1: 7 MB/s
[    8.465372]     # benchmark_hash: len=16: 119 MB/s
[    8.481760]     # benchmark_hash: len=64: 280 MB/s
[    8.499344]     # benchmark_hash: len=127: 387 MB/s
[    8.515800]     # benchmark_hash: len=128: 427 MB/s
[    8.531970]     # benchmark_hash: len=200: 537 MB/s
[    8.548241]     # benchmark_hash: len=256: 582 MB/s
[    8.564838]     # benchmark_hash: len=511: 679 MB/s
[    8.580872]     # benchmark_hash: len=512: 714 MB/s
[    8.596858]     # benchmark_hash: len=1024: 796 MB/s
[    8.612567]     # benchmark_hash: len=3173: 883 MB/s
[    8.628546]     # benchmark_hash: len=4096: 876 MB/s
[    8.644482]     # benchmark_hash: len=16384: 899 MB/s
[    8.649773]     ok 14 benchmark_hash
[    8.655505]     ok 15 benchmark_sha256_finup_2x # SKIP not relevant
[    8.659065] # sha256: pass:14 fail:0 skip:1 total:15
[    8.665276] # Totals: pass:14 fail:0 skip:1 total:15
[    8.670195] ok 7 sha256

Signed-off-by: AlanSong-oc &lt;AlanSong-oc@zhaoxin.com&gt;
Link: https://lore.kernel.org/r/20260313080150.9393-3-AlanSong-oc@zhaoxin.com
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crypto: x86/aes: Add AES-NI optimization</title>
<updated>2026-01-15T22:09:07+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2026-01-12T19:20:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=24eb22d8161380eba65edc5b499299639cbe8bf9'/>
<id>urn:sha1:24eb22d8161380eba65edc5b499299639cbe8bf9</id>
<content type='text'>
Optimize the AES library with x86 AES-NI instructions.

The relevant existing assembly functions, aesni_set_key(), aesni_enc(),
and aesni_dec(), are a bit difficult to extract into the library:

- They're coupled to the code for the AES modes.
- They operate on struct crypto_aes_ctx.  The AES library now uses
  different structs.
- They assume the key is 16-byte aligned.  The AES library only
  *prefers* 16-byte alignment; it doesn't require it.

Moreover, they're not all that great in the first place:

- They use unrolled loops, which isn't a great choice on x86.
- They use the 'aeskeygenassist' instruction, which is unnecessary, is
  slow on Intel CPUs, and forces the loop to be unrolled.
- They have special code for AES-192 key expansion, despite that being
  kind of useless.  AES-128 and AES-256 are the ones used in practice.

These are small functions anyway.

Therefore, I opted to just write replacements of these functions for the
library.  They address all the above issues.

Acked-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20260112192035.10427-18-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crypto: x86/nh: Migrate optimized code into library</title>
<updated>2026-01-12T19:07:50+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2025-12-11T01:18:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=a229d83235c7627c490deb7dd4744a72567cea12'/>
<id>urn:sha1:a229d83235c7627c490deb7dd4744a72567cea12</id>
<content type='text'>
Migrate the x86_64 implementations of NH into lib/crypto/.  This makes
the nh() function optimized on x86_64 kernels.

Note: this temporarily makes the adiantum template not utilize the
x86_64 optimized NH code.  This is resolved in a later commit that
converts the adiantum template to use nh() instead of "nhpoly1305".

Link: https://lore.kernel.org/r/20251211011846.8179-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crypto: x86/polyval: Migrate optimized code into library</title>
<updated>2025-11-11T19:03:38+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2025-11-09T23:47:20+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=4d8da35579daad0392d238460ed7e9629d49ca35'/>
<id>urn:sha1:4d8da35579daad0392d238460ed7e9629d49ca35</id>
<content type='text'>
Migrate the x86_64 implementation of POLYVAL into lib/crypto/, wiring it
up to the POLYVAL library interface.  This makes the POLYVAL library
properly optimized on x86_64.

This drops the x86_64 optimizations of polyval in the crypto_shash API.
That's fine, since polyval will be removed from crypto_shash entirely
since it is unneeded there.  But even if it comes back, the crypto_shash
API could just be implemented on top of the library API, as usual.

Adjust the names and prototypes of the assembly functions to align more
closely with the rest of the library code.

Also replace a movaps instruction with movups to remove the assumption
that the key struct is 16-byte aligned.  Users can still align the key
if they want (and at least in this case, movups is just as fast as
movaps), but it's inconvenient to require it.

Reviewed-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20251109234726.638437-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crypto: x86/blake2s: Use vpternlogd for 3-input XORs</title>
<updated>2025-11-06T04:30:52+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2025-11-02T23:42:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=8ba60c5914f25a44f10189c6919a737b199f6dbf'/>
<id>urn:sha1:8ba60c5914f25a44f10189c6919a737b199f6dbf</id>
<content type='text'>
AVX-512 supports 3-input XORs via the vpternlogd (or vpternlogq)
instruction with immediate 0x96.  This approach, vs. the alternative of
two vpxor instructions, is already used in the CRC, AES-GCM, and AES-XTS
code, since it reduces the instruction count and is faster on some CPUs.
Make blake2s_compress_avx512() take advantage of it too.

Reviewed-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20251102234209.62133-7-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
<entry>
<title>lib/crypto: x86/blake2s: Avoid writing back unchanged 'f' value</title>
<updated>2025-11-06T04:30:52+00:00</updated>
<author>
<name>Eric Biggers</name>
<email>ebiggers@kernel.org</email>
</author>
<published>2025-11-02T23:42:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.radix-linux.su/kernel/linux.git/commit/?id=cd5528621abb01664a477392cd3e76be2ef6296b'/>
<id>urn:sha1:cd5528621abb01664a477392cd3e76be2ef6296b</id>
<content type='text'>
Just before returning, blake2s_compress_ssse3() and
blake2s_compress_avx512() store updated values to the 'h', 't', and 'f'
fields of struct blake2s_ctx.  But 'f' is always unchanged (which is
correct; only the C code changes it).  So, there's no need to write to
'f'.  Use 64-bit stores (movq and vmovq) instead of 128-bit stores
(movdqu and vmovdqu) so that only 't' is written.

Reviewed-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Link: https://lore.kernel.org/r/20251102234209.62133-6-ebiggers@kernel.org
Signed-off-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
</content>
</entry>
</feed>
