SHA-3 Speed Improvement.
Moderators: Hacker, petermad, Stefan2, white
2ghisler(Author)
Hello, Christian! I'm back with another speed improvement request, SHA-3 this time.
I measured SHA3-256 in TC v11.51 x86-64 and found that its speed is 130 MiB/s, whereas 7-Zip and RapidCRC Unicode reach 220 MiB/s on the same PC.
It looks like TC uses first-generation SHA-3 code, which was really slow, or maybe it was built with an old compiler.
As for the code, C/C++/Delphi implementations can be found here (developers' page):
https://keccak.team/software.html
For x86 I don't think anything can be done, but for x86-64 I hope you can find faster code that is suitable for TC.
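For anyone who wants to reproduce this kind of measurement independently of TC, a minimal sketch in Python follows. It is not how TC, 7-Zip, or RapidCRC measure speed: it relies on the standard library's `hashlib`, and the helper name `throughput_mib_s` and the buffer sizes are assumptions chosen for illustration. The data is hashed from memory so that disk speed does not affect the result.

```python
import hashlib
import time


def throughput_mib_s(algorithm: str = "sha3_256",
                     total_mib: int = 64,
                     chunk_mib: int = 1) -> float:
    """Hash about `total_mib` MiB of in-memory data and return the speed in MiB/s."""
    chunk = bytes(chunk_mib * 1024 * 1024)   # zero-filled buffer, reused for every update
    rounds = total_mib // chunk_mib
    hasher = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(rounds):
        hasher.update(chunk)
    hasher.digest()                          # finalize so the last block is included
    elapsed = time.perf_counter() - start
    return (rounds * chunk_mib) / elapsed


if __name__ == "__main__":
    # Compare SHA3-256 with SHA2-256 on the same machine.
    for name in ("sha3_256", "sha256"):
        print(f"{name}: {throughput_mib_s(name):.0f} MiB/s")
```

The absolute numbers depend on the Python build and the CPU, so they are only comparable with each other, not with the figures from other tools.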
- ghisler(Author)
- Site Admin
- Posts: 50383
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Re: SHA-3 Speed Improvement.
I will have a look at it. Maybe I will use a C library in a DLL as I do for e.g. Blake3. Any recommendations?
Btw, is there a specific reason why you have to use SHA3? I ask because SHA2-256 would be much faster: I'm using Microsoft Crypto API for SHA2-256, which uses hardware acceleration. Sadly Microsoft Crypto API doesn't support SHA3.
Author of Total Commander
https://www.ghisler.com
Re: SHA-3 Speed Improvement.
2ghisler(Author)
Maybe it's a good thing to check out first?
https://github.com/MHumm/DelphiEncryptionCompendium
ghisler(Author) wrote: Maybe I will use a C library in a DLL as I do for e.g. Blake3. Any recommendations?
In this case I'd recommend starting from the official Keccak Team code, it's here:
https://github.com/XKCP/XKCP
or for C++ CryptoPP could do the job
http://www.cryptopp.com/
https://github.com/weidai11/cryptopp
ghisler(Author) wrote: Btw, is there a specific reason why you have to use SHA3?
Yes, many developers nowadays have started to provide the checksums for their products or sources in SHA3-256 only, for whatever reason.
For me it's for checking, not for creating.
ghisler(Author) wrote: I ask because SHA2-256 would be much faster: I'm using Microsoft Crypto API for SHA2-256, which uses hardware acceleration.
None of my PCs has the Intel SHA Extensions (the SHA SIMD instructions) for hardware acceleration, unfortunately, so I use BLAKE3 for my own purposes.
Hardware-accelerated code for SHA-1/SHA-256 is a different story: 1900-2200 MiB/s, depending on the PC.
BTW, Microsoft Crypto API could have SHA-1 hardware acceleration, because the same SIMD instruction set is used for both SHA-1 and SHA-256, but SHA-1 in TC doesn't have hardware acceleration yet.
Last edited by lelik007 on 2025-03-31, 08:21 UTC, edited 5 times in total.
Re: SHA-3 Speed Improvement.
2ghisler(Author)
There's actually another way to get SHA-3: if a user has OpenSSL v1.1.1 - v3 installed, or has libcrypto-1_1-x64.dll / libcrypto-3-x64.dll unpacked to the TC folder, the SHA-3 family is available there. OpenSSL is an optional component for TC, so it can't be the main source of SHA-3, but it could be an optional one.
For me personally, it would be enough if TC could use the OpenSSL library v1.1.1 - v3 for SHA-3 hashing.
- ghisler(Author)
Re: SHA-3 Speed Improvement.
I have now made some tests with the official Keccak Team C code you recommended.
Unfortunately it uses GCC assembler, so I can't use the assembler parts in Visual Studio.
Therefore I could only test the C implementations. Btw, my internal Delphi/Lazarus code already uses Keccak!
Here are speed comparisons for a 6.5 GByte file for SHA3-256:
internal: 30.1s
C code, ref-64bits: 144.8s
C code, plain-64bit: 28.4s
C code, avx-512: 13.7s
So the reference implementation is very slow, and the plain-64bit implementation is only slightly faster than my internal code.
The avx-512 implementation is more than twice as fast, but still only about half as fast as SHA2-256 (6.1 seconds).
Unfortunately modern Intel processors no longer support avx-512 because the efficiency cores are missing it.
Therefore I would prefer to use the avx-2 implementation, but it's in assembly code which masm doesn't understand.
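To compare these timings with the MiB/s figures quoted elsewhere in the thread, they can be converted to throughput. The sketch below assumes the test file is 6.5 GiB (6656 MiB); the post only says "6.5GByte", so the exact size is an assumption and the results are approximate.

```python
FILE_SIZE_MIB = 6.5 * 1024  # assumption: "6.5 GByte" read as 6.5 GiB = 6656 MiB

# Timings in seconds, copied from the post above.
timings_s = {
    "internal": 30.1,
    "C code, ref-64bits": 144.8,
    "C code, plain-64bit": 28.4,
    "C code, avx-512": 13.7,
    "SHA2-256 (Crypto API)": 6.1,
}


def mib_per_second(size_mib: float, seconds: float) -> float:
    """Convert a file size and a wall-clock time into throughput."""
    return size_mib / seconds


for name, seconds in timings_s.items():
    print(f"{name:24s} {mib_per_second(FILE_SIZE_MIB, seconds):7.1f} MiB/s")
```

Under that assumption the internal code comes out at roughly 221 MiB/s and the avx-512 build at roughly 486 MiB/s.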
Re: SHA-3 Speed Improvement.
2ghisler(Author)
All I see in 7-Zip is SHA3-256 at 225 MiB/s on an i7-2600K (no AVX2) and 408 MiB/s on an i5-10500 (AVX2, but no AVX-512).
And these binaries give about the same result:
https://keccak.team/files/KeccakSum-binaries-715fbb4d.zip
This Delphi code isn't faster? It's mentioned as "New SHA-3 permutation kernel" by Eric Grange on the Keccak Team page.
https://bitbucket.org/egrange/dwscript/src/master/Libraries/CryptoLib/
https://bitbucket.org/egrange/dwscript/src/master/Libraries/CryptoLib/dwsSHA3.pas
CryptoPP (https://cryptopp.com/) is well known and used in various software. It's C++, but it can be compiled with Visual Studio 2003 - 2022.
ghisler(Author) wrote: I ask because SHA2-256 would be much faster: I'm using Microsoft Crypto API for SHA2-256, which uses hardware acceleration.
My question still is: why didn't you apply hardware acceleration via Microsoft Crypto API for SHA-1, then? Maybe Microsoft Crypto API doesn't have hardware acceleration for SHA-1?
Last edited by lelik007 on 2025-04-10, 06:07 UTC, edited 1 time in total.
- ghisler(Author)
Re: SHA-3 Speed Improvement.
lelik007 wrote: This Delphi code isn't faster?
No, unfortunately it isn't. I have now tried it, and repeated the tests with the entire file in cache:
old internal Delphi code: 30.1s
new internal Delphi code: 29.8s
That's 222 MiB/s on my i7-11700 (no K, 65W version), so about the same as what you get.
So although the two implementations look quite different, they essentially do the same thing. They are also on par with the plain-64bit C code.
Only the hardware accelerated codes seem to be faster.
lelik007 wrote: My question still is: why didn't you apply hardware acceleration via Microsoft Crypto API for SHA-1, then?
I'm using Crypto API for MD5, SHA1, and SHA2 (except for SHA224, which isn't supported).
Re: SHA-3 Speed Improvement.
2ghisler(Author)
ghisler(Author) wrote: Only the hardware accelerated codes seem to be faster.
Yes, but of course they can't compete with a dedicated SIMD instruction set like the one SHA-1/SHA-256 has.
ghisler(Author) wrote: That's 222 MiB/s on my i7-11700 (no K, 65W version), so about the same as what you get.
The AVX2 version from here should perform slightly better, ~400-420 MiB/s:
https://keccak.team/files/KeccakSum-binaries-715fbb4d.zip
11th Gen is the only Intel desktop CPU generation that has AVX-512F; the previous and following generations don't have this instruction set.
And as you measured:
ghisler(Author) wrote: C code, avx-512: 13.7s
= 485 MiB/s. Yes, this is what I expected: not 1-2 GiB/s for sure, but a speed of 400-500 MiB/s at least keeps up with a SATA-3 SSD.
But of course no SHA-3 variant is so widespread that I need it every day, for example.
ghisler(Author) wrote: I'm using crypto API for MD5, SHA1, and SHA2 (except for SHA224, which isn't supported).
Thank you, I understand: Microsoft Crypto API doesn't have hardware acceleration for SHA-1.
- ghisler(Author)
Re: SHA-3 Speed Improvement.
OK, I have converted the GNU Assembler code to Intel syntax with the following trick:
1. Compiled the KeccakP-1600-AVX2.s code with GNU assembler (as) on Ubuntu to a.out
2. Disassembled a.out on Windows with objconv.exe: objconv.exe -fmasm a.out a.asm
3. Now the real "fun" started: GCC uses a different calling convention than Microsoft Visual C++, so I had to move the parameters between registers in various functions. While GCC passes the first 6 parameters in registers RDI, RSI, RDX, RCX, R8, and R9, 64-bit Visual C++ passes the first four in RCX, RDX, R8, and R9.
Result: It works! It's not as impressive as avx-512, but 16.1s for hashing 6.5GB is twice as fast as the internal code. And the DLL is just 90 kBytes, so I will definitely include it in the next 64-bit version.
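The register shuffling in step 3 can be sketched as follows. This is not the actual port described above, only an illustration of the general idea. The helper `ms_to_sysv_shim` and the names `KeccakWrapper` and `KeccakTarget` are made up here. The helper prints a MASM-style wrapper that receives up to four integer or pointer arguments in the Microsoft x64 registers and moves them into the System V registers that the converted code expects. It also saves RDI and RSI, which are callee-saved under the Microsoft convention but used for arguments under System V. Saving XMM6-XMM15 (also callee-saved on Windows), unwind information, and more than four arguments are deliberately left out.

```python
# Integer/pointer argument registers, in order, for the two x86-64 conventions.
SYSV_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]   # GCC on Linux
MS_ARG_REGS = ["rcx", "rdx", "r8", "r9"]                   # Visual C++ x64


def ms_to_sysv_shim(wrapper: str, target: str, n_args: int) -> str:
    """Return MASM-style text for a wrapper that is called with the Microsoft
    convention and forwards up to four register arguments to `target`, which
    expects the System V convention. Hypothetical helper, for illustration."""
    if not 0 <= n_args <= len(MS_ARG_REGS):
        raise ValueError("only 0-4 register arguments are handled in this sketch")
    lines = [
        f"{wrapper} PROC",
        "    push rdi            ; callee-saved on Windows, argument register on Linux",
        "    push rsi            ; same",
        "    sub  rsp, 8         ; keep the stack 16-byte aligned at the call",
    ]
    # Moving in argument order is safe: rcx and rdx are read as sources
    # before a later move overwrites them.
    for dst, src in list(zip(SYSV_ARG_REGS, MS_ARG_REGS))[:n_args]:
        lines.append(f"    mov  {dst}, {src}")
    lines += [
        f"    call {target}",
        "    add  rsp, 8",
        "    pop  rsi",
        "    pop  rdi",
        "    ret",
        f"{wrapper} ENDP",
    ]
    return "\n".join(lines)


print(ms_to_sysv_shim("KeccakWrapper", "KeccakTarget", 3))
```

A wrapper keeps the converted assembly untouched at the cost of one extra call; editing the register usage inside each function, as described in the post, avoids that call but has to be repeated for every exported function.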
Re: SHA-3 Speed Improvement.
2ghisler(Author)
ghisler(Author) wrote: It's not as impressive as avx-512, but 16.1s for hashing 6.5GB is twice as fast as the internal code.
This is about the same result I got with the i5-10500, ~410 MiB/s. I hope it has a fallback for users with older CPUs without AVX2.
The devs wrote that there should be something without any assembly needed:
XKCP README: Note that the AVX2noAsm and AVX512noAsm targets provide alternatives to AVX2 and AVX512, respectively, without assembly implementations.
Read the section linked below, maybe it helps to implement this thing better.
https://github.com/XKCP/XKCP?tab=readme-ov-file#microsoft-visual-studio-support
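A fallback for CPUs without AVX2 is usually handled by choosing an implementation at run time. The sketch below is not TC's code: the backend names and the helper `pick_sha3_backend` are hypothetical, and the feature flags are passed in as a plain set so that the selection logic stays independent of how the CPU is queried. Reading `/proc/cpuinfo` is shown only as one Linux-specific way to obtain the flags.

```python
from pathlib import Path

# Hypothetical backends, fastest first, each with the CPU flag it needs
# (None means it runs on any x86-64 CPU).
BACKENDS = [
    ("avx512", "avx512f"),
    ("avx2", "avx2"),
    ("plain-64bit", None),
]


def pick_sha3_backend(cpu_flags: set[str], allow_avx512: bool = True) -> str:
    """Return the name of the fastest backend the CPU flags allow."""
    for name, required_flag in BACKENDS:
        if name == "avx512" and not allow_avx512:
            continue
        if required_flag is None or required_flag in cpu_flags:
            return name
    raise RuntimeError("no usable backend")  # not reached while a plain backend exists


def linux_cpu_flags() -> set[str]:
    """Read the CPU feature flags from /proc/cpuinfo (Linux only)."""
    for line in Path("/proc/cpuinfo").read_text().splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


print(pick_sha3_backend({"sse2", "avx"}))            # plain-64bit
print(pick_sha3_backend({"sse2", "avx", "avx2"}))    # avx2
```

With a table like this, a CPU such as the i7-2600K mentioned earlier (no AVX2) would simply end up on the plain 64-bit code.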
- ghisler(Author)
Re: SHA-3 Speed Improvement.
This doesn't compile here, it returns several errors. My asm port works fine so far.
Re: SHA-3 Speed Improvement.
2ghisler(Author)
Ok! It's a good thing to give SHA-3 some boost, thank you, I'll test it during the next beta.