SHA-3 Speed Improvement

lelik007
Member
Posts: 173
Joined: 2021-04-20, 06:37 UTC

Post by *lelik007 »

2ghisler(Author)
Hello, Christian! I'm back again about speed improvements, this time for SHA-3.

I measured SHA3-256 in TC v11.51 x86-64 and found that it runs at 130 MiB/s, while 7-Zip and RapidCRC Unicode reach 220 MiB/s on the same PC.

It looks like TC uses first-generation SHA-3 code, which was really slow, or maybe it was built with an old compiler.
As for code, C/C++/Delphi implementations can be found on the developers' page:
https://keccak.team/software.html

For x86 I don't think anything can be done, but for x86-64 I hope you can find faster code that is suitable for TC.
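
A rough way to reproduce this kind of MiB/s comparison is sketched below. It uses Python's `hashlib` on an in-memory buffer, so it measures neither TC nor 7-Zip; the function name `throughput_mib_s` and the buffer sizes are assumptions chosen for illustration only.

```python
import hashlib
import time


def throughput_mib_s(algorithm: str, size_mib: int = 64, chunk_mib: int = 1) -> float:
    """Hash about size_mib MiB of in-memory data and return the speed in MiB/s."""
    chunk = bytes(chunk_mib * 1024 * 1024)  # zero-filled buffer, so no disk I/O is involved
    rounds = size_mib // chunk_mib
    hasher = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(rounds):
        hasher.update(chunk)
    hasher.digest()
    elapsed = time.perf_counter() - start
    return (rounds * chunk_mib) / elapsed


if __name__ == "__main__":
    # The numbers depend on the CPU and on how this Python build implements each hash.
    for name in ("sha3_256", "sha256", "sha1"):
        print(f"{name}: {throughput_mib_s(name):.0f} MiB/s")
```

Because the data never leaves memory, the result reflects the hash implementation alone, which is the quantity being compared in this thread.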
ghisler(Author)
Site Admin
Posts: 50383
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Re: SHA-3 Speed Improvement

Post by *ghisler(Author) »

I will have a look at it. Maybe I will use a C library in a DLL as I do for e.g. Blake3. Any recommendations?

Btw, is there a specific reason why you have to use SHA3? I ask because SHA2-256 would be much faster: I'm using Microsoft Crypto API for SHA2-256, which uses hardware acceleration. Sadly Microsoft Crypto API doesn't support SHA3.
Author of Total Commander
https://www.ghisler.com

Post by *lelik007 »

2ghisler(Author)
Maybe this is worth checking out first?
https://github.com/MHumm/DelphiEncryptionCompendium

ghisler(Author) wrote: "Maybe I will use a C library in a DLL as I do for e.g. Blake3. Any recommendations?"
In that case I'd recommend starting with the official Keccak Team code:
https://github.com/XKCP/XKCP
Or, for C++, Crypto++ (CryptoPP) could do the job:
http://www.cryptopp.com/
https://github.com/weidai11/cryptopp

ghisler(Author) wrote: "Btw, is there a specific reason why you have to use SHA3?"
Yes, many developers nowadays provide the checksums for their products or sources in SHA3-256 only, for whatever reason.
I need it for verifying checksums, not for creating them.

ghisler(Author) wrote: "I ask because SHA2-256 would be much faster: I'm using Microsoft Crypto API for SHA2-256, which uses hardware acceleration."
Unfortunately none of my PCs has the Intel SHA Extensions (the SHA instructions) needed for hardware acceleration, so I use BLAKE3 for my own purposes.

Hardware-accelerated code for SHA-1/SHA-256 is a different story: 1900-2200 MiB/s, depending on the PC.
BTW, Microsoft Crypto API could have SHA-1 hardware acceleration too, because the same instruction set extension covers both SHA-1 and SHA-256, but SHA-1 in TC doesn't have hardware acceleration yet.
Last edited by lelik007 on 2025-03-31, 08:21 UTC, edited 5 times in total.

Post by *lelik007 »

2ghisler(Author)
There's actually another way to get SHA-3: if a user has OpenSSL v1.1.1 - v3 installed, or has libcrypto-1_1-x64.dll / libcrypto-3-x64.dll unpacked to the TC folder, the SHA-3 family is available there. OpenSSL is an optional component for TC, so it can't be the main source of SHA-3, but it could be an optional one.

For me personally, it would be enough if TC could use the OpenSSL library v1.1.1 - v3 for SHA-3 hashing.
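
The "optional source" idea could look roughly like the sketch below: try to load libcrypto, use its EVP digest functions for SHA3-256 if they are present, and otherwise fall back to a built-in implementation. It is written in Python with `ctypes` purely for illustration. The helper names (`load_libcrypto`, `sha3_256_openssl`, `sha3_256`) are hypothetical, and `hashlib` merely stands in for TC's internal code, which is not shown in this thread.

```python
import ctypes
import ctypes.util
import hashlib


def load_libcrypto():
    """Return a ctypes handle to a libcrypto that offers SHA-3, or None."""
    # The two DLL names mentioned above, plus whatever the system resolves "crypto" to.
    candidates = ["libcrypto-3-x64.dll", "libcrypto-1_1-x64.dll"]
    system_name = ctypes.util.find_library("crypto")
    if system_name:
        candidates.append(system_name)
    for name in candidates:
        try:
            lib = ctypes.CDLL(name)
        except OSError:
            continue  # not installed under this name
        if hasattr(lib, "EVP_sha3_256"):  # symbol is missing in builds older than 1.1.1
            return lib
    return None


def sha3_256_openssl(lib, data: bytes) -> bytes:
    """Compute SHA3-256 through OpenSSL's EVP digest interface."""
    lib.EVP_MD_CTX_new.restype = ctypes.c_void_p
    lib.EVP_sha3_256.restype = ctypes.c_void_p
    lib.EVP_MD_CTX_free.argtypes = [ctypes.c_void_p]
    lib.EVP_MD_CTX_free.restype = None
    lib.EVP_DigestInit_ex.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p]
    lib.EVP_DigestUpdate.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t]
    lib.EVP_DigestFinal_ex.argtypes = [ctypes.c_void_p, ctypes.c_void_p,
                                       ctypes.POINTER(ctypes.c_uint)]
    ctx = lib.EVP_MD_CTX_new()
    if not ctx:
        raise MemoryError("EVP_MD_CTX_new failed")
    try:
        out = ctypes.create_string_buffer(32)  # SHA3-256 produces 32 bytes
        out_len = ctypes.c_uint(0)
        ok = (lib.EVP_DigestInit_ex(ctx, lib.EVP_sha3_256(), None)
              and lib.EVP_DigestUpdate(ctx, data, len(data))
              and lib.EVP_DigestFinal_ex(ctx, out, ctypes.byref(out_len)))
        if not ok:
            raise RuntimeError("OpenSSL digest call failed")
        return out.raw[:out_len.value]
    finally:
        lib.EVP_MD_CTX_free(ctx)


def sha3_256(data: bytes, lib=None) -> bytes:
    """Use OpenSSL when a handle is given, else the built-in stand-in."""
    if lib is None:
        return hashlib.sha3_256(data).digest()
    return sha3_256_openssl(lib, data)


if __name__ == "__main__":
    handle = load_libcrypto()
    print("OpenSSL available:", handle is not None)
    print(sha3_256(b"abc", handle).hex())
```

With this shape, a missing or outdated OpenSSL simply means the fallback path is taken, so the optional component never becomes a hard requirement.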

Post by *ghisler(Author) »

I have now run some tests with the official Keccak Team C code you recommended.
Unfortunately its assembly parts are written for the GNU assembler, so I can't use them in Visual Studio.
Therefore I could only test the C implementations. Btw, my internal Delphi/Lazarus code
already uses Keccak!

Here are the SHA3-256 timings for a 6.5 GByte file:
internal: 30.1s
C code, ref-64bits: 144.8s
C code, plain-64bit: 28.4s
C code, avx-512: 13.7s

So the reference implementation is very slow, and the plain-64bit implementation is only slightly faster than my internal code.
The avx-512 implementation is about 2.2 times as fast as the internal code, but still less than half as fast as SHA2_256 (6.1 seconds).
Unfortunately recent Intel consumer processors no longer support avx-512 because their efficiency cores lack it.
Therefore I would prefer to use the avx-2 implementation, but it's in assembly code which MASM doesn't understand.
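
To relate these timings to the MiB/s figures used elsewhere in the thread, the conversion can be sketched as follows. It assumes that "6.5 GByte" means 6.5 GiB (6656 MiB), which is an assumption, although it is the reading that matches the MiB/s values quoted in later posts. The function names are made up for this example.

```python
FILE_SIZE_MIB = 6.5 * 1024  # assumption: "6.5 GByte" is read as 6.5 GiB

# Timings in seconds, copied from the post above.
TIMINGS_S = {
    "internal": 30.1,
    "C code, ref-64bits": 144.8,
    "C code, plain-64bit": 28.4,
    "C code, avx-512": 13.7,
    "SHA2_256": 6.1,
}


def mib_per_second(seconds: float) -> float:
    """Throughput for hashing the whole test file in the given time."""
    return FILE_SIZE_MIB / seconds


def speedup(baseline_s: float, other_s: float) -> float:
    """How many times faster 'other' is than 'baseline'."""
    return baseline_s / other_s


if __name__ == "__main__":
    base = TIMINGS_S["internal"]
    for name, secs in TIMINGS_S.items():
        print(f"{name:22s} {mib_per_second(secs):7.0f} MiB/s  "
              f"{speedup(base, secs):4.1f}x vs internal")
```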

Post by *lelik007 »

2ghisler(Author)
All I see in 7-Zip for SHA3-256 is 225 MiB/s on an i7-2600K (no AVX2) and 408 MiB/s on an i5-10500 (AVX2, but no AVX-512).
These binaries give about the same result:
https://keccak.team/files/KeccakSum-binaries-715fbb4d.zip

This Delphi code isn't faster? It's listed as "New SHA-3 permutation kernel" by Eric Grange on the Keccak Team page.
https://bitbucket.org/egrange/dwscript/src/master/Libraries/CryptoLib/
https://bitbucket.org/egrange/dwscript/src/master/Libraries/CryptoLib/dwsSHA3.pas

Crypto++ (https://cryptopp.com/) is well known and used in a variety of software. It's C++, but it can be compiled with Visual Studio 2003 - 2022.

ghisler(Author) wrote: "I ask because SHA2-256 would be much faster: I'm using Microsoft Crypto API for SHA2-256, which uses hardware acceleration."
My question still is: why didn't you apply hardware acceleration via Microsoft Crypto API for SHA-1, then? Maybe Microsoft Crypto API doesn't have hardware acceleration for SHA-1?
Last edited by lelik007 on 2025-04-10, 06:07 UTC, edited 1 time in total.

Post by *ghisler(Author) »

lelik007 wrote: "This Delphi code isn't faster?"
No, unfortunately it isn't. I have now tried it and repeated the tests with the entire file in cache:
old internal Delphi code: 30.1s
new internal Delphi code: 29.8s
That's 222 MiB/s on my i7-11700 (no K, 65W version), so about the same as what you get.
So although the two implementations look quite different, they perform essentially the same. They are also on par with the plain-64bit C code.
Only the hardware accelerated codes seem to be faster.

lelik007 wrote: "My question still is: why didn't you apply hardware acceleration via Microsoft Crypto API for SHA-1, then?"
I'm using crypto API for MD5, SHA1, and SHA2 (except for SHA224, which isn't supported).

Post by *lelik007 »

2ghisler(Author)
ghisler(Author) wrote: "Only the hardware accelerated codes seem to be faster."
Yes, but of course they can't compete with a dedicated instruction set extension like the one SHA-1/SHA-256 have.

ghisler(Author) wrote: "That's 222 MiB/s on my i7-11700 (no K, 65W version), so about the same as what you get."
The AVX2 version from here should perform noticeably better, at about 400-420 MiB/s:
https://keccak.team/files/KeccakSum-binaries-715fbb4d.zip
11th Gen is the only Intel desktop CPU generation with AVX-512F; the previous and later generations don't have it.
And as you measured:
C code, avx-512: 13.7s

= 485 MiB/s. Yes, this is what I expected: certainly not 1-2 GiB/s, but 400-500 MiB/s at least keeps up with a SATA-3 SSD.

But of course no SHA-3 variant is so widespread that I would need it every day, for example.

ghisler(Author) wrote: "I'm using crypto API for MD5, SHA1, and SHA2 (except for SHA224, which isn't supported)."
Thank you, I understand: Microsoft Crypto API doesn't have hardware acceleration for SHA-1.

Post by *ghisler(Author) »

OK, I have converted the GNU assembler code to Intel (MASM) syntax with the following trick:
1. Assembled the KeccakP-1600-AVX2.s code with the GNU assembler (as) on Ubuntu to a.out
2. Disassembled a.out on Windows with objconv.exe: objconv.exe -fmasm a.out a.asm
3. Now the real "fun" started: GCC uses a different calling convention than Microsoft Visual C++, so I had to move the parameters between registers in various functions. While GCC passes the first 6 parameters in registers RDI, RSI, RDX, RCX, R8, and R9, 64-bit Visual C++ passes the first four in RCX, RDX, R8, and R9.
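
The register shuffling in step 3 can be illustrated with a small sketch. It is not the actual port: it only models the two argument-register orders listed above and prints MASM-style `mov` lines that a wrapper for a function with up to four integer parameters would need. The helper names are hypothetical. One known detail of the Microsoft x64 convention is included: RDI and RSI are callee-saved there, so a wrapper that overwrites them has to save and restore them.

```python
# Integer argument registers, in parameter order.
SYSV_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]  # GCC on Linux (System V AMD64)
WIN64_ARG_REGS = ["rcx", "rdx", "r8", "r9"]               # Microsoft x64

# General-purpose registers that a callee must preserve under Microsoft x64.
WIN64_NONVOLATILE = {"rbx", "rbp", "rdi", "rsi", "rsp", "r12", "r13", "r14", "r15"}


def entry_moves(param_count: int) -> list:
    """Moves that copy Win64 argument registers to where the System V code expects them."""
    if not 0 <= param_count <= len(WIN64_ARG_REGS):
        raise ValueError("only register-passed parameters (0 to 4) are modelled here")
    moves = []
    overwritten = set()
    for dst, src in zip(SYSV_ARG_REGS[:param_count], WIN64_ARG_REGS[:param_count]):
        # Every source must be read before an earlier move has overwritten it.
        if src in overwritten:
            raise RuntimeError("move order would clobber a source register")
        moves.append(f"mov {dst}, {src}")
        overwritten.add(dst)
    return moves


def registers_to_preserve(param_count: int) -> list:
    """Destination registers that the Windows caller expects to stay unchanged."""
    return [r for r in SYSV_ARG_REGS[:param_count] if r in WIN64_NONVOLATILE]


if __name__ == "__main__":
    for line in entry_moves(4):
        print(line)
    print("save/restore:", registers_to_preserve(4))
```

Emitting the moves in parameter order happens to be safe here, because each Win64 source register is read before any later move writes to it. Under the same convention XMM6-XMM15 are callee-saved as well, which may matter for vector code; whether the ported routine touches them is not stated in this thread.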

Result: It works! It's not as impressive as avx-512, but 16.1s for hashing 6.5GB is twice as fast as the internal code. And the DLL is just 90 kBytes, so I will definitely include it in the next 64-bit version.

Post by *lelik007 »

2ghisler(Author)
ghisler(Author) wrote: "It's not as impressive as avx-512, but 16.1s for hashing 6.5GB is twice as fast as the internal code."
This is about the same result I got with the i5-10500 (~410 MiB/s). I hope it has a fallback for users with older CPUs without AVX2.
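
Such a fallback usually comes down to a runtime CPU feature check. The sketch below shows the idea in Python and is only an illustration: the backend names `"avx2-dll"` and `"internal"` are placeholders, and nothing here describes how TC actually selects its code path. On Windows it asks `IsProcessorFeaturePresent`, on Linux it reads `/proc/cpuinfo`, and on anything else it conservatively reports no AVX2.

```python
import platform


def cpu_flags_from_cpuinfo(text: str) -> set:
    """Extract the CPU feature flags from /proc/cpuinfo-style text."""
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx2() -> bool:
    """Best-effort AVX2 detection; returns False when it cannot tell."""
    system = platform.system()
    if system == "Linux":
        try:
            with open("/proc/cpuinfo", encoding="ascii", errors="replace") as f:
                return "avx2" in cpu_flags_from_cpuinfo(f.read())
        except OSError:
            return False
    if system == "Windows":
        import ctypes
        PF_AVX2_INSTRUCTIONS_AVAILABLE = 40  # constant from winnt.h
        return bool(ctypes.windll.kernel32.IsProcessorFeaturePresent(
            PF_AVX2_INSTRUCTIONS_AVAILABLE))
    return False


def pick_sha3_backend(avx2_available: bool) -> str:
    """Choose the fast path only when the CPU supports it (placeholder names)."""
    return "avx2-dll" if avx2_available else "internal"


if __name__ == "__main__":
    print(pick_sha3_backend(has_avx2()))
```

Older Windows versions that do not know this feature constant report it as absent, which lands on the safe fallback path.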


The developers wrote that there should be an option that needs no assembly at all:
"Note that the AVX2noAsm and AVX512noAsm targets provide alternatives to AVX2 and AVX512, respectively, without assembly implementations."
Read the section linked below; maybe it helps to implement this better.
https://github.com/XKCP/XKCP?tab=readme-ov-file#microsoft-visual-studio-support

Post by *ghisler(Author) »

This doesn't compile here; it produces several errors. My asm port works fine so far.

Post by *lelik007 »

2ghisler(Author)
OK! It's good to give SHA-3 a boost. Thank you, I'll test it in the next beta.