[Implemented] SHA-3 Speed Improvement.

Post by *lelik007 » 2025-03-29, 15:48 UTC

2ghisler(Author)
Hello, Christian! I'm back again about speed improvements, SHA-3 this time.

I measured SHA3-256 in TC v11.51 x86-64 and found that it runs at 130 MiB/s, while 7-Zip and RapidCRC Unicode reach 220 MiB/s on the same PC.

It looks like TC uses first-generation SHA-3 code, which was really slow, or maybe it was built with an old compiler.
As for the code, C/C++/Delphi implementations can be found here (developers' page):
https://keccak.team/software.html

For x86 I don't think anything can be done, but for x86-64 there's hope that you can find faster code that is suitable for TC.

Post by *ghisler(Author) » 2025-03-30, 08:28 UTC

I will have a look at it. Maybe I will use a C library in a DLL as I do for e.g. Blake3. Any recommendations?

Btw, is there a specific reason why you have to use SHA3? I ask because SHA2-256 would be much faster: I'm using Microsoft Crypto API for SHA2-256, which uses hardware acceleration. Sadly Microsoft Crypto API doesn't support SHA3.

Post by *lelik007 » 2025-03-30, 14:21 UTC

2ghisler(Author)
Maybe it's a good thing to check out first?
https://github.com/MHumm/DelphiEncryptionCompendium

Maybe I will use a C library in a DLL as I do for e.g. Blake3. Any recommendations?

In this case I'd recommend starting with the official Keccak Team code, which is here:
https://github.com/XKCP/XKCP
or, for C++, CryptoPP could do the job:
http://www.cryptopp.com/
https://github.com/weidai11/cryptopp

Btw, is there a specific reason why you have to use SHA3?

Yes, many developers nowadays provide the checksums for their products or sources only as SHA3-256, for whatever reason.
For me it's for verifying checksums, not for creating them.
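As a rough illustration of that verification use case, here is a minimal Python sketch that hashes a binary stream with SHA3-256 and compares the result with a published hex digest. It relies only on the standard `hashlib` module; the function names `sha3_256_hex` and `matches_published` are made up for this example and are not related to how TC does it.

```python
import hashlib
import io


def sha3_256_hex(stream, chunk_size=1 << 20):
    """Return the SHA3-256 hex digest of a binary stream, read in chunks."""
    h = hashlib.sha3_256()
    # Read fixed-size chunks so large files never have to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


def matches_published(stream, published_hex):
    """Compare the computed digest with a published one (case and whitespace ignored)."""
    return sha3_256_hex(stream) == published_hex.strip().lower()


# Self-check against the well-known SHA3-256 digest of empty input.
EMPTY_DIGEST = "a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a"
assert matches_published(io.BytesIO(b""), EMPTY_DIGEST)
```

For a real download, the stream would come from `open(path, "rb")`, where `path` is whatever file is being checked.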

I ask because SHA2-256 would be much faster: I'm using Microsoft Crypto API for SHA2-256, which uses hardware acceleration.

Neither of my PCs has the Intel SHA Extensions (the SHA SIMD instructions) for hardware acceleration, unfortunately, so I use BLAKE3 for my own purposes.

Hardware-accelerated code for SHA-1/SHA-256 is a different story: 1900-2200 MiB/s, depending on the PC.
BTW, Microsoft Crypto API could have SHA-1 hardware acceleration, because the same SIMD set is used for both SHA-1 and SHA-256, but SHA-1 in TC doesn't have hardware acceleration yet.

Post by *lelik007 » 2025-03-31, 06:37 UTC

2ghisler(Author)
There's actually another way to get SHA-3: if a user has OpenSSL v1.1.1 - v3 installed, or has libcrypto-1_1-x64.dll / libcrypto-3-x64.dll unpacked into the TC folder, the SHA-3 family is available there. But OpenSSL is an optional component for TC, so it can't be the main source of SHA-3, although it could serve as an optional one.

For me personally, it would be enough if TC could use the OpenSSL library v1.1.1 - v3 for SHA-3 hashing.

Post by *ghisler(Author) » 2025-04-02, 10:08 UTC

I have made some tests now with the official Keccak Team code in C you recommended.
Unfortunately it uses GCC assembler, so I can't use the assembler parts in Visual Studio.
Therefore I could only test the C implementations. Btw, my internal Delphi/Lazarus code
already uses Keccak!

Here are speed comparisons for SHA3-256 on a 6.5 GByte file:
internal: 30.1s
C code, ref-64bits: 144.8s
C code, plain-64bit: 28.4s
C code, avx-512: 13.7s

So the reference implementation is very slow, and the plain-64bit implementation is only slightly faster than my internal code.
The avx512 implementation is about 2.2 times faster than my internal code (30.1s vs. 13.7s), but still less than half as fast as SHA2_256 (6.1 seconds).
Unfortunately, recent Intel desktop processors no longer support avx-512 because the efficiency cores are missing it.
Therefore I would prefer to use the avx-2 implementation, but it's in assembly code which MASM doesn't understand.
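To compare these timings with the MiB/s figures quoted elsewhere in the thread, they can be converted to throughput. The sketch below assumes that "6.5 GByte" means 6.5 GiB (6656 MiB); the exact file size is not given, so the results are approximate.

```python
MIB = 1024 * 1024

# Assumption: the "6.5 GByte" test file is taken as exactly 6.5 GiB.
FILE_SIZE_BYTES = int(6.5 * 1024**3)

# Wall-clock times in seconds, as reported in the post above.
timings = {
    "internal": 30.1,
    "ref-64bits": 144.8,
    "plain-64bit": 28.4,
    "avx-512": 13.7,
    "SHA2_256 (Crypto API)": 6.1,
}


def throughput_mib_s(size_bytes, seconds):
    """Convert a file size and a hashing time into MiB per second."""
    return size_bytes / MIB / seconds


rates = {name: throughput_mib_s(FILE_SIZE_BYTES, s) for name, s in timings.items()}
speedup_avx512 = timings["internal"] / timings["avx-512"]

for name, rate in rates.items():
    print(f"{name:>22}: {rate:7.1f} MiB/s")
print(f"avx-512 vs. internal: {speedup_avx512:.2f}x")
```

Under that size assumption the internal code lands a little above 220 MiB/s and the avx-512 build a little above 485 MiB/s.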

Post by *lelik007 » 2025-04-02, 13:35 UTC

2ghisler(Author)
All I see in 7-Zip for SHA3-256 is 225 MiB/s on an i7-2600K (no AVX2) and 408 MiB/s on an i5-10500 (AVX2, but no AVX-512).
And these binaries give about the same result:
https://keccak.team/files/KeccakSum-binaries-715fbb4d.zip

This Delphi code isn't faster? It's listed as the "New SHA-3 permutation kernel" by Eric Grange on the Keccak Team page.
https://bitbucket.org/egrange/dwscript/src/master/Libraries/CryptoLib/
https://bitbucket.org/egrange/dwscript/src/master/Libraries/CryptoLib/dwsSHA3.pas

CryptoPP (https://cryptopp.com/) is well known and used in various software. It's C++, but it can be compiled with Visual Studio 2003 - 2022.

I ask because SHA2-256 would be much faster: I'm using Microsoft Crypto API for SHA2-256, which uses hardware acceleration.

My question still is: why didn't you apply hardware acceleration via Microsoft Crypto API for SHA-1, then? Maybe Microsoft Crypto API doesn't have hardware acceleration for SHA-1?

Post by *ghisler(Author) » 2025-04-03, 07:21 UTC

This Delphi code isn't faster?

No, unfortunately it isn't. I have now tried it, and repeated the tests with the entire file in cache:
old internal Delphi code: 30.1s
new internal Delphi code: 29.8s
That's 222 MiB/s on my i7-11700 (no K, 65W version), so about the same as what you get.
So although the two codes look quite different, they perform essentially the same. They are also on par with the plain-64bit C code.
Only the hardware accelerated codes seem to be faster.

My question still is: why didn't you apply hardware acceleration via Microsoft Crypto API for SHA-1, then?

I'm using crypto API for MD5, SHA1, and SHA2 (except for SHA224, which isn't supported).

Post by *lelik007 » 2025-04-03, 08:52 UTC

2ghisler(Author)

Only the hardware accelerated codes seem to be faster.

Yes, but of course they can't compete with a dedicated SIMD set like the one SHA-1/SHA-256 has.

That's 222 MiB/s on my i7-11700 (no K, 65W version), so about the same as what you get.

The AVX2 version from here should perform better, around 400-420 MiB/s.
https://keccak.team/files/KeccakSum-binaries-715fbb4d.zip
11th Gen is the only Intel desktop CPU generation that has AVX-512F; the previous and next generations don't have this set.
And as you measured:

C code, avx-512: 13.7s

= 485 MiB/s. Yes, this is what I expected: not 1-2 GiB/s for sure, but a speed of 400-500 MiB/s at least keeps up with a SATA-3 SSD.

But of course no SHA-3 variant is so widespread that I would need it every day, for example.

I'm using crypto API for MD5, SHA1, and SHA2 (except for SHA224, which isn't supported).

Thank you. So I take it that Microsoft Crypto API doesn't have hardware acceleration for SHA-1.

Post by *ghisler(Author) » 2025-04-04, 18:57 UTC

OK, I have converted the GNU Assembler code to Intel with the following trick:
1. Compiled the KeccakP-1600-AVX2.s code with GNU assembler (as) on Ubuntu to a.out
2. Disassembled a.out on Windows with objconv.exe: objconv.exe -fmasm a.out a.asm
3. Now the real "fun" started: GCC uses a different calling convention than Microsoft Visual C++, so I had to move the parameters between registers in various functions. While GCC passes the first 6 parameters in registers RDI, RSI, RDX, RCX, R8, and R9, VCC64 passes the first four in RCX, RDX, R8, R9.
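The register shuffle from step 3 can be sketched as a small generator for the entry moves of a wrapper. This is only an illustration under stated assumptions, not the actual port: `shim_moves` is a made-up helper, it covers only the first four integer or pointer arguments, and it ignores stack arguments. A real wrapper would also have to preserve RDI and RSI, and XMM6-XMM15 if the code touches them, because the Microsoft x64 convention treats those as callee-saved while the System V convention used by GCC does not.

```python
# Integer/pointer argument registers, in argument order.
SYSV_INT_ARGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]  # System V AMD64 (GCC on Linux)
MS_X64_INT_ARGS = ["rcx", "rdx", "r8", "r9"]              # Microsoft x64 (Visual C++)


def shim_moves(n_args):
    """Return MASM-style moves that copy Microsoft-style arguments
    into the registers that System V code expects (hypothetical helper)."""
    if not 0 <= n_args <= len(MS_X64_INT_ARGS):
        raise ValueError("only 0..4 register-passed arguments are handled")
    moves = []
    for i in range(n_args):
        dst, src = SYSV_INT_ARGS[i], MS_X64_INT_ARGS[i]
        # Emitting in argument order is safe only if no destination
        # overwrites a source that a later move still has to read.
        assert dst not in MS_X64_INT_ARGS[i + 1:n_args]
        moves.append(f"mov {dst}, {src}")
    return moves


for line in shim_moves(4):
    print(line)
```

Emitting the moves in argument order works here because each destination register is either not a Microsoft argument register at all (RDI, RSI) or one whose value has already been copied (RDX, RCX).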

Result: It works! It's not as impressive as avx-512, but 16.1s for hashing 6.5GB is almost twice as fast as the internal code. And the DLL is just 90 kBytes, so I will definitely include it in the next 64-bit version.

Post by *lelik007 » 2025-04-04, 20:13 UTC

2ghisler(Author)

It's not as impressive as avx-512, but 16.1s for hashing 6.5GB is twice as fast as the internal code.

This is about the same result I got with the i5-10500, ~410 MiB/s. I hope it has a fallback for users with older CPUs without AVX2.

The devs wrote that there is an alternative that needs no assembly:

Note that the AVX2noAsm and AVX512noAsm targets provide alternatives to AVX2 and AVX512, respectively, without assembly implementations.

See the section linked below; maybe it helps to implement this better.
https://github.com/XKCP/XKCP?tab=readme-ov-file#microsoft-visual-studio-support

Post by *ghisler(Author) » 2025-04-04, 21:01 UTC

This doesn't compile here, it returns several errors. My asm port works fine so far.

Post by *lelik007 » 2025-04-05, 03:45 UTC

2ghisler(Author)
Ok! It's a good thing to give SHA-3 some boost, thank you, I'll test it during the next beta.

Post by *lelik007 » 2025-05-08, 16:10 UTC

2ghisler(Author)
I've just checked the new SHA-3 DLL in TC 11.55 RC on an i7-2600K, which has no AVX2. The code seems to have no fallback: TC just closes.

Post by *ghisler(Author) » 2025-05-08, 16:24 UTC

That's odd, I'm using the CPUID instruction to check whether the processor supports AVX2 or not. Does the i7-2600k support AVX?

Just rename or delete the tcsha64.dll for now to use internal SHA3.

Post by *lelik007 » 2025-05-08, 16:29 UTC

2ghisler(Author)

Does the i7-2600k support AVX?

Yes, it has AVX set but doesn't have AVX2.
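The difference between the AVX and AVX2 flags matters for the check discussed above, since the two are reported in different CPUID leaves. The sketch below shows only the bit logic and makes no claim about what TC's code does. It assumes the raw register values (CPUID leaf 1 ECX, CPUID leaf 7 subleaf 0 EBX, and XCR0 from XGETBV) have already been read by native code; the function names are invented for this example.

```python
def os_saves_ymm(leaf1_ecx, xcr0):
    """True if the OS has enabled XSAVE (CPUID.1:ECX bit 27) and saves
    both SSE and AVX state (XCR0 bits 1 and 2)."""
    osxsave = (leaf1_ecx >> 27) & 1
    return bool(osxsave) and (xcr0 & 0b110) == 0b110


def has_avx(leaf1_ecx, xcr0):
    """AVX: CPUID.1:ECX bit 28, plus OS support for the YMM registers."""
    return bool((leaf1_ecx >> 28) & 1) and os_saves_ymm(leaf1_ecx, xcr0)


def has_avx2(max_leaf, leaf1_ecx, leaf7_ebx, xcr0):
    """AVX2: CPUID.(EAX=7,ECX=0):EBX bit 5, on top of usable AVX."""
    if max_leaf < 7:
        return False  # leaf 7 is not available on this CPU
    return has_avx(leaf1_ecx, xcr0) and bool((leaf7_ebx >> 5) & 1)


# Illustrative values for a CPU with AVX but without AVX2.
avx_only = dict(max_leaf=0xD, leaf1_ecx=(1 << 27) | (1 << 28), leaf7_ebx=0, xcr0=0b111)
print(has_avx(avx_only["leaf1_ecx"], avx_only["xcr0"]))  # AVX usable
print(has_avx2(**avx_only))                              # AVX2 not usable
```

A check that looks only at the AVX bit in leaf 1 would pass on an AVX-only CPU, so AVX2 code needs the separate leaf 7 test.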
