[Implemented] Faster CRC32 method for SFV computation.

lelik007 · Post by *lelik007 » 2024-10-11, 11:09 UTC

Christian I'm sorry for maybe being too late for this 11.50 beta but I should address this. TC uses as I think standard CRC32 implementation by Dilip V. Sarwate (maybe in Assembly) for SFV computation and it gives to 375 Mbyte/s according to this https://create.stephan-brumme.com/crc32 and my personal measurements.

So why not to speed CRC32 up in TC? You can use Slicing-by-N for example.

If I understand the things right the method with the sliced tables was brought by Intel and initially was for CRC32c(Castagnoli) but then was adopted for CRC32:
https://sourceforge.net/projects/slicing-by-8/

Slicing-by-N algos can contain literally the lookup tables in their code - table-based: Zlib, libdeflate (crc32_tables.h for example), Fast CRC32 by Stephan Brumme or tableless where the lookup tables are sliced yet calculated by the algos: 7-zip - Slicing-by-12, UnRAR- Slicing-by-16.
But every method based on Intel Slicing-by-8 gives ~ 1000 Mbyte/s.

Any Slicing-by-N method with Stephan Brumme's optimizations is available here. He's a relatively known researcher and a collector of CRC32 methods:
https://github.com/stbrumme/crc32

And as browny said below libdeflate and Zlib use their own methods based on Intel Slicing-by-8 similar to Fast CRC32 by Stephan Brumme with just the same lookup tables.

You can choose either one.

As for x32 I don't see Slicing-by-8 does any harm for RapidCRC Unicode x32 though it performs a bit slower ~ 900 Mbyte/s. In this case I think maybe even Slicing-by-4 could be a better option or just an old code. For x64 and the modern machines Slicing-by-16 can give a better performance but Slicing-by-8 is more conservative.

browny · Post by *browny » 2024-10-11, 19:12 UTC

lelik007 wrote: 2024-10-11, 11:09 UTC TC uses as I think standard CRC32 implementation by Dilip V. Sarwate (maybe in Assembly)

Why you think so?
Probably in every decent library calculation is table-based (zlib, libdeflate).
LIBDEFLATE64.DLL exports libdeflate_crc32, and TC imports this function.

JOUBE · Post by *JOUBE » 2024-10-12, 06:01 UTC

Really, CRC32 improvements: who cares...

lelik007 · Post by *lelik007 » 2024-10-12, 08:22 UTC

2browny
IDK what exact code is used in TC's checksum module, that was my guess. But I measured right now, Mbyte/s means exactly it: Megabytes / Time and it's about 330 Mbyte/s, if it were something table-based it'd be more like ~ 1000 Mbyte/s minimum. And, yes, you are right, libdeflate has Slicing-by-8 code but It doesn't mean it affects TC's checksum module in any way.

And this is exactly why I started this topic, especially if libdeflate has Slicing-by-8 (I didn't know it) and 7za.dll has Igor's implementation (how it's calculated for 7z archives, otherwise) why do we still have so slow method for the actual checksum module.

2JOUBE
At least this is me.

browny · Post by *browny » 2024-10-12, 11:23 UTC

Without knowledge you could test only program performance which is in any case severely hindered by file system (most probably, synchronous disk I/O). This cannot be compared directly to method perfomance.
For the reference, average (or maybe higher than average) HDD might give 200 MB/s peak and substantially lower sustained rate for file operations.
300 MB/s suggests a very efficient implementation.

lelik007 · Post by *lelik007 » 2024-10-12, 14:01 UTC

2browny
I'll show you then. 2 builds of RapidCRC Unicode with the only change log:
---
v0.3.40
posted on 03/29/2024 - 16:55
Changed crc implementation to slice-by-8 from https://github.com/stbrumme/crc32
Enabled SSE4.1 for Blake2sp on x64 builds
---
I asked the developer to make these changes because up to v0.3.39 there was compact tableless Assembly code.
[img]https://www.upload.ee/thumb/17231641/CRC32comparison.jpg[/img]
And 7-zip:
[img]https://www.upload.ee/thumb/17254822/7-zipCRC32.jpg[/img]

It's just an old code in TC and maybe Christian will decide to update it at some point I'm just showing what we'll get.

lelik007 · Post by *lelik007 » 2024-11-20, 17:12 UTC

2ghisler(Author)
Thank you Christian, this is what I meant, now CRC32 has the same speed as BLAKE3 without AVX2 (if there's no this SIMD set in CPU) to ~ 1200 Mbyte/s, of course with AVX2 set BLAKE3 is ~ 1800 Mbyte/s.

Post by *ghisler(Author) » 2024-11-20, 17:38 UTC

Yes, it's astonishing, I had to test on my primary M.2 SSD (4 lane PCIe 4.0) because it was faster than my data SSD (SATA, 500MB/s)...

lelik007 · Post by *lelik007 » 2024-11-20, 18:08 UTC

2ghisler(Author)
Igor Pavlov actually has a big brother: Slicing-by-12, 12 Kb Tables (Assembly Code with a table generator) but I don't know how to call this function out from 7z.dll.
https://www.upload.ee/image/17254822/7-zipCRC32.jpg
But as for me, I'll be more than happy with libdeflate's Slicing-by-8 because we'd already had it before I even asked and 1100-1200 Mbytes/s is a nice speed to have.

lelik007 · Post by *lelik007 » 2024-11-20, 18:21 UTC

2ghisler(Author)
I think this topic can be closed as "Implemented" now.

Post by *white » 2024-11-20, 20:19 UTC

lelik007 wrote: 2024-11-20, 18:21 UTC I think this topic can be closed as "Implemented" now.

By editing the subject of the first post, you can change the subject of the thread yourself.

lelik007 · Post by *lelik007 » 2024-11-21, 07:03 UTC

2white
Thank you, I understood and did it.

Total Commander

[Implemented] Faster CRC32 method for SFV computation.

[Implemented] Faster CRC32 method for SFV computation.

Re: Faster CRC32 Method.

Re: Faster CRC32 Method.

Re: Faster CRC32 Method.

Re: Faster CRC32 Method.

Re: Faster CRC32 Method.

Re: Faster CRC32 method for SFV computation.

Re: Faster CRC32 method for SFV computation.

Re: Faster CRC32 method for SFV computation.

Re: Faster CRC32 method for SFV computation.

Re: Faster CRC32 method for SFV computation.

Re: [Implemented] Faster CRC32 method for SFV computation.