[Implemented] Faster CRC32 method for SFV computation.
Moderators: Hacker, petermad, Stefan2, white
[Implemented] Faster CRC32 method for SFV computation.
Christian I'm sorry for maybe being too late for this 11.50 beta but I should address this. TC uses as I think standard CRC32 implementation by Dilip V. Sarwate (maybe in Assembly) for SFV computation and it gives to 375 Mbyte/s according to this https://create.stephan-brumme.com/crc32 and my personal measurements.
So why not to speed CRC32 up in TC? You can use Slicing-by-N for example.
If I understand the things right the method with the sliced tables was brought by Intel and initially was for CRC32c(Castagnoli) but then was adopted for CRC32:
https://sourceforge.net/projects/slicing-by-8/
Slicing-by-N algos can contain literally the lookup tables in their code - table-based: Zlib, libdeflate (crc32_tables.h for example), Fast CRC32 by Stephan Brumme or tableless where the lookup tables are sliced yet calculated by the algos: 7-zip - Slicing-by-12, UnRAR- Slicing-by-16.
But every method based on Intel Slicing-by-8 gives ~ 1000 Mbyte/s.
Any Slicing-by-N method with Stephan Brumme's optimizations is available here. He's a relatively known researcher and a collector of CRC32 methods:
https://github.com/stbrumme/crc32
And as browny said below libdeflate and Zlib use their own methods based on Intel Slicing-by-8 similar to Fast CRC32 by Stephan Brumme with just the same lookup tables.
You can choose either one.
As for x32 I don't see Slicing-by-8 does any harm for RapidCRC Unicode x32 though it performs a bit slower ~ 900 Mbyte/s. In this case I think maybe even Slicing-by-4 could be a better option or just an old code. For x64 and the modern machines Slicing-by-16 can give a better performance but Slicing-by-8 is more conservative.
So why not to speed CRC32 up in TC? You can use Slicing-by-N for example.
If I understand the things right the method with the sliced tables was brought by Intel and initially was for CRC32c(Castagnoli) but then was adopted for CRC32:
https://sourceforge.net/projects/slicing-by-8/
Slicing-by-N algos can contain literally the lookup tables in their code - table-based: Zlib, libdeflate (crc32_tables.h for example), Fast CRC32 by Stephan Brumme or tableless where the lookup tables are sliced yet calculated by the algos: 7-zip - Slicing-by-12, UnRAR- Slicing-by-16.
But every method based on Intel Slicing-by-8 gives ~ 1000 Mbyte/s.
Any Slicing-by-N method with Stephan Brumme's optimizations is available here. He's a relatively known researcher and a collector of CRC32 methods:
https://github.com/stbrumme/crc32
And as browny said below libdeflate and Zlib use their own methods based on Intel Slicing-by-8 similar to Fast CRC32 by Stephan Brumme with just the same lookup tables.
You can choose either one.
As for x32 I don't see Slicing-by-8 does any harm for RapidCRC Unicode x32 though it performs a bit slower ~ 900 Mbyte/s. In this case I think maybe even Slicing-by-4 could be a better option or just an old code. For x64 and the modern machines Slicing-by-16 can give a better performance but Slicing-by-8 is more conservative.
Last edited by lelik007 on 2024-11-21, 07:02 UTC, edited 28 times in total.
Re: Faster CRC32 Method.
Why you think so?lelik007 wrote: 2024-10-11, 11:09 UTC TC uses as I think standard CRC32 implementation by Dilip V. Sarwate (maybe in Assembly)
Probably in every decent library calculation is table-based (zlib, libdeflate).
LIBDEFLATE64.DLL exports libdeflate_crc32, and TC imports this function.
Re: Faster CRC32 Method.
Really, CRC32 improvements: who cares...
Re: Faster CRC32 Method.
2browny
IDK what exact code is used in TC's checksum module, that was my guess. But I measured right now, Mbyte/s means exactly it: Megabytes / Time and it's about 330 Mbyte/s, if it were something table-based it'd be more like ~ 1000 Mbyte/s minimum. And, yes, you are right, libdeflate has Slicing-by-8 code but It doesn't mean it affects TC's checksum module in any way.
And this is exactly why I started this topic, especially if libdeflate has Slicing-by-8 (I didn't know it) and 7za.dll has Igor's implementation (how it's calculated for 7z archives, otherwise) why do we still have so slow method for the actual checksum module.
2JOUBE
At least this is me.
IDK what exact code is used in TC's checksum module, that was my guess. But I measured right now, Mbyte/s means exactly it: Megabytes / Time and it's about 330 Mbyte/s, if it were something table-based it'd be more like ~ 1000 Mbyte/s minimum. And, yes, you are right, libdeflate has Slicing-by-8 code but It doesn't mean it affects TC's checksum module in any way.
And this is exactly why I started this topic, especially if libdeflate has Slicing-by-8 (I didn't know it) and 7za.dll has Igor's implementation (how it's calculated for 7z archives, otherwise) why do we still have so slow method for the actual checksum module.
2JOUBE
At least this is me.
Last edited by lelik007 on 2024-10-19, 16:11 UTC, edited 5 times in total.
Re: Faster CRC32 Method.
Without knowledge you could test only program performance which is in any case severely hindered by file system (most probably, synchronous disk I/O). This cannot be compared directly to method perfomance.
For the reference, average (or maybe higher than average) HDD might give 200 MB/s peak and substantially lower sustained rate for file operations.
300 MB/s suggests a very efficient implementation.
For the reference, average (or maybe higher than average) HDD might give 200 MB/s peak and substantially lower sustained rate for file operations.
300 MB/s suggests a very efficient implementation.
Last edited by browny on 2024-10-14, 16:36 UTC, edited 1 time in total.
Re: Faster CRC32 Method.
2browny
I'll show you then. 2 builds of RapidCRC Unicode with the only change log:
---
v0.3.40
posted on 03/29/2024 - 16:55
Changed crc implementation to slice-by-8 from https://github.com/stbrumme/crc32
Enabled SSE4.1 for Blake2sp on x64 builds
---
I asked the developer to make these changes because up to v0.3.39 there was compact tableless Assembly code.
[img]https://www.upload.ee/thumb/17231641/CRC32comparison.jpg[/img]
And 7-zip:
[img]https://www.upload.ee/thumb/17254822/7-zipCRC32.jpg[/img]
It's just an old code in TC and maybe Christian will decide to update it at some point I'm just showing what we'll get.
I'll show you then. 2 builds of RapidCRC Unicode with the only change log:
---
v0.3.40
posted on 03/29/2024 - 16:55
Changed crc implementation to slice-by-8 from https://github.com/stbrumme/crc32
Enabled SSE4.1 for Blake2sp on x64 builds
---
I asked the developer to make these changes because up to v0.3.39 there was compact tableless Assembly code.
[img]https://www.upload.ee/thumb/17231641/CRC32comparison.jpg[/img]
And 7-zip:
[img]https://www.upload.ee/thumb/17254822/7-zipCRC32.jpg[/img]
It's just an old code in TC and maybe Christian will decide to update it at some point I'm just showing what we'll get.
Re: Faster CRC32 method for SFV computation.
2ghisler(Author)
Thank you Christian, this is what I meant, now CRC32 has the same speed as BLAKE3 without AVX2 (if there's no this SIMD set in CPU) to ~ 1200 Mbyte/s, of course with AVX2 set BLAKE3 is ~ 1800 Mbyte/s.
Thank you Christian, this is what I meant, now CRC32 has the same speed as BLAKE3 without AVX2 (if there's no this SIMD set in CPU) to ~ 1200 Mbyte/s, of course with AVX2 set BLAKE3 is ~ 1800 Mbyte/s.
- ghisler(Author)
- Site Admin
- Posts: 50386
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Re: Faster CRC32 method for SFV computation.
Yes, it's astonishing, I had to test on my primary M.2 SSD (4 lane PCIe 4.0) because it was faster than my data SSD (SATA, 500MB/s)...
Author of Total Commander
https://www.ghisler.com
https://www.ghisler.com
Re: Faster CRC32 method for SFV computation.
2ghisler(Author)
Igor Pavlov actually has a big brother: Slicing-by-12, 12 Kb Tables (Assembly Code with a table generator) but I don't know how to call this function out from 7z.dll.
https://www.upload.ee/image/17254822/7-zipCRC32.jpg
But as for me, I'll be more than happy with libdeflate's Slicing-by-8 because we'd already had it before I even asked and 1100-1200 Mbytes/s is a nice speed to have.
Igor Pavlov actually has a big brother: Slicing-by-12, 12 Kb Tables (Assembly Code with a table generator) but I don't know how to call this function out from 7z.dll.
https://www.upload.ee/image/17254822/7-zipCRC32.jpg
But as for me, I'll be more than happy with libdeflate's Slicing-by-8 because we'd already had it before I even asked and 1100-1200 Mbytes/s is a nice speed to have.
Last edited by lelik007 on 2024-11-21, 07:06 UTC, edited 3 times in total.
Re: Faster CRC32 method for SFV computation.
2ghisler(Author)
I think this topic can be closed as "Implemented" now.
I think this topic can be closed as "Implemented" now.
Re: Faster CRC32 method for SFV computation.
By editing the subject of the first post, you can change the subject of the thread yourself.
Re: [Implemented] Faster CRC32 method for SFV computation.
2white
Thank you, I understood and did it.
Thank you, I understood and did it.