[Implemented] Faster CRC32 method for SFV computation.
Posted: 2024-10-11, 11:09 UTC
Christian, I'm sorry if this is too late for the 11.50 beta, but I should address this. As far as I can tell, TC uses the standard CRC32 implementation by Dilip V. Sarwate (possibly in Assembly) for SFV computation, and it reaches up to ~375 MByte/s according to https://create.stephan-brumme.com/crc32 and my own measurements.
So why not speed up CRC32 in TC? You could use Slicing-by-N, for example.
If I understand things correctly, the sliced-table method was introduced by Intel, initially for CRC32C (Castagnoli), and was later adapted for CRC32:
https://sourceforge.net/projects/slicing-by-8/
Slicing-by-N implementations come in two flavors. Table-based ones ship the lookup tables literally in their code: Zlib, libdeflate (see crc32_tables.h, for example), and Fast CRC32 by Stephan Brumme. Tableless ones compute the sliced lookup tables at runtime: 7-Zip (Slicing-by-12) and UnRAR (Slicing-by-16).
Either way, every method based on Intel's Slicing-by-8 gives roughly 1000 MByte/s.
Every Slicing-by-N variant with Stephan Brumme's optimizations is available here; he's a fairly well-known researcher and collector of CRC32 methods:
https://github.com/stbrumme/crc32
And as browny said below, libdeflate and Zlib use their own methods based on Intel's Slicing-by-8, similar to Fast CRC32 by Stephan Brumme and with the same lookup tables.
You can choose either one.
As for x32, I don't see Slicing-by-8 doing any harm in RapidCRC Unicode x32, though it runs a bit slower there, ~900 MByte/s. In that case, Slicing-by-4 might even be the better option, or simply the old code. On x64 and modern machines, Slicing-by-16 can give better performance, but Slicing-by-8 is the more conservative choice.