Could you add xxHash (xxh3, the fastest hash algorithm)?
Moderators: white, Hacker, petermad, Stefan2
- ghisler(Author)
- Site Admin
- Posts: 48083
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
I have now added BLAKE3 to Total Commander 10.50 beta 1. Unfortunately the C version is not multi-threaded, but it's already much faster than all the other hash algorithms. The multi-threaded version is written in Rust, a language I don't have any experience with. I don't even know whether it's possible to create DLLs with it.
Regarding xxHash, I have read some more on it, and the relatively short hash of 32 or 64 bits is meant for hash tables, not for verifying the integrity of a file. Why? In a hash table, collisions (the same hash for different inputs) are expected and handled, either by using the next free slot in the table, or by chaining (a hash table slot linking to multiple entries). xxHash3 also seems to support a 128-bit hash, but it's relatively new - I don't know the exact internals.
Here is an interesting analysis of xxHash:
https://encode.su/threads/2556-Improving-xxHash
I don't know whether any of these improvements have been included in xxHash2 or xxHash3; if not, then xxHash isn't suited for verifying the integrity of files. At least the 64-bit xxHash seems to use four 64-bit accumulators now, which would be a big improvement:
https://github.com/Cyan4973/xxHash/blob/dev/doc/xxhash_spec.md
Unfortunately the document still doesn't explain the inner workings of xxhash128, although it was requested 3 months ago. Does anyone here have details on the xxhash128 algorithm?
Author of Total Commander
https://www.ghisler.com
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
If you want a good comparison benchmark, you can take a look at SMHasher.
For the xxh128 algorithm, you should look at the function XXH3_hashLong_128b in the reference C implementation. It is a reference implementation and does not have the performance optimizations of the official 0.8.1 release.
For a textual definition of xxHash you can read the documentation; from what I have read, it is general enough that it should also apply to xxh128.
- ghisler(Author)
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
Unfortunately the documentation doesn't cover the 128-bit xxHash yet. This was also mentioned in the GitHub issue reports.
From the source, it seems to use two 64-bit accumulators if I understand it correctly.
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
From the table at https://github.com/rurban/smhasher
I see:
xxh3low, xxh128low, xxh128, xxHash64. These have no quality issues listed. The first is the fastest, the last the slowest.
Also CityCrc128 is a good pick.
Any cryptographic hash like Blake will be slower than non-crypto hashes; they need many more CPU cycles. Multi-threading matters, of course. It sort of seems that xxh supports multi-threading, but it's not clear, and I don't see any command-line setting for it (unlike Blake3).
- ghisler(Author)
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
I have read a lot about these hashes before deciding to add Blake.
As I understand it, xxHash32 and xxHash64 are not suited for file hashes because their output space is too small. They are meant as traditional hash functions, where you can re-hash in case of a hash collision. There is no such workaround when two different files produce the same hash. xxHash128 may be suited for file hashing, but I cannot find any detailed description of the hash algorithm used. If it just uses two xxHash64 hashes, then it is definitely NOT suited for file hashing! The limiting factor isn't the output size, it's the size of the accumulators within the hash algorithm. They need to be at least 128 bits wide too, otherwise you lose entropy.
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
2ghisler(Author)
And what about prvhash64s and Discohash that I mentioned earlier? They are even faster. It would be great to use them for comparison and synchronization.
Overquoting is evil! 👎
- ghisler(Author)
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
PRVHASH is patented, at least in Russia, according to the linked document. The document describes it as suited for use as a pseudo-random number generator, but does not mention file hashing.
Discohash looks interesting, but it's very new (2020) and unproven, and the 128-bit hash seems to be created from two 64-bit hashes, so it's questionable whether the internal entropy is really 128 bits, or just 64.
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
I tested them, but the results are only so-so.
According to SMHasher tests, these functions are the fastest of the reliable ones:
Hash function | MiB/sec | cycl./hash | cycl./map | size | Quality problems |
xxh3low | 16462.36 | 32.77 | 199.79 (2) | 756 | |
halftime_hash128 | 13478.23 | 97.79 | 252.14 (2) | 2462 | |
halftime_hash256 | 11620.28 | 98.44 | 252.60 (2) | 2622 | |
halftime_hash512 | 7681.62 | 125.81 | 274.01 (3) | 3550 | |
wyhash32low | 12911.09 | 29.59 | 205.43 (2) | 474 | 2 bad seeds |
komihash | 10444.53 | 39.55 | 176.50 (1) | 728 | |
ahash64 | 9862.62 | 27.32 | 181.68 (1) | 412 | rust |
Spooky32 | 9747.13 | 62.24 | 196.96 (4) | 2221 | UB |
Spooky64 | 9747.47 | 62.20 | 191.71 (2) | 2221 | UB |
Spooky128 | 9751.14 | 63.84 | 192.47 (2) | 2221 | UB |
t1ha2_atonce | 9237.12 | 38.94 | 194.32 (2) | 541 | Zeroes low3 |
t1ha2_atonce128 | 8350.99 | 55.65 | 203.53 (4) | 613 | LongNeighbors |
pengyhash | 8744.48 | 85.31 | 222.45 (4) | 421 | |
nmhash32 | 7850.01 | 56.74 | 207.59 (1) | 2445 | |
mx3 | 6146.02 | 52.48 | 173.09 (3) | 734 | UB |
Maybe take a look at some of them? HalftimeHash, komihash, Spooky for example?
- ghisler(Author)
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
Another problem is that 99.999% of people have never heard of any of these hashes, except for maybe xxHash.
The problem is that the only variant interesting for file hashing, xxhash128, is still not documented:
https://github.com/Cyan4973/xxHash/issues/589
This makes its quality very questionable. Even the stone-age hash md5 creates 128-bit hashes and works internally with a 128-bit accumulator (4x32 bit) with good entropy. It is cryptographically broken, but that doesn't matter if you only want to verify integrity (that the file wasn't damaged).
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
I have to agree here that commonly known and independently assessed hash functions should be preferred. Remember, these would be used in TC for a long time; it's not as if we could swap in a better hash the next month.
Roman
Just suppose you press Ctrl+F, select the FTP connection (with saved password), but instead of clicking Connect, you drop dead.
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
Roman, Christian, the hash functions are not only used when the user asks for a hash calculation, but also internally.
A hash is used internally for content compare and to verify copies (to my knowledge, it is md5). This should be as efficient as possible, and it need not be exposed for creating hash files if you think that is not generally relevant.
A tc.hash_<Name> wdx-like column could be nice.
fclones, which I use on Linux, is much faster than TC at finding duplicates on the same external drive. To me, this is probably due to the efficiency of the hash used (murmurhash vs md5).
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
nsp,
nsp wrote: fclones which i use on linux is much faster than TC on same external drive to find duplicates. To me, this is probably due to efficiency of the hash used (murmurhash vs md5).
I don't believe a hash can be slower than reading from an external drive. If it is, I agree, it should not be used.
Roman
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
ghisler(Author) wrote: ↑2022-06-01, 07:06 UTC Another problem is that 99.999% of people have never heard of any of these hashes
This is not a problem for the tasks I mentioned, where TC would gain significant speed during comparison and synchronization. If, on top of that, options for computing hash sums are added, that simply becomes the users' choice. A hash only becomes widespread through its use in services and popular applications like Total Commander; no hash would have become popular without well-known software adopting it. I think there's no point in waiting for others to do it first.
OFF: Please answer in this topic.
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
Hacker wrote: ↑2022-06-01, 11:34 UTC nsp, I don't believe a hash can be slower than reading from an external drive. If it is, I agree, it should not be used.
We use USB3 or Thunderbolt NVMe SSD external drives (faster than the internal HD of my work laptop).
You can compare for yourself the time TC needs to calculate an md5sum of a big file with the time FCHash needs for xxHash3.
Re: Could you add xxHash (xxh3, the fastest hash algorithm)?
Fla$her,
Fla$her wrote: This is not a problem for the tasks I have indicated with a significant TC acceleration during comparison and synchronization.
The hashes still have to be reliable, i.e. well documented, tested, and at least somewhat time-proven.
nsp,
nsp wrote: You can compare yourself time used to calculate md5sum with TC on a big file and time used by FCHash for xxhash3.
FcHash xxHash3 - 12 sec
TC Blake3 - 11 sec
TC MD5 - 29 sec
So it seems the best solution, at least for now, would be to switch to Blake3 for TC's internal functions?
Roman