Search duplicates by same content very very slow
MVV wrote: But as you can see from my quick test, even memory compare may be done in different ways, and two methods may be quite different in speed (up to 10 times).
It is a commonplace that the low-level functions (memcmp/memmove/memcpy) come close to the theoretical memory bandwidth.
(see e.g. http://nadeausoftware.com/articles/2012/05/c_c_tip_how_copy_memory_quickly )
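Just to illustrate why two compare methods can differ that much (a minimal sketch, not TC's actual code): a plain byte-by-byte loop with an early return is hard for a compiler to auto-vectorize, while the library memcmp is usually SIMD-optimized and gets close to memory bandwidth.
[code]
#include <cstddef>
#include <cstring>

// Naive byte-by-byte comparison: one load and compare per byte; the early
// return makes it hard for the compiler to auto-vectorize this loop.
bool equal_bytewise(const unsigned char* a, const unsigned char* b, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        if (a[i] != b[i])
            return false;
    return true;
}

// Library comparison: memcmp is typically implemented with wide (SSE/AVX)
// loads and can run close to the memory bandwidth.
bool equal_memcmp(const void* a, const void* b, std::size_t n)
{
    return std::memcmp(a, b, n) == 0;
}
[/code]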
Using the checksum method isn't such a bad idea.
The problem is that you need a good hash function (SHA-224 and above),
but those functions have a calculation performance of only ~100-200 MB/s in a single thread,
and you effectively have to halve that, since the hash must be calculated for both files of a pair (at ~200 MB/s per file that leaves ~100 MB/s per pair).
TC already uses SHA-1 for comparing files from archives and for the mentioned case of 3 or more files of the same size,
but the calculation performance also limits such comparisons in case of fast SSDs.
In any case, making the compare-by-content (CBC) operations configurable can't hurt, and would help in cases like the OP's.
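For reference, a checksum-based duplicate search boils down to grouping by size first and only then by hash; a minimal sketch (the FNV-1a checksum is just a stand-in for a real hash such as SHA-1/SHA-256, and all names are made up):
[code]
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <map>
#include <vector>

// Stand-in checksum (FNV-1a over the whole file); a real implementation
// would use a strong hash such as SHA-1/SHA-256 here.
std::uint64_t file_checksum(const std::filesystem::path& p)
{
    std::ifstream in(p, std::ios::binary);
    std::uint64_t h = 1469598103934665603ull;
    char buf[64 * 1024];
    while (in.read(buf, sizeof buf) || in.gcount() > 0)
        for (std::streamsize i = 0; i < in.gcount(); ++i)
            h = (h ^ static_cast<unsigned char>(buf[i])) * 1099511628211ull;
    return h;
}

// Group candidates first by size (cheap), then by checksum; only files
// still sharing a bucket are possible duplicates.
std::vector<std::vector<std::filesystem::path>>
find_duplicate_candidates(const std::vector<std::filesystem::path>& files)
{
    std::map<std::uintmax_t, std::vector<std::filesystem::path>> by_size;
    for (const auto& f : files)
        by_size[std::filesystem::file_size(f)].push_back(f);

    std::vector<std::vector<std::filesystem::path>> groups;
    for (const auto& [sz, same_size] : by_size) {
        if (same_size.size() < 2)
            continue;                                // unique size, no duplicate possible
        std::map<std::uint64_t, std::vector<std::filesystem::path>> by_hash;
        for (const auto& f : same_size)
            by_hash[file_checksum(f)].push_back(f);
        for (auto& [h, group] : by_hash)
            if (group.size() >= 2)
                groups.push_back(std::move(group));
    }
    return groups;
}
[/code]
Only files that still share a size+hash bucket would need a final byte-for-byte confirmation.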
TC plugins: PCREsearch and RegXtract
I would try a hash-per-block method for comparing large files by contents, just to be able to detect a difference earlier than after reading both files entirely (see the sketch below).
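Roughly like this (just a sketch of the block-wise early-exit idea, here with a plain memcmp per block; when more than two same-size files are involved, a hash per block could be cached instead of the raw data):
[code]
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Compare two files block by block and stop at the first differing block,
// instead of reading both files completely before deciding.
bool files_equal_blockwise(const std::string& path_a, const std::string& path_b,
                           std::size_t block_size = 1 << 20 /* 1 MiB */)
{
    std::ifstream a(path_a, std::ios::binary), b(path_b, std::ios::binary);
    if (!a || !b)
        return false;

    std::vector<char> buf_a(block_size), buf_b(block_size);
    for (;;) {
        a.read(buf_a.data(), block_size);
        b.read(buf_b.data(), block_size);
        const std::streamsize got_a = a.gcount(), got_b = b.gcount();

        if (got_a != got_b)
            return false;                            // lengths differ
        if (got_a == 0)
            return true;                             // both files fully read, no mismatch
        if (std::memcmp(buf_a.data(), buf_b.data(), static_cast<std::size_t>(got_a)) != 0)
            return false;                            // early exit on the first differing block
    }
}
[/code]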
When you search for duplicates, you can already use any checksum compare method using corresponding WDX plugin (e.g. http://totalcmd.net/plugring/wdhash.html).
I only suggest comparing the speed of compare-by-contents versus compare-by-hashes; you can do that already.
There are other plugins that have 64-bit versions, just use the search, e.g. LotsOfHashes.
maverick1999 wrote: On a laptop with 2 identical files located on the same drive the current speed never goes above 7 MB/s (the drive has a 32 MB buffer).
It's understandable for a (mechanical) HDD.
SHA-512, however, is limited only by the disk read speed there (~100 MB/s), so the bottleneck is not the hash function.
And like I said: I'm all for a configurable buffer, and/or using a checksum/hash compare.
But, in case of a fast SSD, the bottleneck will be the hash function.
Just try to compare files from solid rar archives with TC's "Synchronize dirs" function, where TC uses SHA-1.
I barely get more than ~140 MB/s on my test system, even though the SSD would easily reach ~400 MB/s.
So when looking for duplicates of large files on today's fast (SSD) drives, using a hash is probably not the best solution,
even if TC's SHA-1 implementation got a slight speed boost.
I'm not sure if you can read drive properties via the WinAPI, i.e. identify SSD drives in an unambiguous way.
But if you can, TC could use checksum/hash compare for mechanical HDDs, and for SSDs the buffered binary compare that we have now.
And of course it would be best to still let the user configure which method he wants to use, including the mentioned buffer size.
Update: there actually are some ways to detect non-mechanical drives:
http://stackoverflow.com/questions/9273373/tell-if-a-path-refers-to-a-solid-state-drive-with-winapi
http://stackoverflow.com/questions/23363115/detecting-ssd-in-windows
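Both links boil down to the same approach: query the drive's "seek penalty" property, where "no seek penalty" usually means SSD. A rough sketch of that call (error handling kept to a minimum; a failed query should be treated as "unknown"):
[code]
#include <windows.h>
#include <winioctl.h>
#include <string>

// Ask the storage stack whether a physical drive incurs a seek penalty;
// "no seek penalty" is the usual heuristic for "this is an SSD".
// driveNumber is the N in \\.\PhysicalDriveN.
bool DriveHasNoSeekPenalty(int driveNumber)
{
    std::wstring path = L"\\\\.\\PhysicalDrive" + std::to_wstring(driveNumber);
    HANDLE h = CreateFileW(path.c_str(), 0,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return false;                                // cannot query, treat as unknown

    STORAGE_PROPERTY_QUERY query = {};
    query.PropertyId = StorageDeviceSeekPenaltyProperty;
    query.QueryType  = PropertyStandardQuery;

    DEVICE_SEEK_PENALTY_DESCRIPTOR desc = {};
    DWORD bytes = 0;
    BOOL ok = DeviceIoControl(h, IOCTL_STORAGE_QUERY_PROPERTY,
                              &query, sizeof query,
                              &desc, sizeof desc, &bytes, NULL);
    CloseHandle(h);

    return ok && !desc.IncursSeekPenalty;            // no seek penalty -> likely SSD
}
[/code]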
TC plugins: PCREsearch and RegXtract
MVV wrote: I think that the bottleneck in such case is RAR decompressor
No. The first file is decompressed, and the hash is calculated after that (you can see it in the lower status bar).
Then the second file is decompressed, and its hash is calculated.
I repeatedly measured ~15 seconds for a 2 GB file, just for the hash calculation and not counting decompression; that is roughly 135 MB/s.
That is the typical calculation performance for SHA-1, as you can also see in the linked article.
I doubt that you would ever get much faster than ~200 MB/s on current machines, even with optimized code.
And yes, this is the same hash function that is optionally used for duplicates search.
TC plugins: PCREsearch and RegXtract