New algorithm for comparing very large files
Posted: 2009-10-20, 08:09 UTC
by ts4242
Comparing large identical files takes too long to complete, so I suggest the following pseudocode:
Code:
IF the file sizes are equal AND greater than 500 MB THEN
    create MD5_1 for file 1
    create MD5_2 for file 2
    IF MD5_1 = MD5_2 THEN
        popup message "The two files are identical!"
    ELSE
        call compare_by_contents procedure
    END IF
END IF
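For what it's worth, here is the same idea as a runnable Python sketch (the 500 MB threshold comes from the pseudocode above; the function names are mine, not anything TC actually does):
Code:
import hashlib
import os

SIZE_THRESHOLD = 500 * 1024 * 1024  # the 500 MB cutoff from the pseudocode

def md5_of(path, block_size=1024 * 1024):
    # Stream the file through MD5 in 1 MB blocks to keep memory use flat.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            h.update(block)
    return h.hexdigest()

def compare_by_contents(path1, path2, block_size=1024 * 1024):
    # Plain block-by-block byte comparison of two files.
    with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
        while True:
            b1, b2 = f1.read(block_size), f2.read(block_size)
            if b1 != b2:
                return False
            if not b1:  # both files exhausted at the same point
                return True

def files_identical(path1, path2):
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False
    if os.path.getsize(path1) > SIZE_THRESHOLD:
        if md5_of(path1) == md5_of(path2):
            return True  # "The two files are identical!"
        # hashes differ: fall back to the byte compare to locate it
    return compare_by_contents(path1, path2)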
Currently:
It takes about 3 minutes to determine that two 600 MB files are identical.
It takes about 5 minutes to determine that two 1 GB files are identical.
Using the above pseudocode:
It would take about 20 seconds for two 600 MB files (10 seconds × 2 for creating the MD5s).
It would take about 46 seconds for two 1 GB files (23 seconds × 2 for creating the MD5s).
The above timings were measured on Win 7 64-bit / 2 GHz C2D / 2 GB RAM.
What do you think?
Posted: 2009-10-20, 11:04 UTC
by MVV
I doubt that calculating MD5 for two memory blocks can be faster than just comparing those blocks. Perhaps the internal comparer's algorithm could be improved instead.
BTW, I suggest you repeat your test, creating the MD5s before you compare the files with the compare tool.

In my case (an 891 MB file; I made a hardlink to it after calculating the MD5, then compared the file with its hard copy), the MD5 calculation took 11 seconds and the comparison then took 3 seconds! So I don't think the internal compare tool is lazy. Don't forget about caching!

Posted: 2009-10-20, 13:03 UTC
by m^2
Comparison might be somewhat slower because TC has to read 2 streams at once, which causes unnecessary disk head moves. But with proper optimizations it shouldn't be a big issue and most importantly - it lets TC quit comparing as soon as it finds the first difference, which can save a lot of time.
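m^2's early-exit point is easy to see in code: a variant of compare_by_contents from the sketch above that stops reading at the first differing byte and reports its offset (again just an illustration, not TC's implementation):
Code:
def first_difference(path1, path2, block_size=1024 * 1024):
    # Returns the offset of the first differing byte, or -1 if identical.
    offset = 0
    with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
        while True:
            b1, b2 = f1.read(block_size), f2.read(block_size)
            if b1 != b2:
                for i, (x, y) in enumerate(zip(b1, b2)):
                    if x != y:
                        return offset + i
                return offset + min(len(b1), len(b2))  # one file is shorter
            if not b1:
                return -1  # reached EOF with no difference
            offset += len(b1)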
Posted: 2009-10-20, 21:47 UTC
by ts4242
2MVV
Yes, performing a compare by contents (CBC) immediately after copying takes less time than creating MD5s, thanks to caching, but that is a rare case, because a CBC usually isn't executed immediately after copying.
I think the suggested algorithm would save a lot of time, especially in the Synchronize directories tool. Imagine syncing two folders that each contain 10 files of 1 GB: currently that takes 5×10 = 50 minutes, versus 46×10 = 460 seconds (about 7.5 minutes).
Edit:
I'm pretty sure the suggested algorithm is already used, in some form, in Find duplicate files. To check, I did the following test:
1- create folder F:\test
2- create 5 subfolders
3- copy a file of size 1.2 GB to each subfolder
4- search for duplicate files in F:\test (same size, same contents are checked)
5- TC takes about 2 minutes to list the five files
How can this be? The answer, I imagine, is that TC doesn't compare the files by contents but instead creates an MD5 checksum for each file and considers files with identical checksums to be duplicates.
Calculating the MD5 of a 1.2 GB file takes 25 seconds, so 5×25 = 125 seconds (~2 minutes).
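If that guess is right, the core of such a duplicate finder would look something like this (a sketch of the generic size-then-hash technique, reusing md5_of from the first sketch; whether TC really works this way is exactly the open question):
Code:
import os
from collections import defaultdict

def find_duplicates(root):
    # Only files of the same size can possibly match, so group by size first.
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)
    # Hash only the groups that contain more than one file.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) > 1:
            for path in paths:
                by_hash[md5_of(path)].append(path)
    return [group for group in by_hash.values() if len(group) > 1]

Note that this hashes each candidate file exactly once, which matches the observed timing: five 1.2 GB copies at ~25 seconds each give 5×25 = 125 seconds, with no pairwise comparisons at all.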
2Ghisler
Am I right?
Is it possible to use the same technique in Compare By Contents and Synchronize directories tool?
Posted: 2009-10-21, 01:53 UTC
by Balderstrom
Sorry but your numbers aren't realistic and don't make sense.
I just hardlinked 39 files totaling 14.4 GB (15,495,973,036 bytes).
Comparing the two folders took 4 min 25 s.
By your numbers that should have taken well over an hour?
Even if they were separate files, that wouldn't result in a 20-fold increase in time.
Posted: 2009-10-21, 03:23 UTC
by ts4242
2Balderstrom
I'm talking about comparing files with their copies, not hardlinks.
Hardlinks point to the same physical data, so what is there to compare?
Posted: 2009-10-21, 06:03 UTC
by MVV
ts4242,
I just wanted to say that the first run of an operation is always slower than the second and later ones, for both MD5 calculation and file comparison. So you need to run the test with no caching (e.g. immediately after a system start, or after clearing the cache).
New test: I copied a file (778 MB) onto another partition, cleared the disk cache, compared the files, then cleared the disk cache again and created an MD5 for one file. Compare time was ~46 s; MD5 calculation time ~17×2 = ~34 s.
My own compare tool, written specially for these tests (it uses a 10 MB block size and memcmp), did the comparison in 43 seconds, so the result is almost the same as TC's. BTW, a second compare (after I modified a byte at the end of the file) was done in 3-4 seconds: the cache had worked!

Changing the block size to 20 MB doesn't decrease the compare time. I also noticed that cache clearing doesn't help in Win 7: my comparer compares files in 2-3 seconds if they were compared before, and that time persists even after the cache is cleared!

So only a reboot can make sure the test is objective.
Posted: 2009-10-21, 12:50 UTC
by Balderstrom
Yeah, much to my chagrin, it does take about 5 minutes per GB when they are separate files.
I did an MD5 test as well, as I was curious how sensitive it would be to a one-character difference in a GB+ file.
With one character changed near the middle of two 1.5 GB files, they got very distinct MD5 strings:
0114e51210e8debdc276bca210a7bec8 *Test.mbx
66a3718a4230ab3b5769e521f696f98f *Copy of Test.mbx
Generating the MD5s took about 2 minutes.
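This avalanche behavior is easy to reproduce on a small buffer (a toy example, not Balderstrom's actual files):
Code:
import hashlib

data = bytearray(b'x' * 1024 * 1024)   # 1 MB of identical bytes
print(hashlib.md5(data).hexdigest())

data[512 * 1024] ^= 1                  # flip a single bit near the middle
print(hashlib.md5(data).hexdigest())   # a completely different digest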
Also of note: the File Compare tool is unable to do a side-by-side view of those two files due to lack of memory... I would have expected that tool to do a partial load, or load on demand, in order to handle extremely large files.
I will agree this could be useful, though I've previously requested an MD5 comparison built into SyncTool and it was shot down outright. I don't believe I've had a single suggestion that was well received by Mr. Ghisler (aside from the \Folder display string in File List). I fail to see the point in wasting time crafting a board message to present ideas.
Posted: 2009-10-21, 13:10 UTC
by m^2
MVV wrote: My own compare tool, written specially for these tests (it uses a 10 MB block size and memcmp), did the comparison in 43 seconds, so the result is almost the same as TC's. [...] So only a reboot can make sure the test is objective.
memcmp is slow; find some optimized version. Or, better, compare in a separate thread while the first one prepares the next pair of chunks.
What function did you use to read the files? FileRead?
Also, I've heard that memory-mapped files are the fastest I/O model for file archivers. This is somewhat similar and might work well. It would also surely work great with files on separate discs.
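A sketch of that overlap idea, with a reader thread fetching the next pair of chunks while the main thread compares the current pair (all names here are mine; MVV's tool may be structured quite differently):
Code:
import queue
import threading

def compare_overlapped(path1, path2, block_size=10 * 1024 * 1024):
    chunks = queue.Queue(maxsize=2)  # double buffering

    def reader():
        # All I/O happens here, overlapping with the comparison below.
        with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
            while True:
                pair = (f1.read(block_size), f2.read(block_size))
                chunks.put(pair)
                if not pair[0] and not pair[1]:
                    return  # two empty chunks act as the EOF sentinel

    # Daemon thread: it may still be mid-read if we exit early below.
    threading.Thread(target=reader, daemon=True).start()
    while True:
        b1, b2 = chunks.get()
        if b1 != b2:
            return False  # early exit on the first difference
        if not b1:
            return True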
It would be nice to see the source of a fast MD5 comparer; I'm sure TC could get close if both used the same I/O model.
Let's assume a pessimistic case: a 5400 RPM laptop HDD with 12 ms latency and 100 MB/s reads. A 10.8 MB block would get asymptotically 90% performance (really, somewhat better).
A more realistic case: a 7200 RPM desktop drive, 8.5 ms, 80 MB/s. A 6.12 MB block gives the same 90% performance.
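The arithmetic behind those block sizes, assuming the simple model that every block read costs one seek (my reconstruction of the reasoning, not m^2's actual formula):
Code:
def block_for_efficiency(seek_s, rate_mb_s, efficiency):
    # transfer / (transfer + seek) = efficiency
    # => transfer = seek * efficiency / (1 - efficiency)
    transfer_s = seek_s * efficiency / (1.0 - efficiency)
    return rate_mb_s * transfer_s  # block size in MB

print(block_for_efficiency(0.012, 100, 0.9))   # ~10.8 MB (5400 RPM case)
print(block_for_efficiency(0.0085, 80, 0.9))   # ~6.12 MB (7200 RPM case)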
Posted: 2009-10-21, 13:53 UTC
by MVV
Balderstrom,
Yes, MD5 may show very different results if you change one byte.
But think about what an MD5 hash is: 32 hex characters, i.e. only 16 bytes. So as soon as a file is longer than 16 bytes, there are more possible files than possible hashes, and in theory two different files can have the same MD5 hash! This is easy to understand. A few years ago some people found practical collisions;
Wikipedia also has a paragraph about files with the same MD5.
Posted: 2009-10-23, 03:34 UTC
by ts4242
MVV wrote: in theory two different files can have the same MD5 hash
Yes, it is possible. I found
HERE two different EXE files that have the same MD5:
Win32 executable HelloWorld-colliding.exe
Win32 executable GoodbyeWorld-colliding.exe
But I think it is IMPOSSIBLE to get such files accidentally; they must have been crafted deliberately.
Posted: 2009-10-23, 05:17 UTC
by m^2
It is possible, though very unlikely (unless you're Google, Facebook, or so).
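For a sense of "very unlikely": the birthday approximation for a 128-bit hash (a back-of-envelope sketch, not a figure from the thread):
Code:
# Probability of at least one accidental MD5 collision among n files,
# using the birthday approximation p ≈ n^2 / (2 * 2^128).
n = 10**12                      # a trillion distinct files
p = n * n / (2.0 * 2**128)
print(p)                        # ~1.5e-15, i.e. effectively never by accident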