Search duplicates by same content very very slow
Posted: 2015-11-19, 11:00 UTC
by maverick1999
Finding duplicates or comparing two files located on the same drive appears to be very slow on some hard drives (especially laptop disks, which get less than 2 MB/s).
Possible reason: TC reads a very small buffer from each file and compares it.
This translates into random disk access which is very slow on many disks.
Possible solution:
(1). BEST: add a "same checksum" option after the "same content" option in the Search window and allow MD5, SHA1 or SHA512 selection.
Please see FreeCommanderXE checksum option in the Synchronize window:
freecommander DOT com/fchelpxe/en/lib/NewItem265.png
(2). add a read buffer size setting to the configuration Options
Creating a checksum file from one file and verifying it against the other, for example, is 50-100 times faster than comparing by content.
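Roughly what I mean by option (1), as a sketch only (the Win32 CryptoAPI call sequence and the 4 MB block size are my own assumptions, not TC's actual code): hash each candidate file front to back with large sequential reads, then group files by digest, so the drive never has to seek back and forth between two files.
Code:
/* Sketch of option (1): hash one file at a time with large sequential
   reads so the drive never seeks between two files.
   The Win32 CryptoAPI and the 4 MB block size are assumptions. */
#include <windows.h>
#include <wincrypt.h>

#define READ_BLOCK (4 * 1024 * 1024)   /* 4 MB sequential reads */

BOOL md5_of_file(const char *path, BYTE digest[16])
{
    static BYTE buf[READ_BLOCK];
    HCRYPTPROV prov = 0;
    HCRYPTHASH hash = 0;
    HANDLE file = INVALID_HANDLE_VALUE;
    DWORD read = 0, len = 16;
    BOOL ok = FALSE;

    if (!CryptAcquireContext(&prov, NULL, NULL, PROV_RSA_FULL, CRYPT_VERIFYCONTEXT))
        return FALSE;
    if (!CryptCreateHash(prov, CALG_MD5, 0, 0, &hash))
        goto done;
    file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
    if (file == INVALID_HANDLE_VALUE)
        goto done;

    /* One file at a time, front to back: the drive can read ahead. */
    while (ReadFile(file, buf, READ_BLOCK, &read, NULL) && read > 0)
        if (!CryptHashData(hash, buf, read, 0))
            goto done;

    ok = CryptGetHashParam(hash, HP_HASHVAL, digest, &len, 0);
done:
    if (file != INVALID_HANDLE_VALUE) CloseHandle(file);
    if (hash) CryptDestroyHash(hash);
    if (prov) CryptReleaseContext(prov, 0);
    return ok;
}
Files of equal size could then be grouped by digest; only files whose digests collide would ever need a byte-by-byte confirmation.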
Thank you.
Posted: 2015-11-19, 11:05 UTC
by ghisler(Author)
TC uses this method with checksums only if there are 3 or more files of the same size. Why? If you have two 2GB files, and there is a difference in the first byte, my method is almost instantaneous. Your method takes a lot of time.
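For illustration, a minimal sketch of such an early-exit block compare (plain C stdio and an arbitrary 64 KB block, both assumptions, not TC's real code): if the first block already differs, only one block is read from each file, whereas any checksum method has to read both files completely.
Code:
/* Early-exit block compare, illustration only (plain C stdio,
   arbitrary 64 KB block; not TC's actual code). */
#include <stdio.h>
#include <string.h>

#define BLOCK (64 * 1024)

int same_content(const char *a, const char *b)   /* 1 = identical */
{
    static char ba[BLOCK], bb[BLOCK];
    FILE *fa = fopen(a, "rb"), *fb = fopen(b, "rb");
    int same = (fa && fb);

    while (same) {
        size_t ra = fread(ba, 1, BLOCK, fa);
        size_t rb = fread(bb, 1, BLOCK, fb);
        if (ra != rb || memcmp(ba, bb, ra) != 0) { same = 0; break; }
        if (ra == 0) break;               /* both files ended: identical */
    }
    if (fa) fclose(fa);
    if (fb) fclose(fb);
    return same;
}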
Posted: 2015-11-19, 12:11 UTC
by maverick1999
If you have two 2 GB files that are identical, then TC will read alternately from each file, thrashing the hard drive until the very last byte
(stopping, repositioning the heads and waiting for the disk to rotate back into position after each small buffer read).
Posted: 2015-11-20, 13:40 UTC
by MVV
It depends on the compare buffer size...
I think it might be good to make it configurable. I think 64 MB would be plenty, but it may actually be faster to read e.g. the first 1 MB of the files separately.
Posted: 2015-11-23, 10:49 UTC
by ghisler(Author)
I did try with a 1 MB buffer, but it was actually slower. Maybe it depends on the hard disk. Modern hard disks use some kind of look-ahead buffering to make this faster.
Posted: 2015-11-23, 11:34 UTC
by MVV
ghisler,
I think that the buffer size should be much larger than even 1 MB for modern HDDs (which have caches of up to 128 MB). You could also make it configurable (and maybe even auto-detected by default, depending on available free RAM).
The more we read in a single pass, the faster we compare the data, as long as both blocks fit in RAM.
Posted: 2015-11-23, 14:24 UTC
by milo1012
I agree that the buffer size should be configurable.
But please no auto-detect.
MVV wrote:I think that the buffer size should be much larger than even 1 MB for modern HDDs
It's simply not true that you need a (very) large buffer for modern HDDs (I'm not talking about SSDs).
You can't predict behavior just from the HDD cache. The cache doesn't matter at all if the two files lie at completely different physical positions.
HDD controllers work very differently from what you'd think.
It's pretty much the other way around with SSDs: for those you really should use a buffer >= 1 MB.
Additionally, the Windows file cache may interfere with this.
Posted: 2015-11-23, 16:14 UTC
by MVV
I just think that HDD reading speed won't be lower with a larger block size as long as the buffer fits in both RAM and the HDD cache, especially for a defragmented file.
If a block is larger than the HDD cache, it may require an additional disk revolution to be read fully (I'm not talking about SSDs).
OK, no auto-detection.

But I don't think it would be too bad to auto-select e.g. a 2 MB buffer size for PCs with 1+ GB of RAM and some smaller value otherwise.
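Something along these lines would be enough (just a sketch; GlobalMemoryStatusEx is from the Win32 API, and the thresholds are placeholder values):
Code:
/* Hypothetical default selection as suggested above: 2 MB on machines
   with 1+ GB of RAM, 256 KB otherwise. The numbers are placeholders. */
#include <windows.h>

DWORD default_compare_buffer(void)
{
    MEMORYSTATUSEX ms;
    ms.dwLength = sizeof(ms);
    if (GlobalMemoryStatusEx(&ms) && ms.ullTotalPhys >= (1ULL << 30))
        return 2 * 1024 * 1024;   /* 2 MB */
    return 256 * 1024;            /* fallback for low-memory PCs */
}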
Posted: 2015-11-23, 18:54 UTC
by milo1012
MVV wrote:But I don't think it would be too bad to auto-select e.g. a 2 MB buffer size for PCs with 1+ GB of RAM and some smaller value otherwise.
Agreed.
(and maybe at the same time we can get a configurable memory limit for the CBC edit mode)
Posted: 2015-11-23, 23:04 UTC
by maverick1999
Maybe a predefined list of buffer sizes (32, 64, 128, 256, ...) and a "compare by checksum" option could be implemented and tested for some time to get feedback.
There are so many disk types (hard, hybrid, with different internal caches, spindle speeds, etc.).
Posted: 2015-11-23, 23:10 UTC
by Dalai
Support++ for a "Compare by checksum" option! This would give everyone the opportunity to choose the method according to the current scenario (different partitions on the same disk, different disks, network and so on).
Regards
Dalai
Posted: 2015-11-24, 07:06 UTC
by MVV
As Christian stated, TC already uses checksum comparison when the number of files to compare is greater than 2, and when the 2 files differ, the regular compare method may really be faster.
I don't think that a standard checksum comparison can be faster than comparing 2 files block by block given a good block size. However, I think that comparing block checksums may be interesting (I mean: read two blocks, calculate their checksums, compare them, then read the next two blocks, and so on, so that we don't need to compare large memory blocks but only their small checksums).
I think HDD reading speed is the major factor, so we should really play with block sizes.
I've written a simple program that compares two 400 MB memory buffers 128 times using a simple byte-by-byte compare function and the standard C memcmp function:
The simple compare function compares about 500 MB per second.
memcmp (which compares at least DWORDs in the case of aligned memory) compares about 4-5 GB per second.
(I've tried 2 MB and 400 MB compare block sizes; it makes no difference.)
So we could get similar speeds if only the HDD weren't so slow, and of course only if TC's memory compare function is as fast as memcmp.
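For reference, a rough reconstruction of that kind of test (not my original program; buffer size, fill pattern and repeat count are arbitrary):
Code:
/* Compare two identical large buffers byte by byte and with memcmp,
   and time both. Rough reconstruction, not the original program. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define SIZE (400u * 1024 * 1024)   /* two 400 MB buffers */
#define REPS 16                     /* fewer repeats than 128 to keep it short */

static int cmp_bytes(const unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (a[i] != b[i]) return 1;
    return 0;
}

int main(void)
{
    unsigned char *a = malloc(SIZE), *b = malloc(SIZE);
    clock_t t;
    int i, diff = 0;

    if (!a || !b) return 1;
    memset(a, 0x5A, SIZE);            /* identical contents: worst case */
    memset(b, 0x5A, SIZE);

    t = clock();
    for (i = 0; i < REPS; i++) diff |= cmp_bytes(a, b, SIZE);
    printf("byte-by-byte: %.1f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (i = 0; i < REPS; i++) diff |= (memcmp(a, b, SIZE) != 0);
    printf("memcmp:       %.1f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    free(a); free(b);
    return diff;   /* use the results so the compiler can't drop the compares */
}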
Posted: 2015-11-24, 08:27 UTC
by maverick1999
@MVV:
the slowdown comes from disk head repositioning (not from memcmp or the byte-by-byte compare method).
When you compute a checksum, you read the file sequentially, which is predictable for the drive.
NOTE***:
The biggest problem comes when searching for DUPLICATES on the same drive.
You end up comparing 2 identical files byte by byte and seeking the drive heads like crazy between the 2 files.
In this case checksumming or a large buffer is much faster.
Posted: 2015-11-24, 11:31 UTC
by Dalai
2MVV
The issue is random access to the disk (reading two files at the same time) rather than linear access (reading one file after the other). That's what makes the comparison process so slow and thrashes the disk. So, the goal should be to make the HDD read files linearly instead of randomly. It doesn't really matter how the comparison is done in memory once the data has been read from the disk.
Regards
Dalai
Posted: 2015-11-24, 13:27 UTC
by MVV
maverick1999,
Dalai,
If you use a large block size, head repositionings (and the slowdowns they cause) are insignificant. Remember TC's big file copy mode, which was good at copying files within the same HDD precisely because of its large block size.
But as you can see from my quick test, even a memory compare may be done in different ways, and two methods may differ quite a bit in speed (up to 10 times).
BTW, a simple DWORD comparing function compares 2 GB per second; it is faster than the simple BYTE comparing function but still slower than the assembly-optimized memcmp.
Hm, calculating vectors of checksums for file blocks may improve compare speed in the case of multiple large files.
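As a sketch of that last idea (FNV-1a and the 8 MB block are only stand-ins for a real checksum and a tuned block size): read each file once, front to back, keep one small checksum per block, and compare the resulting vectors instead of re-reading the files against each other.
Code:
/* Per-block checksum vectors: read a file once, front to back, in large
   blocks and keep one small checksum per block. FNV-1a and the 8 MB
   block are only stand-ins for a real checksum and a tuned block size. */
#include <stdio.h>
#include <stdint.h>

#define BLOCK (8u * 1024 * 1024)

static uint64_t fnv1a(const unsigned char *p, size_t n)
{
    uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
    size_t i;
    for (i = 0; i < n; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;              /* FNV-1a prime */
    }
    return h;
}

/* Fills out[] with one checksum per block; returns the number written. */
size_t block_checksums(const char *path, uint64_t *out, size_t max)
{
    static unsigned char buf[BLOCK];
    FILE *f = fopen(path, "rb");
    size_t n = 0, got;

    if (!f) return 0;
    while (n < max && (got = fread(buf, 1, BLOCK, f)) > 0)
        out[n++] = fnv1a(buf, got);
    fclose(f);
    return n;
}
Files with the same size and the same checksum vector are then almost certainly identical, and each file has been read exactly once, sequentially.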
