How 'File compare Algorithm' works ?

Waltertc · Post by *Waltertc » 2017-06-06, 10:31 UTC

Does anyone know how file compare algorithm works ?
I'd like to know in detail how text lines are compared.

kind regards,
Walter

Post by *ghisler(Author) » 2017-06-08, 09:01 UTC

TC first creates a list of checksums (hashes), one for each line. Then it matches equal hashes. When two lines are different, TC looks for the next matching line pair by comparing (left+1, right), (left, right+1), (left+2, right), (left+2, right+1), etc. until a match is found.

Once lines have been lined up, a similar method is used to match individual characters within a line.

Waltertc · Post by *Waltertc » 2017-06-08, 10:00 UTC

[face=courier]Thanks, that explains how differing lines are found.
How are the individual characters within a line matched ?

When I compare the below two files a match is found on "<description>" (columns 1 - 13 in both files) and on "tion" (columns 29-32 in the first file and columns 82-85 in the second file).
Why is there no match found on "1" or "i" or "C" or "</description>" ?

First file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>

Second file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>[/face]

Post by *ghisler(Author) » 2017-06-12, 13:55 UTC

TC compares the two lines character by character until a mismatch is found. Then it uses the same algorithm as it does for lines to find the next matching pair of characters: (left+1, right), (left, right+1), (left+2, right), (left+2, right+1), etc. until a match is found.

Waltertc · Post by *Waltertc » 2017-06-13, 08:46 UTC

Thanks, but the character by character comparison is not clear to me yet.
This is what I get from the TC File > Compare by content.

[face=courier]First (left) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>

Second (right) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>

There's a match on "<description>" and on "tion".

According to your algorithm I would expect :

First (left) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>

Second (right) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>

Does some sort of grouping take place ?
[/face]

Post by *ghisler(Author) » 2017-06-15, 09:03 UTC

Single characters are not matched. The square distances added are higher to match the two </description> strings than to match the "tion".

tuska · Post by *tuska » 2017-06-16, 12:07 UTC

Also interesting is how EmEditor Professional Version 16.9.2 solves this example:

[face=courier]First (left) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>

Second (right) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>[/face]

Post by *ghisler(Author) » 2017-06-20, 08:51 UTC

Maybe it uses a bigger number of characters to match for re-synching.

Waltertc · Post by *Waltertc » 2017-06-21, 14:22 UTC

I'm not sure what you mean with "square distances added". Does TC use it's own method for comparing or is some other common known method used ?

Post by *ghisler(Author) » 2017-06-22, 08:59 UTC

It uses what I described in my second post (4th post in total above):

TC compares the two lines character by character until a mismatch is found. Then it uses the same algorithm as it does for lines to find the next matching pair of characters: (left+1, right), (left, right+1), (left+2, right), (left+2, right+1), etc. until a match is found.

However, TC ignores single matching characters.