How 'File compare Algorithm' works ?

English support forum

Moderators: white, Hacker, petermad, Stefan2

Post Reply
Waltertc
Junior Member
Junior Member
Posts: 4
Joined: 2017-06-06, 10:20 UTC

How 'File compare Algorithm' works ?

Post by *Waltertc »

Does anyone know how file compare algorithm works ?
I'd like to know in detail how text lines are compared.

kind regards,
Walter
User avatar
ghisler(Author)
Site Admin
Site Admin
Posts: 48079
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland
Contact:

Post by *ghisler(Author) »

TC first creates a list of checksums (hashes), one for each line. Then it matches equal hashes. When two lines are different, TC looks for the next matching line pair by comparing (left+1, right), (left, right+1), (left+2, right), (left+2, right+1), etc. until a match is found.

Once lines have been lined up, a similar method is used to match individual characters within a line.
Author of Total Commander
https://www.ghisler.com
Waltertc
Junior Member
Junior Member
Posts: 4
Joined: 2017-06-06, 10:20 UTC

Post by *Waltertc »

[face=courier]Thanks, that explains how differing lines are found.
How are the individual characters within a line matched ?

When I compare the below two files a match is found on "<description>" (columns 1 - 13 in both files) and on "tion" (columns 29-32 in the first file and columns 82-85 in the second file).
Why is there no match found on "1" or "i" or "C" or "</description>" ?

First file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>

Second file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>[/face]
User avatar
ghisler(Author)
Site Admin
Site Admin
Posts: 48079
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland
Contact:

Post by *ghisler(Author) »

TC compares the two lines character by character until a mismatch is found. Then it uses the same algorithm as it does for lines to find the next matching pair of characters: (left+1, right), (left, right+1), (left+2, right), (left+2, right+1), etc. until a match is found.
Author of Total Commander
https://www.ghisler.com
Waltertc
Junior Member
Junior Member
Posts: 4
Joined: 2017-06-06, 10:20 UTC

Post by *Waltertc »

Thanks, but the character by character comparison is not clear to me yet.
This is what I get from the TC File > Compare by content.

[face=courier]First (left) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>

Second (right) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>

There's a match on "<description>" and on "tion".

According to your algorithm I would expect :

First (left) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>

Second (right) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>

Does some sort of grouping take place ?
[/face]
User avatar
ghisler(Author)
Site Admin
Site Admin
Posts: 48079
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland
Contact:

Post by *ghisler(Author) »

Single characters are not matched. The square distances added are higher to match the two </description> strings than to match the "tion".
Author of Total Commander
https://www.ghisler.com
User avatar
tuska
Power Member
Power Member
Posts: 3760
Joined: 2007-05-21, 12:17 UTC

Post by *tuska »

Also interesting is how EmEditor Professional Version 16.9.2 solves this example:

[face=courier]First (left) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>

Second (right) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>[/face]
User avatar
ghisler(Author)
Site Admin
Site Admin
Posts: 48079
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland
Contact:

Post by *ghisler(Author) »

Maybe it uses a bigger number of characters to match for re-synching.
Author of Total Commander
https://www.ghisler.com
Waltertc
Junior Member
Junior Member
Posts: 4
Joined: 2017-06-06, 10:20 UTC

Post by *Waltertc »

I'm not sure what you mean with "square distances added". Does TC use it's own method for comparing or is some other common known method used ?
User avatar
ghisler(Author)
Site Admin
Site Admin
Posts: 48079
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland
Contact:

Post by *ghisler(Author) »

It uses what I described in my second post (4th post in total above):

TC compares the two lines character by character until a mismatch is found. Then it uses the same algorithm as it does for lines to find the next matching pair of characters: (left+1, right), (left, right+1), (left+2, right), (left+2, right+1), etc. until a match is found.

However, TC ignores single matching characters.
Author of Total Commander
https://www.ghisler.com
Post Reply