How 'File compare Algorithm' works ?
Moderators: white, Hacker, petermad, Stefan2
How 'File compare Algorithm' works ?
Does anyone know how file compare algorithm works ?
I'd like to know in detail how text lines are compared.
kind regards,
Walter
I'd like to know in detail how text lines are compared.
kind regards,
Walter
- ghisler(Author)
- Site Admin
- Posts: 48079
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
TC first creates a list of checksums (hashes), one for each line. Then it matches equal hashes. When two lines are different, TC looks for the next matching line pair by comparing (left+1, right), (left, right+1), (left+2, right), (left+2, right+1), etc. until a match is found.
Once lines have been lined up, a similar method is used to match individual characters within a line.
Once lines have been lined up, a similar method is used to match individual characters within a line.
Author of Total Commander
https://www.ghisler.com
https://www.ghisler.com
[face=courier]Thanks, that explains how differing lines are found.
How are the individual characters within a line matched ?
When I compare the below two files a match is found on "<description>" (columns 1 - 13 in both files) and on "tion" (columns 29-32 in the first file and columns 82-85 in the second file).
Why is there no match found on "1" or "i" or "C" or "</description>" ?
First file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>
Second file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>[/face]
How are the individual characters within a line matched ?
When I compare the below two files a match is found on "<description>" (columns 1 - 13 in both files) and on "tion" (columns 29-32 in the first file and columns 82-85 in the second file).
Why is there no match found on "1" or "i" or "C" or "</description>" ?
First file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>
Second file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>[/face]
- ghisler(Author)
- Site Admin
- Posts: 48079
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
TC compares the two lines character by character until a mismatch is found. Then it uses the same algorithm as it does for lines to find the next matching pair of characters: (left+1, right), (left, right+1), (left+2, right), (left+2, right+1), etc. until a match is found.
Author of Total Commander
https://www.ghisler.com
https://www.ghisler.com
Thanks, but the character by character comparison is not clear to me yet.
This is what I get from the TC File > Compare by content.
[face=courier]First (left) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>
Second (right) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>
There's a match on "<description>" and on "tion".
According to your algorithm I would expect :
First (left) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>
Second (right) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>
Does some sort of grouping take place ?
[/face]
This is what I get from the TC File > Compare by content.
[face=courier]First (left) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>
Second (right) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>
There's a match on "<description>" and on "tion".
According to your algorithm I would expect :
First (left) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>
Second (right) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>
Does some sort of grouping take place ?
[/face]
- ghisler(Author)
- Site Admin
- Posts: 48079
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Single characters are not matched. The square distances added are higher to match the two </description> strings than to match the "tion".
Author of Total Commander
https://www.ghisler.com
https://www.ghisler.com
Also interesting is how EmEditor Professional Version 16.9.2 solves this example:
[face=courier]First (left) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>
Second (right) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>[/face]
[face=courier]First (left) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>v6.1 SQL Connection string added.</description>
Second (right) file :
1...5....10...5....20...5....30...5....40...5....50...5....60...5....70...5....80...5.
<description>2017-02-01 Nieuwe versie op basis van nieuwe UNSPSC module.</description>[/face]
- ghisler(Author)
- Site Admin
- Posts: 48079
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Maybe it uses a bigger number of characters to match for re-synching.
Author of Total Commander
https://www.ghisler.com
https://www.ghisler.com
- ghisler(Author)
- Site Admin
- Posts: 48079
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
It uses what I described in my second post (4th post in total above):
TC compares the two lines character by character until a mismatch is found. Then it uses the same algorithm as it does for lines to find the next matching pair of characters: (left+1, right), (left, right+1), (left+2, right), (left+2, right+1), etc. until a match is found.
However, TC ignores single matching characters.
TC compares the two lines character by character until a mismatch is found. Then it uses the same algorithm as it does for lines to find the next matching pair of characters: (left+1, right), (left, right+1), (left+2, right), (left+2, right+1), etc. until a match is found.
However, TC ignores single matching characters.
Author of Total Commander
https://www.ghisler.com
https://www.ghisler.com