Compare by content (UTF8<->UTF8)
Moderators: Hacker, petermad, Stefan2, white
If I use "Compare by content" on two UTF-8 files in the left and right panels, TC by default tries to compare UTF-8 (in the left window) with ANSI (in the right window).
Cannot confirm. Maybe your file in the right panel does not have a UTF-8 BOM signature?
Flint's Homepage: Full TC Russification Package, VirtualDisk, NTFS Links, NoClose Replacer, and other stuff!
Using TC 11.03 / Win10 x64
- ghisler(Author)
- Site Admin
- Posts: 50475
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
Yes, this is a known limitation: Compare currently detects UTF-16 and UTF-8 only by BOM. Maybe someone has a better idea how to detect them? Notepad.exe manages to detect UTF-8 even without a BOM. The problem is that this detection needs to work with all languages...
Author of Total Commander
https://www.ghisler.com
Notepad.exe makes mistakes in its detection...
For example, open Notepad and save this line:
this app can break
Then open the file in Notepad again: it is detected wrongly.
UltraEdit tries to detect the encoding too, and that causes big problems. Three times I lost my text when I opened a .php file for editing, modified some symbols in the header, and saved it. The file was in ANSI format with Cyrillic text in the lower part, but UltraEdit found the word "UTF-8" in the upper part of the file and tried to save it as UTF-8. As a result, I lost all the Cyrillic text...
In one of my Delphi projects I check whether every character in a text file is < chr(128),
or 110xxxxx is followed by 1 byte 10xxxxxx,
or 1110xxxx is followed by 2 bytes 10xxxxxx,
or 11110xxx is followed by 3 bytes 10xxxxxx,
or 111110xx is followed by 4 bytes 10xxxxxx.
If so, it is UTF-8. On any exception, it is not UTF-8.
But the file must be read before it is shown or compared. If the file is very large, this takes time...
When one of the compared files has a BOM and the other does not, maybe it would be handy to check the first difference? Reinterpret the first difference in the non-BOM window using the same code page as the file with the BOM, and compare the differences again: if they match, there is a high probability that the non-BOM file should be switched to UTF##.
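For readers who do not use Delphi, the byte-pattern check described above could look roughly like this in Python. This is only a sketch, not the Delphi code from the project: the function name `looks_like_utf8` is made up for illustration, and the sketch stops at 4-byte sequences (lead byte 11110xxx) because current UTF-8 as defined in RFC 3629 no longer allows the 111110xx form. It checks structure only and does not reject overlong encodings or surrogates.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Structural check: every lead byte is followed by the right
    number of 10xxxxxx continuation bytes. Pure ASCII also passes."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:               # 0xxxxxxx: plain ASCII
            need = 0
        elif b >> 5 == 0b110:      # 110xxxxx: 1 continuation byte
            need = 1
        elif b >> 4 == 0b1110:     # 1110xxxx: 2 continuation bytes
            need = 2
        elif b >> 3 == 0b11110:    # 11110xxx: 3 continuation bytes
            need = 3
        else:                      # stray 10xxxxxx, or 111110xx and above
            return False
        if i + need >= n and need > 0:
            return False           # sequence is cut off at end of data
        for j in range(i + 1, i + 1 + need):
            if data[j] >> 6 != 0b10:
                return False       # expected a 10xxxxxx byte here
        i += need + 1
    return True
```

With this sketch, Cyrillic text encoded as UTF-8 passes, while the same text in an ANSI code page such as Windows-1251 fails on the first non-ASCII pair.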
ghisler(Author) wrote:Maybe someone has a better idea how to detect them?
Wikipedia suggests that a detection algorithm for UTF-8 content would be easy to implement.
I don't know how well Delphi supports regular expressions (the page linked from Wikipedia contains a Perl regexp), but judging from the code comments it should be easy to implement in any programming language. There may be problems with large files, though (you'd have to parse the entire file to determine "UTF-8"-ness).
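One possible way around the large-file cost, sketched here in Python rather than Delphi, is to validate only a bounded prefix of the file in chunks. The function name `stream_prefix_is_utf8` and the `limit` and `chunk_size` defaults are assumptions for illustration, not anything TC does. A prefix check is weaker than a full scan: invalid bytes after the limit go unnoticed.

```python
import codecs
import io

def stream_prefix_is_utf8(stream, limit=1 << 20, chunk_size=1 << 16):
    """Return True if the first `limit` bytes of a binary stream decode
    as UTF-8. A multi-byte sequence cut off exactly at the limit is
    tolerated; one cut off at the real end of the file is not."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            # Real end of file: flush, so a dangling partial sequence fails.
            try:
                decoder.decode(b"", final=True)
            except UnicodeDecodeError:
                return False
            return True
        remaining -= len(chunk)
        try:
            # final=False keeps an incomplete trailing sequence buffered.
            decoder.decode(chunk, final=False)
        except UnicodeDecodeError:
            return False
    return True  # limit reached without finding an invalid sequence

# Usage with an in-memory stream; a real file would be opened with "rb".
sample = io.BytesIO("привет".encode("utf-8"))
result = stream_prefix_is_utf8(sample)
```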
aNDreas Bolotă
The truth always carries the ambiguity of the words used to express it. (Frank Herbert, God Emperor of Dune)
Ро&am wrote:When one of the compared files has a BOM and the other does not, maybe it would be handy to check the first difference? Reinterpret the first difference in the non-BOM window using the same code page as the file with the BOM, and compare the differences again: if they match, there is a high probability that the non-BOM file should be switched to UTF##.
I find this a very good idea.
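To make the proposal concrete, here is a rough Python sketch of the first-difference check, under several assumptions: the function name `bomless_side_is_probably_utf8` is invented, `latin-1` merely stands in for whatever ANSI code page the system uses, and both files are split into lines at the byte level. It only looks at the first differing line, as in the proposal, so a genuine content difference on that line yields "no".

```python
import codecs

def bomless_side_is_probably_utf8(bom_data, plain_data, ansi="latin-1"):
    """bom_data: bytes of the file with a UTF-8 BOM.
    plain_data: bytes of the file without a BOM.
    Returns True if the first line that differs under the ANSI reading
    matches once the BOM-less line is re-read as UTF-8."""
    if bom_data.startswith(codecs.BOM_UTF8):
        bom_data = bom_data[len(codecs.BOM_UTF8):]
    for ref_raw, raw in zip(bom_data.splitlines(), plain_data.splitlines()):
        ref = ref_raw.decode("utf-8", errors="replace")
        if raw.decode(ansi, errors="replace") == ref:
            continue  # no difference yet under the ANSI reading
        # First difference found: retry this line with the BOM file's encoding.
        try:
            return raw.decode("utf-8") == ref
        except UnicodeDecodeError:
            return False
    return False  # no differing line in the common part, nothing to infer
```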
Roman
Suppose you press Ctrl+F, select the FTP connection (with a saved password), but instead of clicking Connect, you drop dead.