Compare by content (UTF8<->UTF8)

The behaviour described in the bug report is either by design, or would be far too complex/time-consuming to be changed

Moderators: white, Hacker, petermad, Stefan2

Ро&am
Junior Member
Posts: 6
Joined: 2009-05-03, 21:39 UTC


Post by *Ро&am »

If I try to "Compare by content" two UTF-8 files in the left and right panels, TC by default tries to compare UTF-8 (in the left panel) with ANSI (in the right panel).
Flint
Power Member
Posts: 3487
Joined: 2003-10-27, 09:25 UTC
Location: Antalya, Turkey

Post by *Flint »

Cannot confirm. Maybe your file in the right panel does not have a UTF-8 BOM signature?
Flint's Homepage: Full TC Russification Package, VirtualDisk, NTFS Links, NoClose Replacer, and other stuff!
 
Using TC 10.52 / Win10 x64
Ро&am
Junior Member
Posts: 6
Joined: 2009-05-03, 21:39 UTC

Post by *Ро&am »

Sorry, you are right: one of the files was saved with a BOM signature (it had been edited with Notepad).
If possible, please delete this topic... :(
ghisler(Author)
Site Admin
Posts: 48088
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

Yes, this is a known limitation: Compare currently detects UTF-16 and UTF-8 only by BOM. Maybe someone has a better idea how to detect them? Notepad.exe manages to detect UTF-8 even without a BOM. The problem is that this detection needs to work with all languages...
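
For readers following along, BOM-only detection can be sketched roughly like this. It is a minimal illustration, not Total Commander's actual code: the helper name `detect_by_bom` and the `"ansi"` fallback label are made up for the example.

```python
import codecs

# Hypothetical helper for illustration; not Total Commander's implementation.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
)

def detect_by_bom(data: bytes, fallback: str = "ansi") -> str:
    """Return the encoding implied by a leading BOM, else the fallback label."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return fallback

with_bom = codecs.BOM_UTF8 + "Привет".encode("utf-8")
without_bom = "Привет".encode("utf-8")
print(detect_by_bom(with_bom))     # BOM present
print(detect_by_bom(without_bom))  # same text, no BOM, falls back
```

This is exactly the situation from the first post: two files with the same UTF-8 content end up with different guesses because only one carries the signature.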
Author of Total Commander
https://www.ghisler.com
Ро&am
Junior Member
Posts: 6
Joined: 2009-05-03, 21:39 UTC

Post by *Ро&am »

Notepad.exe makes mistakes in its detection...
For example, open Notepad and save this line:
this app can break
Then open the file again in Notepad: it will be detected incorrectly.
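
The thread does not say why Notepad misreads this line. A commonly cited explanation, treated here as an assumption, is that older Notepad versions guessed UTF-16 for short all-ASCII files with an even byte count. The snippet below only shows what that misreading would look like; it does not reproduce Notepad's heuristic.

```python
# Illustration only: reinterpret the ASCII bytes as UTF-16 LE.
raw = "this app can break".encode("ascii")  # 18 bytes, an even count
as_utf16 = raw.decode("utf-16-le")          # one code unit per byte pair

# Check whether every resulting character lies in the CJK Unified Ideographs block.
all_cjk = all(0x4E00 <= ord(ch) <= 0x9FFF for ch in as_utf16)
print(len(raw), len(as_utf16), all_cjk)
```

Each pair of ASCII letters happens to form a valid ideograph code point, so a statistical guesser has nothing that obviously contradicts the UTF-16 reading.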

UltraEdit tries to detect the encoding too, and that causes big problems: three times I lost my text when I opened a .php file, changed a few characters in the header, and saved it. The file was in ANSI format with Cyrillic text near the bottom, but UltraEdit found the word "UTF-8" in the upper part of the file and tried to save it as UTF-8. As a result, I lost all the Cyrillic text...

In one of my Delphi projects I check the text file: if every byte is < chr(128),
or a 110xxxxx byte is followed by one 10xxxxxx byte,
or a 1110xxxx byte is followed by two 10xxxxxx bytes,
or a 11110xxx byte is followed by three 10xxxxxx bytes,
or a 111110xx byte is followed by four 10xxxxxx bytes,
then it is UTF-8. If there is any exception, it is not UTF-8.
But the file has to be read before it is shown or compared. If the file is very large, that takes time... :(
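
The byte-pattern check above could be sketched like this in Python (the original is Delphi and is not shown in the thread, so this is a reconstruction under assumptions; `looks_like_utf8` is a made-up name). One deliberate difference: the 5-byte form 111110xx belonged to the original UTF-8 design but is not permitted by RFC 3629, so this sketch rejects it. It also does not reject overlong encodings or surrogates, which a strict validator would.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Structural check: each lead byte is followed by the right number of 10xxxxxx bytes."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:             # 0xxxxxxx: plain ASCII
            follow = 0
        elif b >> 5 == 0b110:    # 110xxxxx: one continuation byte
            follow = 1
        elif b >> 4 == 0b1110:   # 1110xxxx: two continuation bytes
            follow = 2
        elif b >> 3 == 0b11110:  # 11110xxx: three continuation bytes
            follow = 3
        else:                    # stray 10xxxxxx, or a 5/6-byte lead (rejected here)
            return False
        if i + follow >= n:      # sequence is cut off at the end of the data
            return False
        for j in range(i + 1, i + 1 + follow):
            if data[j] >> 6 != 0b10:  # every follower must be 10xxxxxx
                return False
        i += follow + 1
    return True

print(looks_like_utf8("Привет".encode("utf-8")))   # UTF-8 bytes
print(looks_like_utf8("Привет".encode("cp1251")))  # same text in an ANSI code page
```

Note that a pure-ASCII file passes the check, which is harmless because ASCII reads the same in UTF-8 and in the ANSI code pages.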

When one of the compared files has a BOM but the other does not, maybe it would be handy to check the first difference? Try interpreting the first difference in the non-BOM panel with the same encoding as the file that has the BOM, and compare again: if the texts then match, there is a high probability that the non-BOM file should be switched to the same UTF encoding.
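
A rough sketch of this first-difference idea, under several assumptions: the BOM file is UTF-8, the default ANSI code page is cp1251, and the comparison is a naive line-by-line one. The function name and parameters are hypothetical.

```python
import codecs

def guess_encoding_for_non_bom(bom_file: bytes, plain_file: bytes,
                               ansi: str = "cp1251") -> str:
    """Pick an encoding for plain_file by re-reading the first differing line.

    Assumes bom_file starts with a UTF-8 BOM and plain_file has no BOM.
    """
    # Split on raw bytes: UTF-8 multi-byte sequences never contain CR or LF.
    a_lines = bom_file[len(codecs.BOM_UTF8):].splitlines()
    b_lines = plain_file.splitlines()
    for a_raw, b_raw in zip(a_lines, b_lines):
        a_text = a_raw.decode("utf-8", errors="replace")
        if b_raw.decode(ansi, errors="replace") == a_text:
            continue  # identical under the ANSI reading, keep looking
        # First difference found: retry this line with the BOM file's encoding.
        try:
            if b_raw.decode("utf-8") == a_text:
                return "utf-8"
        except UnicodeDecodeError:
            pass
        return ansi  # a genuine content difference, keep the default
    return ansi

text = "header\nПривет\n"
left = codecs.BOM_UTF8 + text.encode("utf-8")
print(guess_encoding_for_non_bom(left, text.encode("utf-8")))
print(guess_encoding_for_non_bom(left, text.encode("cp1251")))
```

Only the lines up to the first difference are decoded twice, so the extra cost is small. The weak spot is a file whose first real difference happens to fall on the first non-ASCII line: the heuristic then keeps the ANSI default.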
ND
Member
Posts: 150
Joined: 2006-04-10, 16:24 UTC
Location: Sibiu, RO

Post by *ND »

ghisler(Author) wrote:Maybe someone has a better idea how to detect them?
Wikipedia suggests that a detection algorithm for UTF-8 content would be easy to implement.

I don't know how well Delphi supports regular expressions (the page linked from Wikipedia contains a Perl regexp), but judging from the code comments it should be easy to implement in any programming language. There may be problems with large files, though (you'd have to parse the entire file to determine its "UTF-8"-ness).
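
On the large-file concern, one possible compromise is to validate in chunks and stop at the first invalid sequence, optionally looking only at a leading sample. This is a sketch using Python's standard incremental decoder, not the Perl regexp from the linked page; `stream_is_utf8` and its parameters are hypothetical names.

```python
import codecs
import io

def stream_is_utf8(stream, chunk_size: int = 64 * 1024, max_bytes=None) -> bool:
    """Validate UTF-8 chunk by chunk, stopping at the first invalid sequence."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    seen = 0
    while max_bytes is None or seen < max_bytes:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of data: flush to catch a sequence cut off at EOF.
            try:
                decoder.decode(b"", final=True)
            except UnicodeDecodeError:
                return False
            return True
        seen += len(chunk)
        try:
            decoder.decode(chunk)  # buffers sequences split across chunks
        except UnicodeDecodeError:
            return False
    return True  # sample limit reached with no errors: probably UTF-8

data = ("Привет " * 1000).encode("utf-8")
print(stream_is_utf8(io.BytesIO(data), chunk_size=7))
print(stream_is_utf8(io.BytesIO("Привет".encode("cp1251"))))
```

A non-UTF-8 file usually fails within the first few non-ASCII bytes, so the full read is only paid for files that really are UTF-8 or pure ASCII. With `max_bytes` set, the answer is only a guess for the unread remainder.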
aNDreas Bolotă
The truth always carries the ambiguity of the words used to express it. (Frank Herbert, God Emperor of Dune)
Hacker
Moderator
Posts: 13067
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Post by *Hacker »

Ро&am wrote:When one of the compared files has a BOM but the other does not, maybe it would be handy to check the first difference? Try interpreting the first difference in the non-BOM panel with the same encoding as the file that has the BOM, and compare again: if the texts then match, there is a high probability that the non-BOM file should be switched to the same UTF encoding.
I find this a very good idea.

Roman
Suppose you press Ctrl+F, select the FTP connection (with a saved password), but instead of clicking Connect, you drop dead.