Compare by content and Unicode CJK symbols

milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

While writing code for Unicode handling, I often rely on TC's "Compare by Content"
when comparing text that contains rare Unicode characters.
While running some tests with a few of them,
I found that TC seems to fail when comparing them across different encodings:

𤽜𤹜𤵜𤱜
(For clarification: these are U+24F5C, U+24E5C, U+24D5C and U+24C5C.)

Take one or all four characters and write them to, for example, a UTF-16 file
(just these characters, no additional ASCII or other characters).
Now encode the same character sequence to a UTF-8 file.
Do a Compare by Content on both files in Text mode and:

TC highlights them as differing line(s), although they are identical!

It even fails when comparing BOM-less UTF-16 with UTF-16 that includes a BOM,
i.e. when the files differ only in the missing "xFFFE" (the bytes FF FE) at the beginning.
(Of course, I have to set the encoding of the BOM-less file manually; otherwise it is treated as binary.)

(Tested on TC 8.01 32-bit, Win 7 64-bit;
older versions and TC64 not tested so far.)
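
For anyone who wants to reproduce the test data, here is a minimal Python sketch (not part of TC; the helper name `decode_text` is hypothetical and only for illustration). It builds the byte sequences for UTF-8, UTF-16LE with a BOM, and BOM-less UTF-16LE, and shows that all three decode to the same four code points, so a text-mode comparison that decodes before comparing should see no difference.

```python
import codecs

# The four sample characters from the report, all outside the BMP.
SAMPLE = "\U00024F5C\U00024E5C\U00024D5C\U00024C5C"

utf8_bytes = SAMPLE.encode("utf-8")
utf16_nobom = SAMPLE.encode("utf-16-le")        # BOM-less UTF-16 (little endian)
utf16_bom = codecs.BOM_UTF16_LE + utf16_nobom   # same data preceded by FF FE


def decode_text(data: bytes, fallback: str) -> str:
    """Decode bytes, honoring a UTF-8 or UTF-16 BOM if one is present.

    `fallback` stands in for the manually selected encoding of a
    BOM-less file. Hypothetical helper, for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode(fallback)


decoded = [
    decode_text(utf8_bytes, "utf-8"),
    decode_text(utf16_bom, "utf-16-le"),
    decode_text(utf16_nobom, "utf-16-le"),
]

# The raw byte lengths differ, but the decoded text is identical.
print(len(utf8_bytes), len(utf16_nobom), len(utf16_bom))  # 16 16 18
print(all(text == SAMPLE for text in decoded))            # True
```

Writing `utf8_bytes`, `utf16_bom` and `utf16_nobom` to three files in binary mode should give the inputs described above.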
ghisler(Author)
Site Admin
Posts: 48083
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

These are 4-byte Unicode characters, each of which takes two 16-bit Unicode values instead of one. Total Commander does not support them yet. You can see that when you turn on the cursor with F6 and try to step through these characters with the arrow keys. TC will step also in the middle of such a character. I will try to support them, but cannot promise anything yet.
Author of Total Commander
https://www.ghisler.com
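
To make the "two 16-bit values" point concrete, here is a small Python sketch of the standard UTF-16 surrogate arithmetic applied to the sample characters. The function name `to_surrogates` is made up for this example; it says nothing about how TC handles these values internally.

```python
from typing import Tuple


def to_surrogates(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF)
    into its UTF-16 high and low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is in the BMP or out of range")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


for cp in (0x24F5C, 0x24E5C, 0x24D5C, 0x24C5C):
    high, low = to_surrogates(cp)
    print(f"U+{cp:05X} -> {high:04X} {low:04X}")
# U+24F5C -> D853 DF5C
# U+24E5C -> D853 DE5C
# U+24D5C -> D853 DD5C
# U+24C5C -> D853 DC5C
```

All four samples share the same high surrogate (D853) and differ only in the low one, which may make them handy test cases for code that treats the two halves separately.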

Post by *milo1012 »

I see. So you're saying that UTF-16 surrogate pairs aren't supported yet. Good to know,
since that covers quite a lot of characters (everything outside the Basic Multilingual Plane)
that can't be compared this way (yet).
"TC will step also in the middle of such a character."
Confirmed.
Also, strangely, TC inserts two replacement characters for the first CJK symbol.

But, oddly, when I try something like this:
test𤽜test𤹜𤵜𤱜

TC also steps into the middle of them, but does not insert a replacement character when the cursor moves.
So as long as ASCII characters come first, the behavior for the first CJK symbol seems to be different.
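
As an illustration of what "not stepping into the middle" could look like, here is a rough Python sketch that lists the cursor positions (in UTF-16 code units) that do not split a surrogate pair. The helper names `utf16_units` and `cursor_stops` are hypothetical, and this is only an assumption about one reasonable approach, not a description of TC's code.

```python
import struct
from typing import List


def utf16_units(text: str) -> List[int]:
    """Return the UTF-16 code units of `text` as integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def cursor_stops(units: List[int]) -> List[int]:
    """Return the indices where a cursor may rest without
    landing between a high and a low surrogate."""
    stops = []
    i = 0
    while i < len(units):
        stops.append(i)
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # Skip both halves of a well-formed pair; otherwise advance one unit.
        i += 2 if (is_high and has_low) else 1
    stops.append(len(units))  # position after the last character
    return stops


# The mixed ASCII/CJK test string from the post above.
units = utf16_units("test\U00024F5Ctest\U00024E5C\U00024D5C\U00024C5C")
print(len(units))           # 16
print(cursor_stops(units))  # [0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 12, 14, 16]
```

Positions 5, 11, 13 and 15 are missing from the list because each of them falls inside a surrogate pair.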

Post by *ghisler(Author) »

It should work in 8.5; I have added some support for it now. Thanks for providing the sample characters!