Compare by content and Unicode CJK symbols

Post by *milo1012 » 2013-08-02, 22:37 UTC

While writing code for Unicode handling I often rely on TC's "Compare by Content"
when comparing text that contains rare Unicode characters.
While running some tests of my own with a few of them,
I found that TC seems to fail when comparing them across different encodings:

𤽜𤹜𤵜𤱜

(For clarification: these are U+24F5C, U+24E5C, U+24D5C and U+24C5C.)

Take one or all four characters and write them to a file, e.g. as UTF-16
(just these characters, no additional ASCII chars or others).
Now take the same character sequence and encode it to a UTF-8 file.
Do a Compare by Content on both files in Text mode and:

TC highlights them as different line(s), although they aren't!
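
For anyone who wants to reproduce this, here is a minimal Python sketch that writes the two test files and checks that they really decode to the same text. The file names and the helpers `write_samples` and `same_text` are made up for illustration; this only shows what a correct comparison should report, not how TC compares files.

```python
import os
import tempfile

# The four sample characters from the post (all outside the BMP).
SAMPLE = "\U00024F5C\U00024E5C\U00024D5C\U00024C5C"


def write_samples(directory, text=SAMPLE):
    """Write `text` once as UTF-16 (with BOM) and once as UTF-8.

    The file names are hypothetical; returns the two paths.
    """
    utf16_path = os.path.join(directory, "sample_utf16.txt")
    utf8_path = os.path.join(directory, "sample_utf8.txt")
    with open(utf16_path, "wb") as f:
        f.write(text.encode("utf-16"))  # Python's "utf-16" codec prepends a BOM
    with open(utf8_path, "wb") as f:
        f.write(text.encode("utf-8"))   # plain UTF-8, no BOM
    return utf16_path, utf8_path


def same_text(utf16_path, utf8_path):
    """Decode both files and compare the resulting code points."""
    with open(utf16_path, "rb") as f:
        a = f.read().decode("utf-16")
    with open(utf8_path, "rb") as f:
        b = f.read().decode("utf-8")
    return a == b


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        p16, p8 = write_samples(tmp)
        print(same_text(p16, p8))
```

The bytes on disk differ completely between the two files, but the decoded text is identical, so a text-mode comparison should show no differences.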

It even fails when comparing BOM-less UTF-16 with UTF-16 that includes a BOM,
i.e. the files differ only in the "xFFFE" (bytes FF FE) missing at the beginning of one of them.
(Of course, I have to set the encoding of the BOM-less file manually, otherwise it's seen as binary.)
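
The BOM case can be sketched the same way. This assumes little-endian UTF-16, which matches the FF FE byte order mentioned above; `decode_utf16le` is a hypothetical helper, not anything from TC.

```python
import codecs

SAMPLE = "\U00024F5C\U00024E5C\U00024D5C\U00024C5C"

# Same payload, once with and once without the 2-byte BOM (FF FE).
without_bom = SAMPLE.encode("utf-16-le")
with_bom = codecs.BOM_UTF16_LE + without_bom


def decode_utf16le(data):
    """Decode UTF-16LE bytes, dropping a leading BOM if there is one."""
    if data.startswith(codecs.BOM_UTF16_LE):
        data = data[len(codecs.BOM_UTF16_LE):]
    return data.decode("utf-16-le")


if __name__ == "__main__":
    print(with_bom[:2].hex())
    print(decode_utf16le(with_bom) == decode_utf16le(without_bom))
```

Once the BOM is stripped, both byte strings are identical, so the two files hold exactly the same text.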

(Tested on TC 8.01 32-bit, Win 7 64-bit;
older versions and TC64 not tested so far.)

Post by *ghisler(Author) » 2013-08-05, 09:01 UTC

These are 4-byte Unicode characters, which each take two 16-bit Unicode values instead of one. Total Commander does not support them yet. You can see that when you turn on the cursor with F6 and try to step through these characters with the arrow keys. TC will step also in the middle of such a character. I will try to support them, but cannot promise anything yet.
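
To illustrate the "two 16-bit values" point: in UTF-16, a code point above U+FFFF is split into a high and a low surrogate. The following is a minimal sketch of the standard UTF-16 arithmetic applied to the sample characters; the function name `to_surrogates` is just for illustration.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


if __name__ == "__main__":
    for cp in (0x24F5C, 0x24E5C, 0x24D5C, 0x24C5C):
        high, low = to_surrogates(cp)
        print(f"U+{cp:05X} -> {high:04X} {low:04X}")
```

For the sample characters this gives D853 DF5C, D853 DE5C, D853 DD5C and D853 DC5C, so all four share the same high surrogate and differ only in the low one.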

Post by *milo1012 » 2013-08-05, 13:35 UTC

I see. So you say a UTF-16 surrogate pair isn't supported yet. Good to know,
since it's quite a lot of characters (all that are not in the Basic Multilingual Plane)
which aren't comparable that way (yet).

ghisler wrote: "TC will step also in the middle of such a character."

Confirmed.
Also, strangely, TC inserts two replacement characters ￹ for the first CJK symbol.

But strangely, when I try something like this:
test𤽜test𤹜𤵜𤱜

TC will also step into the middle of them, but does not insert a replacement character at the cursor.
So as long as ASCII characters come first, the behavior for the first CJK symbol seems to be different.
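
As a rough sketch of what surrogate-aware cursor stepping could look like (purely an illustration, not TC's actual implementation; `utf16_units` and `cursor_stops` are hypothetical names), the cursor would only be allowed to rest at offsets that do not split a high/low surrogate pair:

```python
import struct


def utf16_units(text):
    """Return the 16-bit UTF-16 code units of `text` as integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def cursor_stops(units):
    """Return the code-unit offsets where a cursor may rest.

    An offset is skipped when it falls between a high surrogate
    (D800-DBFF) and the low surrogate (DC00-DFFF) that follows it.
    """
    stops = [0]
    for i in range(1, len(units) + 1):
        splits_pair = (
            i < len(units)
            and 0xD800 <= units[i - 1] <= 0xDBFF
            and 0xDC00 <= units[i] <= 0xDFFF
        )
        if not splits_pair:
            stops.append(i)
    return stops


if __name__ == "__main__":
    units = utf16_units("test\U00024F5Ctest\U00024E5C\U00024D5C\U00024C5C")
    print(len(units))
    print(cursor_stops(units))
```

For the mixed test string above, the 12 characters occupy 16 code units, and the offsets 5, 11, 13 and 15 (the middle of each CJK symbol) are left out of the list of cursor stops.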

Post by *ghisler(Author) » 2013-08-08, 13:15 UTC

It should work in 8.5, I have added some support for it now. Thanks for providing the sample characters!