Unicode and UTF-8 in search

Posted: 2011-08-09, 16:01 UTC
by Thany
There are two checkboxes "Unicode" and "UTF-8" in the search dialog.

This is confusing, because you can't select both, even though UTF-8 *is* Unicode. Looking at the help, I see that "Unicode" assumes two bytes per character. Sooo..... That's UTF-16 (well, almost - see below). Call it that, to make it less confusing.

Unicode is NOT a character encoding. It's a standard that defines positions of characters in a massively large character set consisting of almost 110,000 characters.

UTF-8 is just the most popular encoding. UTF-16 is usually used for internal file types, and it is more compact than UTF-8 for text with large quantities of higher-than-0x80 characters. But mind you, UTF-16 still allows for 4 bytes per character, just like UTF-8 allows for more than 1 byte per character.
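To make the variable widths concrete, here is a minimal Python sketch that counts the bytes each encoding needs for a few sample characters. The helper name `encoded_lengths` and the sample characters are picked for illustration only; nothing here reflects how Total Commander itself works.

```python
# Byte length of single characters in UTF-8 and UTF-16LE.
# The samples are arbitrary illustrations: U+0041, U+00E9, U+20AC, U+1F600.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_lengths(ch):
    """Return (utf8_bytes, utf16le_bytes) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16LE = {utf16_len} bytes")
```

Characters up to U+FFFF take 2 bytes in UTF-16LE, while the last sample, which lies above U+FFFF, takes 4 bytes in both encodings.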

There's also the massively wasteful UTF-32 encoding, which has a fixed 4 bytes per character. On some systems, this might be necessary for certain characters, because there are more than 65535 positions in the Unicode "set". I won't be mentioning UTF-7, which is used to squeeze through archaic systems that assume "7 bits is quite enough, thank you".

So that's my 2c :)

Can this confusion be "fixed" in a next version?

Posted: 2011-08-10, 08:07 UTC
by MVV
"UTF-8" has variable-length characters while "Unicode" (which is UTF-16LE) always has 2-byte characters. They are two completely different encodings.

The checkbox says "Unicode" because that is what Notepad calls UTF-16LE.

And how do you suggest to fix it?

It is not a bug; TC just doesn't support Unicode encodings other than UTF-8 and UTF-16LE, and other encodings are rare. You can always search for the hex representation of the data.
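As a rough illustration of the hex route, the Python sketch below prints the byte sequence of a search term in several encodings, including UTF-16BE, which is not one of the two supported ones. The helper `hex_pattern` and the sample word are made up for this example, and how the hex string has to be typed into the search dialog is not covered here.

```python
def hex_pattern(text, encoding):
    """Return the bytes of `text` in `encoding` as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

word = "Zo\u00eb"  # sample search term containing one non-ASCII letter (U+00EB)
for enc in ("utf-8", "utf-16-le", "utf-16-be"):
    print(f"{enc:10} {hex_pattern(word, enc)}")
```

The same three letters produce three different byte patterns, which is why the search needs to know the file's encoding, or be given the bytes directly.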

Posted: 2011-08-10, 21:17 UTC
by Thany
My point is that Unicode is not a character encoding. So checking a box called "Unicode" doesn't tell me which Unicode encoding the search will use. Apparently it's UTF-16LE, but how would I know that?

Also, UTF-16 is NOT always 2 bytes per character, as described on Wikipedia. It allows for 4 bytes per character as well, and some semi-rare characters are really that high up in the range. And mind you, almost half of the entire Unicode set would have to use 4 bytes, because you just can't represent 110,000 characters in 2 bytes.
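For reference, the 4-byte case works through a surrogate pair: a code point above U+FFFF is split into two 16-bit code units. A minimal Python sketch of that arithmetic follows; the function name `utf16_code_units` is invented for this example, and input validation (for instance rejecting the reserved range U+D800 to U+DFFF) is left out.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as ints) for one Unicode code point.

    Sketch only: does not reject code points in the surrogate range
    or above U+10FFFF.
    """
    if code_point < 0x10000:
        # Fits in a single 16-bit code unit (2 bytes).
        return [code_point]
    # Above U+FFFF: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    return [high, low]

for cp in (0x0041, 0x20AC, 0x1F600):
    units = " ".join(f"{u:04X}" for u in utf16_code_units(cp))
    print(f"U+{cp:04X} -> {units}")
```

A tool that assumes exactly 2 bytes per character would treat the two halves of such a pair as two separate characters.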

Posted: 2011-08-11, 04:54 UTC
by MVV
TOTALCMD.chm wrote:
Unicode
Search in unicode files. In these, each letter is coded by 2 bytes. This format is used mainly on Windows NT and Windows 2000.

Posted: 2011-08-11, 13:02 UTC
by ghisler(Author)
Indeed, I chose the word "Unicode" instead of UTF-16 because that's what Notepad uses too. Also, UTF-16 is the standard Unicode format used by most Windows apps, including e.g. regedit.

Posted: 2011-08-12, 18:51 UTC
by petermad
2Thany

Well, a workaround is to change string 5613 in your language file.