Unicode and UTF-8 in search

Thany · Post by *Thany » 2011-08-09, 16:01 UTC

There are two checkboxes "Unicode" and "UTF-8" in the search dialog.

This is confusing, because you can't select both, even though UTF-8 *is* Unicode. Looking at the help, I see that "Unicode" assumes two bytes per character. Sooo..... That's UTF-16 (well, almost - see below). Call it that, to make it less confusing.

Unicode is NOT a character encoding. It's a standard that defines positions of characters in a massively large character set consisting of almost 110,000 characters.

UTF-8 is just the most popular encoding. UTF-16 is usually used for internal file formats, and is more compact than UTF-8 for text with lots of characters at U+0800 and above (those take 3 bytes in UTF-8 but 2 in UTF-16). But mind you, UTF-16 still allows for 4 bytes per character, just like UTF-8 allows for more than 1 byte per character.
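
To make the size comparison concrete, here is a minimal Python sketch that counts the bytes per character in both encodings. The sample characters and the helper name `encoded_sizes` are just picked for illustration, they have nothing to do with how TC works internally.

```python
# Compare how many bytes a single character takes in UTF-8 and UTF-16LE.
# The sample characters are arbitrary picks from different code point ranges.
samples = {
    "A": "U+0041, plain ASCII",
    "\u00e9": "U+00E9, e-acute",
    "\u20ac": "U+20AC, euro sign",
    "\U0001F600": "U+1F600, emoji outside the BMP",
}


def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-le" fixes the byte order, so no BOM is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch, note in samples.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"{note}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

In this sample UTF-16 only comes out smaller for the euro sign. The e-acute takes 2 bytes either way, and the emoji takes 4 bytes in both.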

There's also the massively wasteful UTF-32 encoding, which has a fixed 4 bytes per character. On some systems, this might be necessary for certain characters, because there are more than 65535 positions in the Unicode "set". I won't be mentioning UTF-7, which is used to squeeze through archaic systems that assume "7 bits is quite enough, thank you".
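
A small Python sketch of both points, assuming nothing beyond the standard codecs: UTF-32 always spends 4 bytes per code point, and UTF-7 output never uses the high bit. The sample text is arbitrary.

```python
# 'A', the euro sign, and an emoji that lies above U+FFFF.
text = "A\u20ac\U0001F600"

# "utf-32-le" fixes the byte order, so no BOM is prepended.
utf32 = text.encode("utf-32-le")
print(len(text), "code points ->", len(utf32), "bytes in UTF-32")

# UTF-7 squeezes a non-ASCII character into bytes that all fit in 7 bits.
utf7 = "\u00e9".encode("utf-7")
print(utf7, "7-bit clean:", all(b < 0x80 for b in utf7))
```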

So that's my 2c

Can this confusion be "fixed" in a next version?

MVV · Post by *MVV » 2011-08-10, 08:07 UTC

"UTF-8" has variable-length characters while "Unicode" (which is UTF-16LE) always has 2-byte characters. They are two completely different encodings.

The checkbox says "Unicode" because that is what Notepad calls UTF-16LE.

And how do you suggest to fix it?

It is not a bug. TC just doesn't support Unicode encodings other than UTF-8 and UTF-16LE, and the other encodings are rare. You can always search for the hex representation of the data instead.
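
For the hex route, a minimal Python sketch that turns a search string into hex bytes for a given encoding. The helper name `hex_pattern` and the space-separated output format are assumptions made for illustration; this thread does not say which format TC's hex search expects.

```python
def hex_pattern(text, encoding):
    """Return the bytes of `text` in `encoding` as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


# The same word gives a different byte pattern in each encoding.
word = "Gr\u00fc\u00dfe"
for enc in ("utf-8", "utf-16-le", "utf-16-be"):
    print(f"{enc:10} {hex_pattern(word, enc)}")
```

This covers encodings without their own checkbox (UTF-16BE, for example), as long as you know which one the file uses.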

Thany · Post by *Thany » 2011-08-10, 21:17 UTC

My point is that Unicode is not a character encoding. So checking a box called "Unicode" doesn't describe which Unicode encoding to use in the search. Apparently it's UTF-16LE, but how would I know that?

Also, as described on Wikipedia, UTF-16 is NOT always 2 bytes per character. It allows for 4 bytes per character as well, and some semi-rare characters are really that high up in the range. And mind you, almost half of the entire Unicode set would have to use 4 bytes, because you just can't represent 110,000 characters in 2 bytes.
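
The 4-byte case is the surrogate pair mechanism: a code point above U+FFFF is split into two 16-bit units. A minimal Python sketch of that arithmetic, with a helper name (`to_surrogates`) and a sample character chosen just for illustration:

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


# U+1D11E is MUSICAL SYMBOL G CLEF, which lies outside the 16-bit range.
high, low = to_surrogates(0x1D11E)
print(f"manual: {high:04X} {low:04X}")
print("codec: ", chr(0x1D11E).encode("utf-16-be").hex().upper())
```

Both lines should show the same four bytes, i.e. two 16-bit units for a single character.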

MVV · Post by *MVV » 2011-08-11, 04:54 UTC

TOTALCMD.chm wrote:
Unicode
Search in unicode files. In these, each letter is coded by 2 bytes. This format is used mainly on Windows NT and Windows 2000.

ghisler(Author) · Post by *ghisler(Author) » 2011-08-11, 13:02 UTC

Indeed, I chose the word "Unicode" instead of UTF-16 because that's what Notepad uses too. Also, UTF-16 is the standard Unicode format used by most Windows apps, including e.g. regedit.

petermad · Post by *petermad » 2011-08-12, 18:51 UTC

2Thany

Well, a workaround is to change string 5613 in your language file.