Unicode and UTF-8 in search

Thany
Senior Member
Posts: 293
Joined: 2003-09-30, 09:20 UTC
Location: Netherlands

Post by *Thany »

There are two checkboxes, "Unicode" and "UTF-8", in the search dialog.

This is confusing, because you can't select both, even though UTF-8 *is* Unicode. Looking at the help, I see that "Unicode" assumes two bytes per character. Sooo..... That's UTF-16 (well, almost - see below). Call it that, to make it less confusing.

Unicode is NOT a character encoding. It's a standard that defines positions of characters in a massively large character set consisting of almost 110,000 characters.

UTF-8 is just the most popular encoding. UTF-16 is usually used for internal formats, and is more compact than UTF-8 for text with many characters from U+0800 upward (these take 3 bytes in UTF-8 but 2 in UTF-16). But mind you, UTF-16 still allows for 4 bytes per character, just like UTF-8 allows for more than 1 byte per character.

There's also the massively wasteful UTF-32 encoding, which has a fixed 4 bytes per character. On some systems, this might be necessary for certain characters, because there are more than 65535 positions in the Unicode "set". I won't be mentioning UTF-7, which is used to squeeze through archaic systems that assume "7 bits is quite enough, thank you".
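
To make the byte counts above concrete, here is a minimal Python sketch that encodes a few characters in UTF-8, UTF-16LE and UTF-32LE and reports how many bytes each one takes. The helper name `encoded_lengths` and the sample characters are illustrative choices, not anything from Total Commander.

```python
# Illustrative sketch: bytes needed per character in three Unicode encodings.
# The helper name and the sample characters are arbitrary, hypothetical choices.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")


def encoded_lengths(ch):
    """Return {encoding name: number of bytes} for a single character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


samples = {
    "latin capital A": "A",
    "e with acute": "\u00e9",
    "euro sign": "\u20ac",
    "grinning face": "\U0001F600",
}

for name, ch in samples.items():
    # Print the code point, a readable name, and the per-encoding byte counts.
    print(f"U+{ord(ch):04X} {name}: {encoded_lengths(ch)}")
```

In this sketch, only UTF-32 is the same width for every sample; the character above U+FFFF needs 4 bytes in all three encodings.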

So that's my 2c :)

Can this confusion be "fixed" in a next version?
MVV
Power Member
Posts: 8711
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

"UTF-8" has variable-length characters, while "Unicode" (which is UTF-16LE) always has 2-byte characters. They are two completely different encodings.

The checkbox says "Unicode" because that is what Notepad calls UTF-16LE.

And how do you suggest fixing it?

It is not a bug; TC just doesn't support Unicode encodings other than UTF-8 and UTF-16LE, and other encodings are rare. You can always search for the hex representation of the data.
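
As a rough illustration of the hex-search fallback, the following Python sketch turns a search term into the space-separated hex bytes it would have in a given encoding. The function name `hex_pattern` is hypothetical, and the exact hex syntax that TC's search field accepts is not shown in this thread, so the output format here is an assumption.

```python
# Sketch: hex byte pattern of a search term in a chosen encoding.
# hex_pattern is a hypothetical helper; the output format (upper-case,
# space-separated) is an assumption, not TC's documented syntax.
def hex_pattern(text, encoding):
    """Encode text and return its bytes as upper-case hex pairs."""
    return text.encode(encoding).hex(" ").upper()  # bytes.hex(sep) needs Python 3.8+


term = "Zo\u00eb"  # "Zoe" with a diaeresis on the e, as a non-ASCII example

for enc in ("utf-8", "utf-16-le", "utf-16-be"):
    print(f"{enc}: {hex_pattern(term, enc)}")
```

The same term yields a different byte sequence in each encoding, which is why a byte-level search has to know which one the file uses.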
Thany
Senior Member
Posts: 293
Joined: 2003-09-30, 09:20 UTC
Location: Netherlands

Post by *Thany »

My point is that Unicode is not a character encoding. So checking a box called "Unicode" doesn't describe which Unicode encoding to use in the search. Apparently it's UTF-16LE, but how would I know that?

Also, as described on Wikipedia, UTF-16 is NOT always 2 bytes per character. It allows for 4 bytes per character as well, and some semi-rare characters are really that high up in the range. And mind you, almost half of the entire Unicode set would have to use 4 bytes, because you just can't represent 110,000 characters in 2 bytes.
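
The 4-byte case can be checked with a short Python sketch that splits a character's UTF-16LE encoding into 16-bit code units. The helper name `utf16_units` is made up for illustration; characters above U+FFFF come out as two units (a surrogate pair), everything else as one.

```python
# Sketch: list the 16-bit code units a character occupies in UTF-16.
# utf16_units is a hypothetical helper name used only for this example.
def utf16_units(ch):
    """Return the UTF-16 code units of ch as a list of integers."""
    data = ch.encode("utf-16-le")
    # Read the encoded bytes back two at a time, little-endian.
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


for ch in ("\u20ac", "\U0001F600"):
    units = " ".join(f"{u:04X}" for u in utf16_units(ch))
    print(f"U+{ord(ch):04X} -> {units}")
```

A search that assumes exactly 2 bytes per character would treat the two units of a surrogate pair as two separate "characters".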
MVV
Power Member
Posts: 8711
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

TOTALCMD.chm wrote:
Unicode
Search in unicode files. In these, each letter is coded by 2 bytes. This format is used mainly on Windows NT and Windows 2000.
ghisler(Author)
Site Admin
Posts: 50549
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

Indeed, I chose the word "Unicode" instead of UTF-16 because that's what Notepad uses too. Also, UTF-16 is the standard Unicode format used by most Windows apps, including e.g. regedit.
Author of Total Commander
https://www.ghisler.com
petermad
Power Member
Posts: 16032
Joined: 2003-02-05, 20:24 UTC
Location: Denmark

Post by *petermad »

2Thany

Well, a workaround is to change string 5613 in your language file.
License #524 (1994)
Danish Total Commander Translator
TC 11.51 32+64bit on Win XP 32bit & Win 7, 8.1 & 10 (22H2) 64bit, 'Everything' 1.5.0.1391a
TC 3.60b4 on Android 6, 13, 14
TC Extended Menus | TC Languagebar | TC Dark Help | PHSM-Calendar