Problems searching for capital Russian letters in UTF-16BE

Bug reports will be moved here when the described bug has been fixed

Flint
Power Member
Posts: 3487
Joined: 2003-10-27, 09:25 UTC
Location: Antalya, Turkey


Post by *Flint »

1. Unpack the two text files from the archive:


MIME-Version: 1.0
Content-Type: application/octet-stream; name="test.rar"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="test.rar"

UmFyIRoHADvQcwgADQAAAAAAAADJhHSAkCsAFwAAAAoAAAACoB78iB2m2EAdNQYAIAAAAGJlLnR4
dACwtvtcAREL7QfOiUDwBT+TdsDwZnrFeJRZhvBAN3SQkCsAAwAAAAoAAAAClqhRUyCm2EAdNQYA
IAAAAGxlLnR4dACwjIZHZ8OQxD17AEAHAA==
2. These files contain the Russian word "Тест" in UTF-16LE and UTF-16BE encoding, respectively. Open Find Files in TC, enter Тест as the text to search for, check the "Unicode" option, and press "Start search".
3. Only the le.txt file is found.

If you search for ест (that is, the part of the word without the first capital letter), both le.txt and be.txt will be found.
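
For anyone who prefers not to unpack the archive, equivalent test files can be generated with a few lines of Python. This is only a sketch: it assumes each file is just a BOM followed by the word "Тест", and the helper name `write_test_files` and the temporary output folder are made up for the example. The result is a stand-in for the attached files, not necessarily a byte-identical copy.

```python
import codecs
import tempfile
from pathlib import Path

WORD = "Тест"

def write_test_files(folder: Path) -> dict:
    """Write le.txt and be.txt (BOM + word) into folder; return their raw bytes."""
    files = {
        "le.txt": codecs.BOM_UTF16_LE + WORD.encode("utf-16-le"),
        "be.txt": codecs.BOM_UTF16_BE + WORD.encode("utf-16-be"),
    }
    for name, data in files.items():
        (folder / name).write_bytes(data)
    return files

# Assumption: a fresh temporary folder is fine as the search location.
out_dir = Path(tempfile.mkdtemp())
created = write_test_files(out_dir)
for name, data in created.items():
    print(out_dir / name, data.hex(" "))
```

Point Find Files at the printed folder and repeat steps 2 and 3.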
Flint's Homepage: Full TC Russification Package, VirtualDisk, NTFS Links, NoClose Replacer, and other stuff!
 
Using TC 10.52 / Win10 x64
MaxX
Power Member
Posts: 1029
Joined: 2012-03-23, 18:15 UTC
Location: UA

Post by *MaxX »

2Flint
Did TC ever work with unicode 16BE before?
Ukrainian Total Commander Translator. Feedback and discuss.
Flint
Power Member
Posts: 3487
Joined: 2003-10-27, 09:25 UTC
Location: Antalya, Turkey

Post by *Flint »

MaxX wrote:Did TC ever work with unicode 16BE before?
At least Lister supports it fine. As for the Find Files function, I'm not sure. But even if it wasn't supposed to work with UTF-16BE files, the fact is, it already does (unless no capital letters are present), so improving it to work with capital letters would only sound logical.

BTW, I tested with English text, and the situation is the same. So there is no need to test with Cyrillic text in particular. (I was just following the original description of the bug from one of the Russian forums, and didn't think to check it on non-Russian contents.)
ghisler(Author)
Site Admin
Posts: 48088
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

I will check it - Unicode BE should work when there is a BOM, and there is definitely one in the file.
Author of Total Commander
https://www.ghisler.com
white
Power Member
Posts: 4623
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Re: Problems searching for capital Russian letters in UTF-16

Post by *white »

Flint wrote:3. Only the le.txt file is found.
Confirmed. Furthermore, when I enable "Case sensitive" both files are found.
Flint
Power Member
Posts: 3487
Joined: 2003-10-27, 09:25 UTC
Location: Antalya, Turkey

Post by *Flint »

white wrote:Furthermore, when I enable "Case sensitive" both files are found.
That's right. Sorry, I forgot to mention it.
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

UTF-16BE stores the two bytes of each 16-bit word in reversed order (its BOM is byte-reversed as well: FF FE for UTF-16LE, FE FF for UTF-16BE).
It is an accident that a part of a word was found, and only because of how the bytes overlap:

Mentioned word "Тест" in UTF-16 BE:
0422043504410442
Mentioned word "Тест" in UTF-16 LE:
2204350441044204
The two byte arrays share the run 04 35 04 41 04 42 (exactly "ест" in UTF-16BE), so it was found only because that sequence appears in the UTF-16BE file. It is a collision: the first bytes of these letters in BE are 04, and the last ones in LE are 04 too. The whole word can't be found because its byte array doesn't appear here. Also, if the characters had different high-order bytes, TC wouldn't find the text at all. Any Unicode block has such collisions.

Currently TC doesn't support UTF-16BE in the Find Files dialog (simply reversing the byte order might allow searching in BE files, but there is currently no checkbox for that).

It is interesting, though, that Lister does support UTF-16BE.
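
The byte arrays quoted above can be reproduced with Python's standard codecs. The snippet is only an illustration of the encodings (no BOM, bare word); it says nothing about how TC itself searches.

```python
word = "Тест"
be = word.encode("utf-16-be")  # big-endian, no BOM
le = word.encode("utf-16-le")  # little-endian, no BOM

print(be.hex())  # 0422043504410442
print(le.hex())  # 2204350441044204

# "ест" encoded as UTF-16BE also occurs inside the LE bytes, shifted by one byte.
tail_be = "ест".encode("utf-16-be")
print(tail_be.hex(), tail_be in le)  # 043504410442 True

# The reverse check for the bare word: the LE pattern ends in a trailing 04
# that the BE bytes do not supply when nothing follows the word.
tail_le = "ест".encode("utf-16-le")
print(tail_le.hex(), tail_le in be)  # 350441044204 False
```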
Flint
Power Member
Posts: 3487
Joined: 2003-10-27, 09:25 UTC
Location: Antalya, Turkey

Post by *Flint »

MVV wrote:it is a collision
I thought so too, so I tested with a different pair of files, both containing the English word " Test ", with a space before and a space after the word. In this case the whole word "Test" (without spaces) would fit into that collision of identical byte sequences. But TC still refuses to find the second file.

Or even simpler: just replace "Тест" with "ТЕСТ" in both files. The collision remains, but TC fails to find the UTF-16BE file if you search for "ест" or "ЕСТ".
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

Yeah, it seems that something is really wrong here. It is strange that with the 'match case' option TC finds both files, and without that option only one file.

BTW, with files without a BOM, TC finds both of them using the pattern "ТЕС" (the first three letters). So it seems that TC tries to read the BOM markers and then does something crazy.
white
Power Member
Posts: 4623
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Post by *white »

MVV wrote:It is an accident that a part of a word was found..
No, it wasn't. Try shortening the string in the test file to "е" and you will see that it does work (search for "е"). However, the case-insensitive Unicode search fails on upper-case characters. Try, for example, a test file containing "Е".
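
For comparison, here is a rough sketch of what a BOM-aware, case-insensitive search is expected to return for these files. It is not TC's code: the helper `contains_text` is hypothetical, it decodes the whole buffer before comparing, and treating BOM-less data as little-endian is an assumption of the example.

```python
import codecs

def contains_text(data: bytes, needle: str, case_sensitive: bool = False) -> bool:
    """Hypothetical helper: search UTF-16 data, choosing byte order from the BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        text = data[2:].decode("utf-16-be", errors="replace")
    elif data.startswith(codecs.BOM_UTF16_LE):
        text = data[2:].decode("utf-16-le", errors="replace")
    else:
        # No BOM: this sketch assumes little-endian.
        text = data.decode("utf-16-le", errors="replace")
    if not case_sensitive:
        # Fold case on decoded text, so byte order no longer matters here.
        text, needle = text.casefold(), needle.casefold()
    return needle in text

le_file = codecs.BOM_UTF16_LE + "Тест".encode("utf-16-le")
be_file = codecs.BOM_UTF16_BE + "Тест".encode("utf-16-be")

for name, data in (("le.txt", le_file), ("be.txt", be_file)):
    print(name,
          contains_text(data, "Тест"),
          contains_text(data, "ест"),
          contains_text(data, "Тест", case_sensitive=True))
```

With this approach both files match in every combination, which is the behaviour the report asks for.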
Flint
Power Member
Posts: 3487
Joined: 2003-10-27, 09:25 UTC
Location: Antalya, Turkey

Post by *Flint »

I cannot reproduce the problem anymore in rc2, so it seems to be fixed.
ghisler(Author)
Site Admin
Posts: 48088
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

Thanks! Btw, the problem wasn't related to Russian, it happened with "Test" in English too.
Flint
Power Member
Posts: 3487
Joined: 2003-10-27, 09:25 UTC
Location: Antalya, Turkey

Post by *Flint »

ghisler(Author) wrote:Btw, the problem wasn't related to Russian, it happened with "Test" in English too.
Yes, I noticed that later too…
white
Power Member
Posts: 4623
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Post by *white »

Yes. Seems to be fixed in 8.01 rc2. (Tested 32-bit)