Incorrect case in UTF-8 search (TC and Lister)

Slavic · Post by *Slavic » 2006-11-25, 13:59 UTC

This bug can be reproduced with a simple example. Here are two Cyrillic words: [nova] (a suffix and the short form of the adjective "new") and [noga] ("leg"). They differ only in the third letter. Because this forum isn't in UTF-8, I am posting the UTF-8 bytes of the example as they appear in ANSI (copy and save in ANSI mode, e.g. as uni_test.txt):


РЅРѕРІР°   // nova
РЅРѕРіР°   // noga
РЅРѕРІР°
РЅРѕРіР°

View this file in Lister in UTF-8 mode, select the first word, and copy it to the search dialog. Then press Shift+F7. Select the second word, copy it to the search dialog, and press Shift+F7 again. In both cases Lister reports every word in the file as a match, which is wrong!

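Cause: by default, Lister and TC use case-insensitive search. For UTF-8, however, case folding should be applied only to the decoded Unicode text of the source and the search string, not to their UTF-8 byte representations. In my example, the two words differ in a single UTF-8 byte, which shows up in the ANSI view above as "І" versus "і". The search treats these as the same letter in upper and lower case, although here they belong to two different Cyrillic letters and shouldn't be counted as equal under any circumstances.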
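To make the cause concrete, here is a minimal Python sketch of the effect. It is an illustration only, not TC's actual code: the helper `ansi_lower` is hypothetical, and it assumes the ANSI code page is Windows-1251.

```python
# Sketch: why byte-wise, ANSI-style case folding confuses the two words.
# Assumption: the ANSI code page is Windows-1251 (Cyrillic).

def ansi_lower(data: bytes, codepage: str = "cp1251") -> bytes:
    """Lower-case every byte as if it were one single-byte ANSI character."""
    return data.decode(codepage).lower().encode(codepage)

nova = "нова".encode("utf-8")  # [nova]
noga = "нога".encode("utf-8")  # [noga]

# The UTF-8 encodings differ in exactly one byte: 0xB2 versus 0xB3.
diff = [(a, b) for a, b in zip(nova, noga) if a != b]

# In cp1251, 0xB2 is the upper-case form of 0xB3, so ANSI lower-casing
# maps both words to the same byte sequence.
bytewise_equal = ansi_lower(nova) == ansi_lower(noga)

# Folding the decoded Unicode strings keeps the words distinct.
unicode_equal = "нова".casefold() == "нога".casefold()

print(diff, bytewise_equal, unicode_equal)
```

The byte-wise comparison reports the words as equal, while the comparison on decoded Unicode text does not.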

Post by *ghisler(Author) » 2006-11-26, 11:49 UTC

Indeed, case-insensitive search in UTF8 only works for English texts at this time, sorry. I found no way to make it work with UTF8 because of the variable width of characters.

Slavic · Post by *Slavic » 2006-11-27, 16:08 UTC

Well, I agree that case-insensitive search in UTF-8 is not an easy thing, because functions like stricmp() can only work with ANSI and binary Unicode, not with UTF-8 bytes.
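One possible way around this, sketched in Python under stated assumptions, is to decode the UTF-8 bytes to Unicode first, fold case there, and translate each match position back to a byte offset. The function `find_all_caseless` is a hypothetical name, this is not how TC implements it, and the sketch assumes that lower-casing does not change the number of characters (true for ASCII and Cyrillic, not for every Unicode character).

```python
from typing import List

def find_all_caseless(haystack_utf8: bytes, needle: str) -> List[int]:
    """Return the byte offsets of all case-insensitive matches of needle.

    Assumption: str.lower() keeps the character count unchanged for the
    text being searched, so indexes in the folded text match the original.
    """
    text = haystack_utf8.decode("utf-8")
    folded_text = text.lower()
    folded_needle = needle.lower()

    offsets = []
    start = folded_text.find(folded_needle)
    while start != -1:
        # Convert the character index into a byte offset in the UTF-8 data.
        offsets.append(len(text[:start].encode("utf-8")))
        start = folded_text.find(folded_needle, start + 1)
    return offsets

data = "нова\nнога\nНОВА\nнога\n".encode("utf-8")
print(find_all_caseless(data, "нова"))
print(find_all_caseless(data, "нога"))
```

With this approach each word matches only its own occurrences, including the upper-case variant, because case is compared per character rather than per byte.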

Could it be better to enable for UTF-8 only the case-sensitive search? This would fix the bug I have found. In that case, when the user selects the "UTF8" checkbox, the "Case sensitive" checkbox would be selected automatically. For case-insensitive search in ANSI texts (English etc.), the user can always switch back to plain text mode and clear this checkbox.

Post by *ghisler(Author) » 2006-11-27, 20:47 UTC

Slavic wrote: "Could it be better to enable for UTF-8 only the case-sensitive search"

I thought about that too, but then even English searches would be case-sensitive only...

A new function is planned, but I don't know when I can add it - my to do list is very very long, and is growing every day.

Post by *ghisler(Author) » 2007-01-19, 16:18 UTC

This _should_ work now for forward searches in UTF8 (not backwards yet). Can anyone test this, please?

Flint · Post by *Flint » 2007-01-19, 17:21 UTC

Yes, I ran some tests of this feature (not too thorough, however), and it seems to work fine now.

Post by *ghisler(Author) » 2007-01-19, 17:42 UTC

Great, thanks!