-[8.50b7] Search: RegEx doesn't work with UTF-8 or UTF-16

The behaviour described in the bug report is either by design, or would be far too complex/time-consuming to change.

Moderators: Hacker, petermad, Stefan2, white

MVV
Power Member
Posts: 8711
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

-[8.50b7] Search: RegEx doesn't work with UTF-8 or UTF-16

Post by *MVV »

When I check UTF-8 or UTF-16, TC unchecks the RegEx checkbox. So I can only use RegEx with ANSI.

It is logical to have RegEx support for all encodings (ANSI, UTF-8, UTF-16). It is easy to do: use a UTF-16 RegEx library and convert all ANSI or UTF-8 strings (both the RegEx expression and the input string) to UTF-16 before the RegEx call.
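The idea above, normalizing every input to one internal Unicode representation and running a single regex engine on it, can be sketched like this in Python (TC itself is not written in Python; `cp1252` below merely stands in for "ANSI", since the real ANSI code page is system-dependent):

```python
import re

def search_any_encoding(pattern: str, raw: bytes, encoding: str):
    """Decode raw file bytes to Unicode, then apply one Unicode regex to the result."""
    text = raw.decode(encoding)   # ANSI / UTF-8 / UTF-16 all become the same internal form
    return re.search(pattern, text)

# The same Unicode pattern then works against all three encodings:
ansi = "résumé".encode("cp1252")
utf8 = "résumé".encode("utf-8")
utf16 = "résumé".encode("utf-16-le")

assert search_any_encoding(r"r\w+é", ansi, "cp1252")
assert search_any_encoding(r"r\w+é", utf8, "utf-8")
assert search_any_encoding(r"r\w+é", utf16, "utf-16-le")
```

The point of the sketch is only that the regex engine sees one representation; all encoding differences are handled in the decode step, not in the pattern.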
ghisler(Author)
Site Admin
Posts: 50550
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

Currently this isn't supported, sorry. For UTF-16 it would be theoretically possible, because I already use it for file names, but I had some trouble with 0 words (16-bit words with value zero, which only appear in file contents, not in file names).

For UTF-8 it's not possible because characters have different byte widths, so character classes like [a-zA-Z] would be very difficult to support in UTF-8.
Author of Total Commander
https://www.ghisler.com
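The variable-width problem can be made concrete: in UTF-8 the character 'ä' is two bytes (0xC3 0xA4), so a byte-oriented engine cannot express a class like [äöü], while the same class is trivial once the text is decoded. A small illustration (my own example, not TC code):

```python
import re

data = "Bär".encode("utf-8")   # b'B\xc3\xa4r' -- 'ä' occupies two bytes

# Byte level: 'ä' only exists as a two-byte literal sequence...
assert re.search("ä".encode("utf-8"), data)
# ...and a byte class can only match ONE byte, so [\xc3\xa4] cannot stand for 'ä':
assert not re.search(rb"B[\xc3\xa4]r", data)

# Character level, after decoding to Unicode, the class works as the user expects:
assert re.search(r"B[äöü]r", data.decode("utf-8"))
```

This is exactly why classes over non-ASCII characters are hard to support directly on UTF-8 bytes.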

Post by *MVV »

> with 0 words
Can you explain this in more detail? I can't understand it.
> characters have different widths
For UTF-8 the easiest solution is to convert both the search string (once) and the file lines to UTF-16.

Post by *ghisler(Author) »

2MVV
I mean 16-bit words that are all zeros, like 00 00. That is the end-of-string marker in C strings. Currently the Unicode regex function cannot search beyond it.
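The effect of an embedded 00 00 word on a null-terminated string can be simulated directly. In the sketch below (my own simulation, not TC's code), `c_string_view` mimics what a C string API sees: everything after the first NUL is invisible to the regex engine, even though the data is physically present in the buffer:

```python
import re

buffer = "header\x00payload"   # file content with an embedded null

def c_string_view(s: str) -> str:
    """Mimic a null-terminated C string: data after the first NUL is cut off."""
    i = s.find("\x00")
    return s if i < 0 else s[:i]

assert re.search("payload", buffer)                      # the text is really there...
assert not re.search("payload", c_string_view(buffer))   # ...but the C view ends at NUL
```

Any engine that takes null-terminated input has this blind spot, which is the "0 word" trouble described above.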

Post by *MVV »

Ah, you meant nulls.

Is your regex library open source? Is it hard to fix?

Maybe you could replace nulls with some other character before passing the buffer to the regex function? The Unicode BMP contains 64K code points, so there are plenty of unused ones, e.g. 0xFFFF. In that case you could also modify the regex expression to allow searching for nulls: e.g. replace the substrings \x00 and \x{0000} with \xFFFF when 0xFFFF is the null stand-in (if we search in Unicode, our expression should be in Unicode too). That would be quite useful, because sometimes we really do need to search for nulls. You could even add an INI option for choosing the replacement character (e.g. with a value like $BBBB00AA, where 0xAA is the null stand-in for ANSI and 0xBBBB the one for Unicode) for cases when the user needs the replacement character itself.

BTW, I think TC should also process \xnn and \x{nnnn} in regex expressions internally before passing them to the regex function: simply replace these escapes with the characters themselves (ANSI converted to Unicode, or direct Unicode) so they work correctly while searching.

With such conversions, regex could work with a simultaneous search for text in all encodings.
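The masking proposal can be sketched in a few lines. Everything here is my own naming and an assumption about the scheme, not TC internals: nulls in the text are swapped for an assumed-unused sentinel, and explicit null escapes in the pattern are rewritten to the same sentinel, so a search for \x00 still works:

```python
import re

SENTINEL = "\uffff"   # assumed not to occur in the searched text

def mask_nulls(text: str) -> str:
    """Replace embedded nulls so a null-terminated engine can see past them."""
    return text.replace("\x00", SENTINEL)

def rewrite_pattern(pattern: str) -> str:
    """Rewrite \\x00 and \\x{0000}-style escapes so they match the masked sentinel."""
    return re.sub(r"\\x00|\\x\{0+\}", re.escape(SENTINEL), pattern)

buffer = "key\x00value"
# The user writes \x00 as usual; masking happens transparently on both sides:
assert re.search(rewrite_pattern(r"key\x00v"), mask_nulls(buffer))
```

The weak spot, raised later in the thread, is that the sentinel might itself appear in the text or inside a user-written range, which is why a fixed choice like 0xFFFF is only a heuristic.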

Post by *ghisler(Author) »

Yes, it's open source, but it's no longer maintained by its developer. I can try something myself, though. The problem is that the user may want to search for the 0 word too, and even for both 00 00 and FF FF...

Post by *MVV »

There are words that are not currently used in Unicode (according to the Wikipedia Unicode map), so there is no sense in searching for them with a regex (regex is usually used for text searches).

BTW, I'm pretty sure I will never need to search for Devanagari or Tamil characters, Arabic word ligatures, or many others, and I think most TC users will never need them either. But if someone ever does need to search for such characters, they can change the null stand-in value (e.g. set it to 0 to disable the feature).

Post by *ghisler(Author) »

Yes, but the user may use ranges that include this replacement character. In other words, it would need a lot of time, which I don't currently have.

Post by *MVV »

It will be great to see this in further TC versions. :)

I think it shouldn't be too hard to implement if you take the existing library and replace nulls with some character in both the regex expression and the input buffers. BTW, you could simply scan the regex expression to find a character that the user doesn't use, and use that character as the null replacement, so it would always work (it is impossible to enter all 65536 characters into the search field).
white
Power Member
Posts: 5819
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Post by *white »

Perhaps it is possible to use the C library PCRE. Starting with release 8.32, it can be built as any combination of three libraries supporting 8-, 16- and 32-bit character strings (including UTF-8, UTF-16 and UTF-32 strings).
Symlink
Junior Member
Posts: 18
Joined: 2005-01-21, 14:55 UTC
Location: .at

Post by *Symlink »

Is there any news I might have missed about this topic?

I have quite a few HTML files that contain a meta content-type tag with UTF in it, so regex search does not work, unless/until I simply delete this line/tag with a normal text editor.
That does not make sense to me, but admittedly I don't understand much about UTF/Unicode/software... it simply gets on my nerves : )

Thanks for your attention, best regards,
S.

Post by *ghisler(Author) »

Sorry, I currently don't see a way to do this. You can try searching the files in ANSI mode if you don't need to find non-English text.

Post by *MVV »

I think you could simply leave searching for nulls as a known limitation and add basic Unicode support; that alone would already be a very nice improvement. It wouldn't be a problem if it didn't support non-printable characters like 0000 or 0001 (one of which could also be used for masking nulls), as long as it supported all other characters.

Post by *ghisler(Author) »

The problem is that UTF-8 characters can be multiple bytes long, and the search has to deal with upper-/lowercase in search strings like [äöüÄÖÜß]. While UTF-16 has dual-word characters too, those usually don't have upper-/lowercase forms (they are mostly extended Chinese forms etc.).

The same is true for double-byte ANSI, e.g. Chinese encodings. There it's still possible to match each byte/word separately. With UTF-8, this isn't possible.

I could try to convert to UTF-16 before searching, but this will fail with e.g. EXE files, where there are also invalid UTF-8 codes.

Post by *MVV »

You could at least convert UTF-8 to UTF-16 where possible and let invalid characters be converted to anything (at the OS's discretion). Regex is a text-processing tool; it shouldn't have to work on non-printable data. It is also easy to skip invalid UTF-8 sequences; MultiByteToWideChar does it by default.
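The lenient-conversion idea can be illustrated without the Windows API. Python's `errors="replace"` plays the role attributed above to MultiByteToWideChar's default behaviour: invalid byte sequences become U+FFFD replacement characters, and the valid text around them stays searchable (this is an analogy, not TC's actual code path):

```python
import re

# Invalid UTF-8 bytes (as found in e.g. EXE files) surrounding valid UTF-8 text:
raw = b"\xfe\xfftext with caf\xc3\xa9 inside\xff"

text = raw.decode("utf-8", errors="replace")

assert re.search("café", text)    # the valid text portion is still found
assert "\ufffd" in text           # invalid bytes became U+FFFD, not an error
```

With strict decoding the same input would raise an exception instead, which matches the concern that a straight conversion "will fail with e.g. EXE files".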