[8.50b7] Search: RegEx doesn't work with UTF-8 or UTF-16
When I check UTF-8 or UTF-16, TC unchecks the RegEx checkbox, so I can only use RegEx with ANSI.
It is logical to have RegEx support for all encodings (ANSI, UTF-8, UTF-16). It is easy to do: use a UTF-16 RegEx library and convert all ANSI or UTF-8 strings (both the RegEx expression and the input string) to UTF-16 before the RegEx call.
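The suggested approach — normalize every input to one internal Unicode representation, then run a single Unicode-aware regex — can be sketched like this (a minimal illustration in Python, not TC's actual code; the helper name is made up):

```python
import re

def regex_search(pattern_text: str, data: bytes, encoding: str):
    """Hypothetical helper: decode the input to one internal Unicode
    representation, then run a single Unicode-aware regex over it."""
    text = data.decode(encoding)   # ANSI/UTF-8/UTF-16 -> internal Unicode
    return re.search(pattern_text, text)

# The same pattern works regardless of how the file was encoded:
for enc in ("cp1252", "utf-8", "utf-16-le"):
    assert regex_search(r"Gr[üu]ße \d+", "Grüße 123".encode(enc), enc)
```

The cost is one decoding pass per file, but the regex engine itself only ever has to understand one encoding.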
- ghisler(Author)
- Site Admin
- Posts: 50549
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Currently this isn't supported, sorry. For UTF-16 it would be theoretically possible, because I already use it for file names, but I had some trouble with 0 words (which only appear in files, not in file names).
For UTF-8 it's not possible, because characters have different byte widths, so character classes like [a-zA-Z] would be very difficult to support in UTF-8.
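The variable-width problem can be demonstrated by running a byte-level regex over UTF-8 data (a small Python illustration; the sample word is arbitrary):

```python
import re

utf8 = "Straße".encode("utf-8")              # 'ß' becomes two bytes: 0xC3 0x9F
# An ASCII class still works at byte level, but the multi-byte 'ß'
# splits what the user sees as one word into two separate matches:
assert re.findall(rb"[a-zA-Z]+", utf8) == [b"Stra", b"e"]
# A class such as [äöüß] cannot be expressed per byte at all: it would
# have to match *individual* bytes like 0xC3, and 0xC3 is also the lead
# byte of many other characters, producing false positives mid-character.
```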
Author of Total Commander
https://www.ghisler.com
- ghisler(Author)
- Site Admin
- Posts: 50549
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
2MVV
I mean words (2 bytes) that are all 0, like 00 00. This is the end-of-string character in C strings. Currently the Unicode regex function cannot search beyond them.
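The effect of a NUL-terminated string API can be mimicked in a few lines (an illustration, not the actual library code — the buffer contents are made up):

```python
import re

# A buffer whose interesting text sits after an embedded NUL:
buf = b"header\x00\x00payload with match inside"

# What a NUL-terminated C string API would see — everything up to
# the first terminator; the rest of the buffer is invisible to it:
c_view = buf.split(b"\x00", 1)[0]
assert re.search(b"match", c_view) is None

# A length-counted search over the whole buffer has no such limit:
assert re.search(b"match", buf) is not None
```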
Author of Total Commander
https://www.ghisler.com
Ah, you meant nulls.
Is your regex library open-source? Is it hard to fix it?
Maybe you can replace nulls with some other character before passing the data to the regex function? Unicode's basic plane contains 65,536 code points, so there are plenty of unused ones - e.g. 0xFFFF. In that case you can also modify the regex expression to allow searching for nulls: e.g. replace the substrings \x00 and \x{0000} with \xFFFF if 0xFFFF is used as the null stand-in (if we search in Unicode, the expression should be in Unicode too). It would be quite useful, because sometimes we really need to search for nulls. Maybe you could even add an INI option for choosing the replacement character (e.g. with a value $BBBB00AA, where 0xAA is the null stand-in for ANSI and 0xBBBB the one for Unicode) for cases when a user needs the character itself.
BTW, I think TC should also process \xnn and \x{nnnn} in regex expressions internally before passing them to the regex function: simply replace these with the corresponding characters (ANSI converted to Unicode, or direct Unicode), so searching works correctly.
With such conversions, regex could work with simultaneous search for text in all encodings.
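The substitution scheme described above — rewrite NUL escapes in the pattern and literal NULs in the input to the same stand-in character — might look roughly like this (a hedged sketch; `prepare` and the choice of U+FFFF are assumptions, not TC's implementation):

```python
import re

NUL_STAND_IN = "\uffff"   # assumed placeholder; U+FFFF is a noncharacter

def prepare(pattern: str, text: str):
    """Hypothetical pre-pass: rewrite NULs in both the pattern and the
    input so a NUL-terminated regex engine can still 'see' them."""
    pattern = pattern.replace(r"\x00", NUL_STAND_IN)
    pattern = pattern.replace(r"\x{0000}", NUL_STAND_IN)
    text = text.replace("\x00", NUL_STAND_IN)
    return pattern, text

pat, txt = prepare(r"foo\x00bar", "foo\x00bar baz")
assert re.search(pat, txt) is not None
```

Note the caveat raised later in the thread: this breaks if the user genuinely wants to match the stand-in character itself, or uses a character range that happens to include it.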
- ghisler(Author)
- Site Admin
- Posts: 50549
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Yes, it's open source, but it's no longer maintained by its developer. Still, I can try something myself. The problem is that the user may want to search for the 0 word too, and even for both 00 00 and FF FF...
Author of Total Commander
https://www.ghisler.com
There are words that are not currently used in Unicode (according to the Wikipedia Unicode map), so there is no sense in searching for them with regex (regex is usually used for text searches).
BTW, I'm pretty sure I will never need to search for Devanagari or Tamil characters, Arabic word ligatures, or many others, and I think most TC users won't either. But if someone ever needs to search for such characters, they can change the null stand-in value (e.g. set it to 0 to disable the replacement).
- ghisler(Author)
- Site Admin
- Posts: 50549
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Yes, but the user may use ranges that include this replacement character. In other words, handling all that would need a lot of time, which I don't currently have.
Author of Total Commander
https://www.ghisler.com
It would be great to see this in future TC versions.
I think it shouldn't be too hard to implement if you take the existing library and replace nulls with some character in both the regex expression and the input buffers. BTW, you could simply scan the regex expression to find a character the user doesn't use, and use that character as the null replacement, so it would always work (it is impossible to enter all 65,536 characters into the search field).
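Scanning the pattern for an unused code point, as suggested above, is only a few lines (an illustrative sketch; the function name and the scan order are my own assumptions):

```python
def pick_stand_in(pattern: str) -> str:
    """Pick a BMP code point that does not occur literally in the pattern,
    so using it as the NUL stand-in cannot collide with a typed character.
    Caveat: explicit ranges like [\\ufff0-\\uffff] would still defeat this."""
    used = set(pattern)
    for cp in range(0xFFFF, 0xE000, -1):   # try noncharacter/private-use area first
        if chr(cp) not in used:
            return chr(cp)
    raise RuntimeError("pattern somehow uses every candidate code point")

assert pick_stand_in(r"foo\x00bar") == "\uffff"
```

This answers the "user might search for FF FF too" objection for literal characters, but, as noted earlier in the thread, not for character ranges that include the chosen stand-in.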
Perhaps it is possible to use the C library PCRE. Starting with release 8.32 it is possible to build any combination of the three libraries supporting 8-, 16- and 32-bit character strings (including UTF-8, UTF-16 and UTF-32 strings).
Are there any news I might have missed about this topic?
I have quite a few HTML files with a meta content-type tag containing UTF in them - so regex search does not work, unless/until I simply delete this line/tag with a normal text editor.
Which does not make sense to me, but admittedly I don't understand much about UTF/Unicode/software... it simply gets on my nerves : )
Thanks for your attention, best regards,
S.
- ghisler(Author)
- Site Admin
- Posts: 50549
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Sorry, I don't currently see a way to do this. You can try to search the files in ANSI mode if you don't need to find non-English texts.
Author of Total Commander
https://www.ghisler.com
- ghisler(Author)
- Site Admin
- Posts: 50549
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
The problem is that UTF-8 characters can be multiple bytes long, and the search has to deal with upper-/lowercase in search strings like [äöüÄÖÜß]. While UTF-16 has dual-word characters too, those don't usually have upper-/lowercase forms (they are mostly extended Chinese characters etc.).
The same is true for double-byte ANSI, e.g. Chinese encodings. There it's still possible to match each byte/word separately; with UTF-8, this isn't possible.
I could try to convert to UTF-16 before searching, but this would fail with e.g. EXE files, which also contain invalid UTF-8 sequences.
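The invalid-UTF-8 problem with binary files is easy to reproduce (a small Python illustration; the sample bytes only mimic an EXE header):

```python
# Bytes typical of a binary file; a lone 0x90 is not valid UTF-8:
exe_like = b"MZ\x90\x00\xff\xfeplain text inside"

try:
    exe_like.decode("utf-8")            # strict decoding refuses the file
    decoded = True
except UnicodeDecodeError:
    decoded = False
assert not decoded

# A lossy decode keeps going, but it rewrites the bad bytes, so match
# offsets in the decoded text no longer map back to file positions:
lossy = exe_like.decode("utf-8", errors="replace")
assert "\ufffd" in lossy and "plain text" in lossy
```

So converting to UTF-16 first either aborts on binary files or silently changes the data being searched.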
Author of Total Commander
https://www.ghisler.com