-[8.50b7] Search: RegEx doesn't work with UTF-8 or UTF-16

The behaviour described in the bug report is either by design, or would be far too complex/time-consuming to change.

Moderators: Hacker, petermad, Stefan2, white

MVV
Power Member
Posts: 8711
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

-[8.50b7] Search: RegEx doesn't work with UTF-8 or UTF-16

Post by *MVV »

When I check UTF-8 or UTF-16, TC unchecks the RegEx checkbox. So I can only use RegEx with ANSI.

It is logical to have RegEx support for all encodings (ANSI, UTF-8, UTF-16). It is easy to do: use a UTF-16 RegEx library and convert all ANSI or UTF-8 strings (both the RegEx expression and the input string) to UTF-16 before the RegEx call.
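The idea above, normalizing every input to one internal Unicode representation and running a single regex engine on it, can be sketched like this in Python (TC itself is not written in Python; `cp1252` below merely stands in for "ANSI", since the real ANSI code page is system-dependent):

```python
import re

def search_any_encoding(pattern: str, raw: bytes, encoding: str):
    """Decode raw file bytes to Unicode, then apply one Unicode regex to the result."""
    text = raw.decode(encoding)   # ANSI / UTF-8 / UTF-16 all become the same internal form
    return re.search(pattern, text)

# The same Unicode pattern then works against all three encodings:
ansi = "résumé".encode("cp1252")
utf8 = "résumé".encode("utf-8")
utf16 = "résumé".encode("utf-16-le")

assert search_any_encoding(r"r\w+é", ansi, "cp1252")
assert search_any_encoding(r"r\w+é", utf8, "utf-8")
assert search_any_encoding(r"r\w+é", utf16, "utf-16-le")
```

The point of the sketch is only that the regex engine sees one representation; all encoding differences are handled in the decode step, not in the pattern.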
ghisler(Author)
Site Admin
Posts: 50550
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

Currently this isn't supported, sorry. For UTF-16 it would be theoretically possible, because I already use it for file names, but I had some trouble with 0 words (16-bit words with value zero, which only appear in file contents, not in file names).

For UTF-8 it's not possible because characters have different byte widths, so character classes like [a-zA-Z] would be very difficult to support in UTF-8.
Author of Total Commander
https://www.ghisler.com
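The variable-width problem can be made concrete: in UTF-8 the character 'ä' is two bytes (0xC3 0xA4), so a byte-oriented engine cannot express a class like [äöü], while the same class is trivial once the text is decoded. A small illustration (my own example, not TC code):

```python
import re

data = "Bär".encode("utf-8")   # b'B\xc3\xa4r' -- 'ä' occupies two bytes

# Byte level: 'ä' only exists as a two-byte literal sequence...
assert re.search("ä".encode("utf-8"), data)
# ...and a byte class can only match ONE byte, so [\xc3\xa4] cannot stand for 'ä':
assert not re.search(rb"B[\xc3\xa4]r", data)

# Character level, after decoding to Unicode, the class works as the user expects:
assert re.search(r"B[äöü]r", data.decode("utf-8"))
```

This is exactly why classes over non-ASCII characters are hard to support directly on UTF-8 bytes.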

Post by *MVV »

> with 0 words
Can you explain this in more detail? I can't understand it.
> characters have different widths
For UTF-8 the easiest solution is to convert both the search string (once) and the file lines to UTF-16.

Post by *ghisler(Author) »

2MVV
I mean 16-bit words that are all zeros, like 00 00. That is the end-of-string marker in C strings. Currently the Unicode regex function cannot search beyond it.
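The effect of an embedded 00 00 word on a null-terminated string can be simulated directly. In the sketch below (my own simulation, not TC's code), `c_string_view` mimics what a C string API sees: everything after the first NUL is invisible to the regex engine, even though the data is physically present in the buffer:

```python
import re

buffer = "header\x00payload"   # file content with an embedded null

def c_string_view(s: str) -> str:
    """Mimic a null-terminated C string: data after the first NUL is cut off."""
    i = s.find("\x00")
    return s if i < 0 else s[:i]

assert re.search("payload", buffer)                      # the text is really there...
assert not re.search("payload", c_string_view(buffer))   # ...but the C view ends at NUL
```

Any engine that takes null-terminated input has this blind spot, which is the "0 word" trouble described above.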

Post by *MVV »

Ah, you meant nulls.

Is your regex library open source? Is it hard to fix?

Maybe you could replace nulls with some other character before passing the buffer to the regex function? The Unicode BMP contains 64K code points, so there are plenty of unused ones, e.g. 0xFFFF. In that case you could also modify the regex expression to allow searching for nulls: e.g. replace the substrings \x00 and \x{0000} with \xFFFF when 0xFFFF is the null stand-in (if we search in Unicode, our expression should be in Unicode too). That would be quite useful, because sometimes we really do need to search for nulls. You could even add an INI option for choosing the replacement character (e.g. with a value like $BBBB00AA, where 0xAA is the null stand-in for ANSI and 0xBBBB the one for Unicode) for cases when the user needs the replacement character itself.

BTW, I think TC should also process \xnn and \x{nnnn} in regex expressions internally before passing them to the regex function: simply replace these escapes with the characters themselves (ANSI converted to Unicode, or direct Unicode) so they work correctly while searching.

With such conversions, regex could work with a simultaneous search for text in all encodings.
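The masking proposal can be sketched in a few lines. Everything here is my own naming and an assumption about the scheme, not TC internals: nulls in the text are swapped for an assumed-unused sentinel, and explicit null escapes in the pattern are rewritten to the same sentinel, so a search for \x00 still works:

```python
import re

SENTINEL = "\uffff"   # assumed not to occur in the searched text

def mask_nulls(text: str) -> str:
    """Replace embedded nulls so a null-terminated engine can see past them."""
    return text.replace("\x00", SENTINEL)

def rewrite_pattern(pattern: str) -> str:
    """Rewrite \\x00 and \\x{0000}-style escapes so they match the masked sentinel."""
    return re.sub(r"\\x00|\\x\{0+\}", re.escape(SENTINEL), pattern)

buffer = "key\x00value"
# The user writes \x00 as usual; masking happens transparently on both sides:
assert re.search(rewrite_pattern(r"key\x00v"), mask_nulls(buffer))
```

The weak spot, raised later in the thread, is that the sentinel might itself appear in the text or inside a user-written range, which is why a fixed choice like 0xFFFF is only a heuristic.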

Post by *ghisler(Author) »

Yes, it's open source, but it's no longer maintained by its developer. I can try something myself, though. The problem is that the user may want to search for the 0 word too, and even for both 00 00 and FF FF...

Post by *MVV »

There are words that are not currently used in Unicode (according to the Wikipedia Unicode map), so there is no sense in searching for them with a regex (regex is usually used for text searches).

BTW, I'm pretty sure I will never need to search for Devanagari or Tamil characters, Arabic word ligatures, or many others, and I think most TC users will never need them either. But if someone ever does need to search for such characters, they can change the null stand-in value (e.g. set it to 0 to disable the feature).

Post by *ghisler(Author) »

Yes, but the user may use ranges that include this replacement character. In other words, it would need a lot of time, which I don't currently have.

Post by *MVV »

It will be great to see this in further TC versions. :)

I think it shouldn't be too hard to implement if you take the existing library and replace nulls with some character in both the regex expression and the input buffers. BTW, you could simply scan the regex expression to find a character that the user doesn't use, and use that character as the null replacement, so it would always work (it is impossible to enter all 65536 characters into the search field).
white
Power Member
Posts: 5819
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Post by *white »

Perhaps it is possible to use the C library PCRE. Starting with release 8.32, it can be built as any combination of three libraries supporting 8-, 16- and 32-bit character strings (including UTF-8, UTF-16 and UTF-32 strings).
Symlink
Junior Member
Posts: 18
Joined: 2005-01-21, 14:55 UTC
Location: .at

Post by *Symlink »

Is there any news I might have missed about this topic?

I have quite a few HTML files that contain a meta content-type tag with UTF in it, so regex search does not work, unless/until I simply delete this line/tag with a normal text editor.
That does not make sense to me, but admittedly I don't understand much about UTF/Unicode/software... it simply gets on my nerves : )

Thanks for your attention, best regards,
S.

Post by *ghisler(Author) »

Sorry, I currently don't see a way to do this. You can try searching the files in ANSI mode if you don't need to find non-English text.

Post by *MVV »

I think you could simply leave searching for nulls as a known limitation and add basic Unicode support; that alone would already be a very nice improvement. It wouldn't be a problem if it didn't support non-printable characters like 0000 or 0001 (one of which could also be used for masking nulls), as long as it supported all other characters.

Post by *ghisler(Author) »

The problem is that UTF-8 characters can be multiple bytes long, and the search has to deal with upper-/lowercase in search strings like [äöüÄÖÜß]. While UTF-16 has dual-word characters too, those usually don't have upper-/lowercase forms (they are mostly extended Chinese forms etc.).

The same is true for double-byte ANSI, e.g. Chinese encodings. There it's still possible to match each byte/word separately. With UTF-8, this isn't possible.

I could try to convert to UTF-16 before searching, but this will fail with e.g. EXE files, where there are also invalid UTF-8 codes.

Post by *MVV »

You could at least convert UTF-8 to UTF-16 where possible and let invalid characters be converted to anything (at the OS's discretion). Regex is a text-processing tool; it shouldn't have to work on non-printable data. It is also easy to skip invalid UTF-8 sequences; MultiByteToWideChar does it by default.
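The lenient-conversion idea can be illustrated without the Windows API. Python's `errors="replace"` plays the role attributed above to MultiByteToWideChar's default behaviour: invalid byte sequences become U+FFFD replacement characters, and the valid text around them stays searchable (this is an analogy, not TC's actual code path):

```python
import re

# Invalid UTF-8 bytes (as found in e.g. EXE files) surrounding valid UTF-8 text:
raw = b"\xfe\xfftext with caf\xc3\xa9 inside\xff"

text = raw.decode("utf-8", errors="replace")

assert re.search("café", text)    # the valid text portion is still found
assert "\ufffd" in text           # invalid bytes became U+FFFD, not an error
```

With strict decoding the same input would raise an exception instead, which matches the concern that a straight conversion "will fail with e.g. EXE files".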