-[850b15] Regular Expressions not available in Viewer

kleena · Post by *kleena » 2014-01-09, 14:32 UTC

Please correct me if this is not the right place for this report, but I noticed this problem in 8.01 and have now eagerly tested it in 850b15 - it's still not resolved.

Let me describe:
When using the file search (Alt-F7) and specifying a contained text plus the "regular expression" option, the following happens:

- The file search searches correctly

- When viewing one file from the hits using F3 and then continuing to search for text occurrences with F3, the viewer does respect the regular expression mode (which has been set in the Alt-F7 search) - i.e. it jumps across the correct RegEx hits

But then, once you press Ctrl+F in the viewer, the RegEx mode is lost. The search mask doesn't support it and it obviously persistently resets the RegEx mode internally.
Moreover, when closing the viewer and continuing with other hits from the Alt-F7 hit list, the RegExp mode is still gone and there is no way to set it back until a new search is made.

I guess the internal viewer supports the RegExp mode internally, it just doesn't have a field in the search dialog to be able to control it.
Adding that field may fix all the problems described.

Best regards,
KK

Dalai · Post by *Dalai » 2014-01-09, 15:17 UTC

Which type of file are you viewing? Is Lister set to Unicode view? In Unicode (6), UTF-8 (7) and Multimedia (4) RegEx is disabled but it is available in the other view modes.

Regards
Dalai

kleena · Post by *kleena » 2014-01-09, 15:31 UTC

Thank you for that hint - that brings us closer to the core of the problem. I wasn't aware of that.

The file I was viewing really was a UTF8 XML file. The Lister wasn't set to that type explicitly; it probably entered that mode automatically when it saw UTF8 data.
When I untick UTF8, I see the RegEx checkbox and all is OK.

Still, the RegEx obviously isn't properly disabled in the UTF8 mode, because when the RegEx mode is "inherited" from the Alt-F7 search, it works (Lister is searching in my UTF8 file by RegEx). It's apparently only disabled in the search dialog.

Why is it disabled, anyway? It seems to work internally.

Post by *ghisler(Author) » 2014-01-10, 16:44 UTC

Lister is searching in my UTF8 file by RegEx

This doesn't work - it probably searches in ANSI mode with RegEx. Or did you try text with accents?

kleena · Post by *kleena » 2014-01-10, 22:47 UTC

It searches for the RegEx I entered and when I look at the charset mode in the menu, it says UTF8.

Post by *ghisler(Author) » 2014-01-12, 08:59 UTC

Yes, because it's still checked - but searches by regex and UTF8 are impossible because some UTF-8 chars are made of multiple bytes.

Post by *ghisler(Author) » 2014-01-17, 10:30 UTC

OK, I have analyzed the code now:
What happened is that you searched for ANSI+Regex, but not for UTF-8. Since ANSI and UTF-8 are the same for English characters, you will also find files with UTF-8 encoding. You cannot find UTF-8 characters with more than 1 byte this way, though, e.g. accents or umlauts.
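The byte-level reason can be reproduced outside Lister. The following is only a minimal sketch in Python, assuming that "ANSI" means Windows code page 1252; the sample strings and variable names are made up for illustration and say nothing about how Total Commander implements its search.

```python
import re

# Assumption: "ANSI" is taken to be Windows code page 1252.
ascii_text = "Viewer"
accented_text = "über"

# Plain English text produces identical bytes in both encodings.
same_for_ascii = ascii_text.encode("cp1252") == ascii_text.encode("utf-8")

# An umlaut is one byte in cp1252 but two bytes in UTF-8.
ansi_bytes = accented_text.encode("cp1252")
utf8_bytes = accented_text.encode("utf-8")

# A UTF-8 file searched with patterns that were encoded as ANSI.
utf8_file = "Viewer über".encode("utf-8")
ascii_hit = re.search("V.ewer".encode("cp1252"), utf8_file)
accent_hit = re.search(re.escape(ansi_bytes), utf8_file)

print(same_for_ascii, len(ansi_bytes), len(utf8_bytes))
print(ascii_hit is not None, accent_hit is not None)
```

The ASCII-only pattern still finds its hit in the UTF-8 data, while the pattern containing the umlaut does not, because its single ANSI byte never appears in that form in the UTF-8 byte stream.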

Now when you open such a file in Lister, it will of course be shown in UTF-8 mode. RegEx can still work because you didn't search for multi-byte UTF-8 characters. Therefore it stays enabled. However, when you open the search dialog to enter a search string yourself, it cannot be guaranteed that you don't enter multi-byte characters. Therefore RegEx gets disabled.

In conclusion, I prefer to keep the current behaviour, so people can continue to find English texts found via search also in Lister.

kleena · Post by *kleena » 2014-01-17, 10:49 UTC

Thank you for looking into it.
I was expecting something like that and, don't get me wrong, I like the possibility to continue searching in Lister very much. I wouldn't like it removed.

What I found a pity is rather that after opening the search dialog in Lister, this gentle state gets broken: RegEx is internally "unchecked" and you no longer search by RegEx.
So I thought maybe the checkbox could be shown in the dialog, along with a warning that it only searches in ANSI mode in this case...or not be shown, but be internally still forwarded as "checked" (along the lines of "a setting that cannot be seen shouldn't have any effect either").
Would something like that be possible?

But one more thing, coming back to what you said about the impossibility of searching by RegEx in Unicode. Somehow that doesn't fit in my head.

Once you "interpreted" Unicode, you get internal character representations... "ü" and "č" and the like. These are characters, no matter how they looked as bytes. So why wouldn't the RegEx machine, working on the "interpreted" version of the data, be able to treat them as characters?
I actually tested it in Eclipse now: I created a UTF8 file with Czech and German special characters and then searched for "[Üč]" - which worked.
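The same idea (decode first, then match on characters) can be sketched in Python. This is not the Eclipse test itself, just an illustration under the assumption that the data is decoded from UTF-8 before the regex engine sees it; the sample text is invented.

```python
import re

# Hypothetical UTF-8 file content with German and Czech characters.
raw_bytes = "Grüße, Üben, účet".encode("utf-8")

# Decode to characters first, then run the regex on the decoded text.
decoded_text = raw_bytes.decode("utf-8")
hits = re.findall("[Üč]", decoded_text)

print(hits)
```

Here the character class sees whole characters, so it does not matter how many bytes each one occupied in the file.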

So, wouldn't there be a way to align it all and just have RegEx work with Unicode too?

Post by *ghisler(Author) » 2014-01-17, 17:23 UTC

The problem is when the user writes something like ü*

Encoded as UTF-8 and read as ANSI, this looks something like Ã¼*.
The problem here is that not just ¼ must be repeated, but the entire 2 byte sequence. This is just a small example of the problems which occur with multi-byte characters and regular expressions.
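The quantifier problem can be shown with a short Python sketch. It assumes a regex engine that works on raw bytes (Python's `re` with `bytes` patterns stands in for that here) and uses code page 1252 as "ANSI"; it is not meant to describe Lister's actual engine.

```python
import re

# "ü" encoded as UTF-8 and misread as cp1252 gives the two characters "Ã¼".
as_ansi = "ü".encode("utf-8").decode("cp1252")

data = "üüü".encode("utf-8")        # six bytes: C3 BC C3 BC C3 BC
naive_pattern = "ü*".encode("utf-8")  # bytes C3 BC 2A: "*" binds to BC only

# Byte-level match: one C3, then as many BC bytes as follow (just one).
byte_match = re.match(naive_pattern, data).group()

# Character-level match on decoded text repeats the whole character.
char_match = re.match("ü*", "üüü").group()

# A byte-level pattern has to group the full two-byte sequence explicitly.
grouped_match = re.match(b"(?:\xc3\xbc)*", data).group()

print(as_ansi, len(byte_match), len(char_match), len(grouped_match))
```

The naive byte pattern stops after the first ü, while the character-level pattern and the explicitly grouped byte pattern both cover all three.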