Automatic text encoding in search?

Lefteous
Power Member
Posts: 9535
Joined: 2003-02-09, 01:18 UTC
Location: Germany

Automatic text encoding in search?

Post by *Lefteous »

Following the discussion on the search dialog I would like to know whether there is some kind of automatic text-encoding detection in search. Currently the user has to check all available text encodings to make sure the text inside a plain-text file will be found. As I understand it, this means the file is searched four times, which probably takes a long time when the files are big and there are many of them.
In Lister there is a detection mechanism, and the file is displayed according to it.
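For reference, a minimal Python sketch (all names hypothetical, not TC's actual code) of how one read can serve several encodings: the needle, not the file, is converted once per encoding and matched against the same buffer.

```python
# Hypothetical sketch: read the data once, encode the needle once per
# encoding, and match every encoded pattern against the same buffer.
# The encoding list stands in for TC's ANSI/UTF-8/UTF-16 options.

def multi_encoding_search(data, needle):
    """Return the encodings under which `needle` occurs in `data`."""
    found = []
    for enc in ("cp1252", "utf-8", "utf-16-le"):
        try:
            pattern = needle.encode(enc)
        except UnicodeEncodeError:
            continue  # needle not representable in this encoding
        if pattern in data:
            found.append(enc)
    return found

print(multi_encoding_search("Überprüfen".encode("utf-8"), "Überprüfen"))
# prints ['utf-8']
```

Under this scheme the cost of extra encodings is extra pattern matching over the same bytes, not extra file reads.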
Sob
Power Member
Posts: 941
Joined: 2005-01-19, 17:33 UTC

Post by *Sob »

I don't think there is any detection. And if it were to be added, it should be an optional feature, because TC's ability to find text in any kind of file (binaries, damaged files, ...) is very useful. It wouldn't work well if TC decided at the beginning of a file that it uses e.g. UTF-8 and then only searched for that.

Regarding the efficiency of searching for multiple encodings at once, I checked TC using Process Monitor and Process Explorer. Files are read only once, which is fine. But the searching itself uses only one thread, so it can be a bottleneck if you have a very fast disk. I'm sure it would be possible to split it into more threads if needed (one for reading the file and then one searching thread for each encoding).
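The split described here (one reader feeding one search worker per encoding) could be sketched like this in Python; block boundaries are ignored for brevity (a match straddling two blocks would be missed), and all names are illustrative:

```python
# Illustrative sketch of one reader feeding one search worker per encoding.
# Matches spanning block boundaries are ignored here for brevity; a real
# implementation would keep a small overlap between consecutive blocks.
import queue
import threading

def threaded_search(blocks, needle, encodings=("cp1252", "utf-8", "utf-16-le")):
    hits = set()
    queues = {enc: queue.Queue() for enc in encodings}

    def worker(enc):
        pattern = needle.encode(enc)
        while True:
            block = queues[enc].get()
            if block is None:       # sentinel: reader is done
                return
            if pattern in block:
                hits.add(enc)

    threads = [threading.Thread(target=worker, args=(enc,)) for enc in encodings]
    for t in threads:
        t.start()
    for block in blocks:            # the "reader"; in TC this would be file I/O
        for q in queues.values():
            q.put(block)
    for q in queues.values():
        q.put(None)
    for t in threads:
        t.join()
    return hits
```

Each block is read from disk once and handed to every worker, so the single-threaded search bottleneck is spread across one thread per encoding.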
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

I have wondered for a long time why TC has a basic encoding detection for CBC and Lister but doesn't offer such a thing for the search function.
So
+1 for having an additional encoding detection.
This might work as a mutual exclusion option, i.e. all other text encoding options are disabled when automatic detection is enabled and vice versa.


BTW, I don't think the read performance is much of a bottleneck when doing a basic text search with all three/four basic encodings, unless you have a really ancient CPU. Even a ~15-year-old system can stream several hundred MB/s through the ALU for byte-checking purposes. I did some benchmarks with my PCREsearch plug-in and it wasn't that much faster than TC using all encodings.

2Sob
I think it would make more sense to detect SSDs (= non-mechanical drives) first and, if found, possibly use parallel I/O threads on top of that on a multi-processor system. For a mechanical HDD, no matter how fast it is, parallel I/O is always a bad choice.
TC plugins: PCREsearch and RegXtract
ghisler(Author)
Site Admin
Posts: 48088
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

Some files like EXE can contain a mix of UTF-8, UTF-16, plain Ansi etc. It wouldn't be a good idea to rely on auto-detect, that would only work with a small group of files.
Author of Total Commander
https://www.ghisler.com
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

ghisler(Author) wrote:It wouldn't be a good idea to rely on auto-detect, that would only work with a small group of files.
I agree that it only works for mostly plain-text files, but you could still have it as an OPTIONAL detection, e.g. for when the user pre-filters certain file types in the search mask anyway.
Binary file types would be treated as ANSI by such a detection mechanism, because the UTF-8 interpretation will be invalid and UTF-16 probably won't fit either. So I don't think it would be much of a problem that, with this option, the user would only find ANSI strings in such files; he would still be able to switch back to the old text-search options that we have now.
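The fallback behavior described here could look roughly like this (a hypothetical sketch; the thresholds and the analysis-buffer idea are assumptions, not TC's actual logic):

```python
# Hypothetical detection over a small analysis buffer: BOM first, then a
# strict UTF-8 validity check, then a crude UTF-16-LE plausibility test,
# and finally ANSI (cp1252) as the fallback - which is what binary files
# would normally end up as, exactly as described above.

def detect_encoding(buf):
    if buf.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if buf.startswith(b"\xff\xfe"):
        return "utf-16-le"
    try:
        buf.decode("utf-8")        # strict check; binary data usually fails
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # For Latin-script UTF-16-LE text, every second byte is typically zero.
    if buf and buf[1::2].count(0) > len(buf) // 4:
        return "utf-16-le"
    return "cp1252"                # the ANSI fallback for everything else
```

Note that plain ASCII is reported as UTF-8 here, which is harmless since the two encodings agree on those bytes.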
j7n
Member
Posts: 168
Joined: 2005-08-07, 21:56 UTC

Post by *j7n »

UTF-8 is possible in mixed-format binary files, usually in the form of embedded XML, and UTF-16 (the old two-byte Unicode) is as common as ANSI in Windows DLLs. If I want to look for a variable or a registry tweak, I definitely want to search both ANSI and Unicode at the same time.

If searching in music/video metadata were ever implemented in TC, that would be another example where UTF-8 and Unicode both occur.

For example, TC finds these two strings in German AuthHost.exe.mui on Windows 10.

"lässt eine Benutzereingabeaufforderung für" (unicode)

"Überprüfen Sie die Netzwerkverbindung" (UTF-8)

Auto-detection might increase speed and decrease false positives. But search is fast enough as it is. Wrong matches can be avoided by simply increasing the length of the query or unselecting encodings manually.
#148174 Personal license
Running Total Commander v8.52a
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

Non-text files may contain text in any encoding (ANSI, OEM, UTF-8, UTF-16 etc.) or charset. But for text files it would be possible to add some simple auto-detection... However, it may be a problem to detect whether a file is a text file at all (it may contain text at the beginning but binary contents after that - e.g. bash SFX scripts in front of archives)...
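A crude version of that text-vs-binary decision might look as follows (the window size and threshold are arbitrary assumptions for illustration):

```python
# Rough text-vs-binary probe over the first few KB: NUL bytes or too many
# control characters suggest binary. As noted above, a file with a text
# header followed by binary data (e.g. an SFX script in front of an
# archive) will still fool a sampling check like this one. Note also that
# UTF-16 text contains NULs, so a real check would handle BOMs first.

def looks_like_text(sample, window=4096):
    sample = sample[:window]
    if not sample or b"\x00" in sample:
        return False
    control = sum(b < 0x20 and b not in (0x09, 0x0A, 0x0D) for b in sample)
    return control / len(sample) < 0.10
```

This is the same style of heuristic many editors use; it is cheap but, as the post says, inherently fallible for mixed files.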
j7n
Member
Posts: 168
Joined: 2005-08-07, 21:56 UTC

Post by *j7n »

Perhaps automatic selection of encoding could be done for text files where it is unambiguous: HTML and XML files with a declared encoding, and other files that begin with a UTF byte order mark. Then maybe search could also use the right declared "ansi" encoding instead of the system default.
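Those unambiguous cases (a byte order mark, or an encoding declared in an XML/HTML header) can be sketched as follows; the regexes are simplified illustrations, not a full HTML parser:

```python
# Sketch for the unambiguous cases: a UTF byte order mark, or an encoding
# declared in an XML prolog / HTML meta tag near the start of the file.
# Simplified regexes for illustration only - real HTML charset sniffing
# (as specified by WHATWG) is considerably more involved.
import re

BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
)

def declared_encoding(head):
    """Return a declared/implied encoding name, or None if ambiguous."""
    for bom, enc in BOMS:
        if head.startswith(bom):
            return enc
    m = re.search(rb'encoding=["\']([\w-]+)["\']', head[:1024])   # XML prolog
    if not m:
        m = re.search(rb'charset=["\']?([\w-]+)', head[:1024])    # HTML meta
    return m.group(1).decode("ascii").lower() if m else None
```

Files where this returns None would simply keep today's manual encoding checkboxes.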
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

2j7n and MVV
Well, that is how things have worked in many (most) text editors out there for years now, and it is also how I implemented my PCREsearch plug-in (including a configurable analysis buffer, but without the HTML/XML part, as it's superfluous), so there is no need to reinvent the wheel.
And like I said: I'm pretty sure that Christian already does similar things in the automatic detection for TC's CBC. He'd just need to port this detection mechanism to the search function, maybe tune it a bit, maybe also make the data-analysis buffer for the detection user-configurable (ini option), and then we'd have it.
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

In Lister a wrong detection can't make you miss anything, but in Find Files it can, so this feature should only be used on pure text files.

And, it shouldn't be configurable via INI, it should be another search encoding option (ANSI, ASCII, UTF-16LE, UTF-8, Auto-Detect Encoding).
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

MVV wrote:And, it shouldn't be configurable via INI
I said:
make the data analysis buffer for the detection user configurable (ini option)
so of course not the option itself.
MVV wrote:it should be another search encoding option (ANSI, ASCII, UTF-16LE, UTF-8, Auto-Detect Encoding)
Yes, exactly like that, and like I said: this might work as a mutual exclusion option, i.e. all other text encoding options are disabled when automatic detection is enabled and vice versa.
MVV wrote:so this feature should only be used on pure text files
Again: I already explained that this would be completely optional for the user. If he chooses to use this option, he will only find ANSI strings in non-text files, but he is of course still free to use the usual encoding options that we have now instead (if the results weren't satisfactory). Besides that, he is free to pre-filter file types through the other search options anyway.
Lefteous
Power Member
Posts: 9535
Joined: 2003-02-09, 01:18 UTC
Location: Germany

Post by *Lefteous »

2ghisler(Author)
Some files like EXE can contain a mix of UTF-8, UTF-16, plain Ansi etc. It wouldn't be a good idea to rely on auto-detect, that would only work with a small group of files.
In Lister there seems to be a mechanism which is used to find out whether binary or text mode should be used. I think something similar should be used here too. For binary files (which are not handled by a plugin) all encodings make sense - so all should be used. I don't see how the user could decide which ones not to use. I do see the use case that binary files should not be searched at all, though.
For text files, auto-detection could be used.