Automatic text encoding in search?

Lefteous
Power Member
Posts: 9535
Joined: 2003-02-09, 01:18 UTC
Location: Germany

Automatic text encoding in search?

Post by *Lefteous »

Following the discussion on the search dialog I would like to know whether there is some kind of automatic text-encoding detection in search. Currently the user has to check all available text encodings to make sure the text inside a plain-text file will be found. As I understand it, this means the file is searched four times, which probably takes a long time when the files are big and there are many of them.
In Lister there is a detection mechanism, and the file is displayed according to it.
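For reference, a minimal Python sketch (all names hypothetical, not TC's actual code) of how one read can serve several encodings: the needle, not the file, is converted once per encoding and matched against the same buffer.

```python
# Hypothetical sketch: read the data once, encode the needle once per
# encoding, and match every encoded pattern against the same buffer.
# The encoding list stands in for TC's ANSI/UTF-8/UTF-16 options.

def multi_encoding_search(data, needle):
    """Return the encodings under which `needle` occurs in `data`."""
    found = []
    for enc in ("cp1252", "utf-8", "utf-16-le"):
        try:
            pattern = needle.encode(enc)
        except UnicodeEncodeError:
            continue  # needle not representable in this encoding
        if pattern in data:
            found.append(enc)
    return found

print(multi_encoding_search("Überprüfen".encode("utf-8"), "Überprüfen"))
# prints ['utf-8']
```

Under this scheme the cost of extra encodings is extra pattern matching over the same bytes, not extra file reads.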
Sob
Power Member
Posts: 941
Joined: 2005-01-19, 17:33 UTC

Post by *Sob »

I don't think there is any detection. And if it were to be added, it should be an optional feature, because TC's ability to find text in any kind of file (binaries, damaged files, ...) is very useful. It wouldn't work well if TC decided at the beginning of a file that it uses e.g. UTF-8 and then only searched for that.

Regarding the efficiency of searching for multiple encodings at once, I checked TC using Process Monitor and Process Explorer. Files are read only once, which is fine. But the searching itself uses only one thread, so it can be a bottleneck if you have a very fast disk. I'm sure it would be possible to split it into more threads if needed (one for reading the file and then one searching thread for each encoding).
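The split described here (one reader feeding one search worker per encoding) could be sketched like this in Python; block boundaries are ignored for brevity (a match straddling two blocks would be missed), and all names are illustrative:

```python
# Illustrative sketch of one reader feeding one search worker per encoding.
# Matches spanning block boundaries are ignored here for brevity; a real
# implementation would keep a small overlap between consecutive blocks.
import queue
import threading

def threaded_search(blocks, needle, encodings=("cp1252", "utf-8", "utf-16-le")):
    hits = set()
    queues = {enc: queue.Queue() for enc in encodings}

    def worker(enc):
        pattern = needle.encode(enc)
        while True:
            block = queues[enc].get()
            if block is None:       # sentinel: reader is done
                return
            if pattern in block:
                hits.add(enc)

    threads = [threading.Thread(target=worker, args=(enc,)) for enc in encodings]
    for t in threads:
        t.start()
    for block in blocks:            # the "reader"; in TC this would be file I/O
        for q in queues.values():
            q.put(block)
    for q in queues.values():
        q.put(None)
    for t in threads:
        t.join()
    return hits
```

Each block is read from disk once and handed to every worker, so the single-threaded search bottleneck is spread across one thread per encoding.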
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

I have wondered for a long time why TC has a basic encoding detection for CBC and Lister but doesn't offer such a thing for the search function.
So
+1 for having an additional encoding detection.
This might work as a mutual exclusion option, i.e. all other text encoding options are disabled when automatic detection is enabled and vice versa.


BTW, I don't think the read performance is much of a bottleneck when doing a basic text search with all three/four basic encodings, unless you have a really ancient CPU. Even a ~15-year-old system can stream several hundred MB/s through the ALU for byte-checking purposes. I did some benchmarks with my PCREsearch plug-in and it wasn't that much faster than TC using all encodings.

2Sob
I think it would make more sense to detect SSDs (= non-mechanical drives) first and, if found, possibly use parallel I/O threads on top of that on a multi-processor system. For a mechanical HDD, no matter how fast it is, parallel I/O is always a bad choice.
TC plugins: PCREsearch and RegXtract
ghisler(Author)
Site Admin
Posts: 48088
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

Some files like EXE can contain a mix of UTF-8, UTF-16, plain Ansi etc. It wouldn't be a good idea to rely on auto-detect, that would only work with a small group of files.
Author of Total Commander
https://www.ghisler.com
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

ghisler(Author) wrote:It wouldn't be a good idea to rely on auto-detect, that would only work with a small group of files.
I agree that it only works for mostly plain-text files, but you could still have it as an OPTIONAL detection, e.g. for when the user pre-filters certain file types in the search mask anyway.
Binary file types would be treated as ANSI by such a detection mechanism, because the UTF-8 interpretation will be invalid and UTF-16 probably won't fit either. So I don't think it would be much of a problem that, with this option, the user would only find ANSI strings in such files; he would still be able to switch back to the old text-search options that we have now.
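The fallback behavior described here could look roughly like this (a hypothetical sketch; the thresholds and the analysis-buffer idea are assumptions, not TC's actual logic):

```python
# Hypothetical detection over a small analysis buffer: BOM first, then a
# strict UTF-8 validity check, then a crude UTF-16-LE plausibility test,
# and finally ANSI (cp1252) as the fallback - which is what binary files
# would normally end up as, exactly as described above.

def detect_encoding(buf):
    if buf.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if buf.startswith(b"\xff\xfe"):
        return "utf-16-le"
    try:
        buf.decode("utf-8")        # strict check; binary data usually fails
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # For Latin-script UTF-16-LE text, every second byte is typically zero.
    if buf and buf[1::2].count(0) > len(buf) // 4:
        return "utf-16-le"
    return "cp1252"                # the ANSI fallback for everything else
```

Note that plain ASCII is reported as UTF-8 here, which is harmless since the two encodings agree on those bytes.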
j7n
Member
Posts: 168
Joined: 2005-08-07, 21:56 UTC

Post by *j7n »

UTF-8 is possible in mixed-format binary files, usually in the form of embedded XML, and UTF-16 (the old two-byte Unicode) is as common as ANSI in Windows DLLs. If I want to look for a variable or a registry tweak, I definitely want to search both ANSI and Unicode at the same time.

If searching in music/video metadata were ever implemented in TC, that would be another example where UTF-8 and Unicode both occur.

For example, TC finds these two strings in German AuthHost.exe.mui on Windows 10.

"lässt eine Benutzereingabeaufforderung für" (unicode)

"Überprüfen Sie die Netzwerkverbindung" (UTF-8)

Auto-detection might increase speed and decrease false positives. But search is fast enough as it is. Wrong matches can be avoided by simply increasing the length of the query or unselecting encodings manually.
#148174 Personal license
Running Total Commander v8.52a
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

Non-text files may contain text in any encoding (ANSI, OEM, UTF-8, UTF-16 etc.) or charset. But for text files it would be possible to add some simple auto-detection... However, it may be a problem to detect whether a file is a text file at all (it may contain text at the beginning but binary contents after that - e.g. bash SFX scripts in front of archives)...
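A crude version of that text-vs-binary decision might look as follows (the window size and threshold are arbitrary assumptions for illustration):

```python
# Rough text-vs-binary probe over the first few KB: NUL bytes or too many
# control characters suggest binary. As noted above, a file with a text
# header followed by binary data (e.g. an SFX script in front of an
# archive) will still fool a sampling check like this one. Note also that
# UTF-16 text contains NULs, so a real check would handle BOMs first.

def looks_like_text(sample, window=4096):
    sample = sample[:window]
    if not sample or b"\x00" in sample:
        return False
    control = sum(b < 0x20 and b not in (0x09, 0x0A, 0x0D) for b in sample)
    return control / len(sample) < 0.10
```

This is the same style of heuristic many editors use; it is cheap but, as the post says, inherently fallible for mixed files.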
j7n
Member
Posts: 168
Joined: 2005-08-07, 21:56 UTC

Post by *j7n »

Perhaps automatic selection of encoding could be done for text files where it is unambiguous: HTML and XML files with a declared encoding, and other files that begin with a UTF byte order mark. Then maybe search could also use the right declared "ansi" encoding instead of the system default.
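Those unambiguous cases (a byte order mark, or an encoding declared in an XML/HTML header) can be sketched as follows; the regexes are simplified illustrations, not a full HTML parser:

```python
# Sketch for the unambiguous cases: a UTF byte order mark, or an encoding
# declared in an XML prolog / HTML meta tag near the start of the file.
# Simplified regexes for illustration only - real HTML charset sniffing
# (as specified by WHATWG) is considerably more involved.
import re

BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
)

def declared_encoding(head):
    """Return a declared/implied encoding name, or None if ambiguous."""
    for bom, enc in BOMS:
        if head.startswith(bom):
            return enc
    m = re.search(rb'encoding=["\']([\w-]+)["\']', head[:1024])   # XML prolog
    if not m:
        m = re.search(rb'charset=["\']?([\w-]+)', head[:1024])    # HTML meta
    return m.group(1).decode("ascii").lower() if m else None
```

Files where this returns None would simply keep today's manual encoding checkboxes.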
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

2j7n and MVV
Well, that is how things have worked in many (most) text editors out there for years now, and it is also how I implemented my PCREsearch plug-in (including a configurable analysis buffer, but without the HTML/XML part, as it's superfluous), so there is no need to reinvent the wheel.
And like I said: I'm pretty sure that Christian already does similar things in the automatic detection for TC's CBC. He'd just need to port this detection mechanism to the search function, maybe tune it a bit, maybe also make the data-analysis buffer for the detection user-configurable (ini option), and then we'd have it.
MVV
Power Member
Posts: 8702
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

In Lister a wrong detection can't make you miss anything, but in Find Files it can, so this feature should only be used on pure text files.

And, it shouldn't be configurable via INI, it should be another search encoding option (ANSI, ASCII, UTF-16LE, UTF-8, Auto-Detect Encoding).
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

MVV wrote:And, it shouldn't be configurable via INI
I said:
make the data analysis buffer for the detection user configurable (ini option)
so of course not the option itself.
MVV wrote:it should be another search encoding option (ANSI, ASCII, UTF-16LE, UTF-8, Auto-Detect Encoding)
Yes, exactly like that, and like I said: this might work as a mutual exclusion option, i.e. all other text encoding options are disabled when automatic detection is enabled and vice versa.
MVV wrote:so this feature should only be used on pure text files
Again: I already explained that this would be completely optional for the user. If he chooses to use this option, he will only find ANSI strings in non-text files, but he is of course still free to use the usual encoding options that we have now instead (if the results weren't satisfactory). Besides that, he is free to pre-filter file types through the other search options anyway.
Lefteous
Power Member
Posts: 9535
Joined: 2003-02-09, 01:18 UTC
Location: Germany

Post by *Lefteous »

2ghisler(Author)
Some files like EXE can contain a mix of UTF-8, UTF-16, plain Ansi etc. It wouldn't be a good idea to rely on auto-detect, that would only work with a small group of files.
In Lister there seems to be a mechanism which is used to find out whether binary or text mode should be used. I think something similar should be used here too. For binary files (which are not handled by a plugin) all encodings make sense - so all should be used. I don't see how the user could decide which ones not to use. I do see the use case that binary files should not be searched at all, though.
For text files, auto-detection could be used.