How to find/detect Unicode UTF-16/UTF-8 txt files in search?
How can I find all txt files saved in a Unicode format?
There is no completely reliable way to detect Unicode (UTF-16) files, since detection relies on statistics, but it works in most cases.
For UTF-8 it is more reliable; it might only fail for some (very) short text snippets.
The only plugins I'm aware of:
EncInfo
It sometimes fails, since it also tries to guess the ANSI/OEM encoding, which is inherently unreliable because such guessing rests on nothing but byte statistics.
PCREsearch
I tuned my PCREsearch plugin quite a lot to prevent false detections.
It should now come close to popular editors like Notepad++ and similar programs that try to auto-detect encodings.
The default ini file already has an encoding check field:
regex4type=-1
So just enable field names by setting
ModFieldName=yes
and you can use it in TC.
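To illustrate what such auto-detection typically boils down to, here is a minimal Python sketch of the usual approach (an illustration of the general technique only, not PCREsearch's actual algorithm): check for a BOM first, then try strict UTF-8 validation, then fall back to statistics.
Code:
# Minimal sketch of editor-style encoding auto-detection
# (general technique only, NOT PCREsearch's actual algorithm).
def guess_encoding(data: bytes) -> str:
    # 1. A byte-order mark is the only unambiguous marker.
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8 (BOM)'
    if data.startswith(b'\xff\xfe'):
        return 'utf-16-le (BOM)'
    if data.startswith(b'\xfe\xff'):
        return 'utf-16-be (BOM)'
    # 2. Strict UTF-8 validation: random ANSI/OEM bytes almost never
    #    form valid multi-byte UTF-8 sequences by accident.
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        pass
    # 3. Statistics: BOM-less UTF-16LE text that is mostly ASCII has a
    #    zero byte in (almost) every odd position.
    if data and data[1::2].count(0) > len(data) // 4:
        return 'utf-16-le (guessed)'
    return 'ansi/oem (guessed)'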
TC plugins: PCREsearch and RegXtract
I think UTF-16 is easy to detect: the first 128 characters of the charset (plain ASCII) always carry a zero byte in UTF-16, so any Latin letter, digit or punctuation mark has a zero byte next to it (trailing, in the little-endian form Windows uses).
And since TC supports searching in UTF-16, you can simply find all files containing e.g. a space (or another character that should appear in most text files) with only the UTF-16 option checked, or simply search for 20 00 in hex.
What's more, text files in other encodings shouldn't contain a zero byte at all, so you can just search for 00 in hex.
Detection of UTF-8 is much harder.
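A minimal Python sketch of the zero-byte heuristic described above (the file name is hypothetical; this only illustrates what the hex searches 20 00 and 00 look for, not how TC implements its search):
Code:
# "20 00 in hex" is an ASCII space encoded as UTF-16LE;
# "00 in hex" catches any zero byte at all.
def looks_like_utf16le(data: bytes) -> bool:
    return b'\x20\x00' in data

def has_zero_byte(data: bytes) -> bool:
    # Legacy single-byte text encodings shouldn't produce zero bytes.
    return b'\x00' in data

with open('example.txt', 'rb') as f:   # hypothetical file name
    raw = f.read()
print(looks_like_utf16le(raw), has_zero_byte(raw))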
MVV wrote: you can simply find all files containing e.g. a space
Nice guess, but what if you have only one line and only characters > U+00FF, like CJK text files?
Also you will detect a lot of binary files that way.
MVV wrote: text files in other encodings shouldn't contain a zero byte
Nope, there is no law or rule that says so. A txt file can be anything.
UTF-8 and UTF-16 allow binary zeros w/o invalidating the encoding scheme.
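This is easy to verify; a quick Python demonstration (nothing TC-specific):
Code:
# U+0000 is a valid code point; both encoding schemes accept it,
# so a zero byte doesn't invalidate the data.
s = 'a\x00b'
assert s.encode('utf-8') == b'a\x00b'       # NUL survives as-is
assert b'a\x00b'.decode('utf-8') == s       # and is valid UTF-8
assert s.encode('utf-16-le') == b'a\x00\x00\x00b\x00'
print('NUL round-trips in UTF-8 and UTF-16')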
TC plugins: PCREsearch and RegXtract
Quote: Nice guess, but what if you have only one line and only characters > U+00FF, like CJK text files?
Of course some files with only a few characters from higher Unicode bands may be missed. Well, slow checking is required here.
Quote: Also you will detect a lot of binary files that way.
Aren't you excluding binary files by extension? Some binary files may be treated as correct Unicode files even if they don't contain really meaningful text but only a random set of Unicode characters from completely different bands.
Quote: Nope, there is no law or rule that says so. A txt file can be anything. UTF-8 and UTF-16 allow binary zeros w/o invalidating the encoding scheme.
Text files don't contain null characters (or they are not text files anymore but binary ones), and UTF-8 files don't contain zero bytes either: the first byte of a sequence can't be zero because text files don't contain null characters, and continuation bytes can't be zero because they always have some high bits set. A null in an ANSI or UTF-8 string is a terminating character, and nulls are not placed into text files.
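The structural half of this claim can be verified mechanically with a brute-force check (a quick Python sketch); it shows that the only code point whose UTF-8 encoding contains a zero byte is U+0000 itself, which is exactly the case disputed below:
Code:
# Lead bytes of multi-byte sequences look like 11xxxxxx and
# continuation bytes like 10xxxxxx, so neither can ever be 0x00.
# Hence a zero byte in UTF-8 can only be an encoded U+0000.
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:        # surrogates are not encodable
        continue
    if 0 in chr(cp).encode('utf-8'):
        assert cp == 0
print('only U+0000 encodes to a zero byte')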
MVV wrote: Of course some files with only a few characters from higher Unicode bands may be missed. Well, slow checking is required here.
So that's why I don't see why you're suggesting such things in the first place.
An average user doesn't want to bother with hex or internal encoding schemes.
MVV wrote: Text files don't contain null characters (or they are not text files anymore but binary ones)
...
A null in an ANSI or UTF-8 string is a terminating character, and nulls are not placed into text files
Again: there is no rule that says so!
This is an old programmers' assumption, probably originating somewhere in the 70s/80s and on some Unix workstations.
I've seen tons of configuration files for games and other software that use .txt files for storing data C-string style.
MVV wrote: UTF-8 files don't contain zero bytes either: the first byte of a sequence can't be zero because text files don't contain null characters
No, see above.
ASCII is a subset of UTF-8, that's why binary zeros must be allowed in the first place.
MVV wrote: continuation bytes can't be zero because they always have some high bits set
Cpt. Obvious. Of course I'm not talking about mixing these zero bytes into open sequences.
But between closed sequences you can mix in zero bytes as much as you want.
Just open such files in Notepad2 and you'll see.
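For instance (a quick Python check):
Code:
# Zero bytes between complete (closed) UTF-8 sequences decode fine;
# they are simply U+0000 characters within the text.
data = 'é'.encode('utf-8') + b'\x00' + '漢'.encode('utf-8')
print(repr(data.decode('utf-8')))   # 'é\x00漢' - still valid UTF-8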
TC plugins: PCREsearch and RegXtract
BTW, you haven't specified what you mean by text files.
Quote: This is an old programmers' assumption, probably originating somewhere in the 70s/80s and on some Unix workstations.
I've seen tons of configuration files for games and other software that use .txt files for storing data C-string style.
These are not text files but binary ones containing text. No pure UTF-8 text file can contain a null character. And most text editors will kill nulls if you save such a file.
MVV wrote: BTW, you haven't specified what you mean by text files.
Well, maybe because "text file" isn't exactly a definition?
We're talking about interpreting raw bytes as characters, as opposed to a complex encoding scheme like e.g. office files.
MVV wrote: No pure UTF-8 text file can contain a null character.
How often do I have to repeat myself?
A null character is valid.
UTF-8 is an encoding scheme for Unicode, where control characters are completely valid, and U+0000 is such a character.
I won't bother discussing this any further.
MVV wrote: And most text editors will kill nulls if you save such a file
Maybe you should use other editors then.
Show me at least five (not counting the standard Windows Notepad).
TC plugins: PCREsearch and RegXtract