How to find/detect Unicode UTF-16/UTF-8 txt files in search?

English support forum

Hurdet
Power Member
Posts: 716
Joined: 2003-05-10, 18:02 UTC


Post by *Hurdet »

How can I find all txt files saved in Unicode format?
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

There is no completely reliable way to detect Unicode (UTF-16) files, since detection relies on statistical heuristics, but it works in most cases.
For UTF-8 it is more reliable; it might only fail for some (very) short text snippets.

The only plugins I'm aware of:
EncInfo
It fails sometimes, since it also tries to guess the ANSI/OEM encoding, which is completely unreliable because it depends on nothing but statistics.

PCREsearch
I tuned my PCREsearch plugin quite a lot to prevent false detections.
It should now come close to popular editors like Notepad++ and similar programs that try to auto-detect encodings.
The default ini file already has an encoding-check field:
regex4type=-1

So just enable field names by setting
ModFieldName=yes
and you can use it in TC.
TC plugins: PCREsearch and RegXtract
MVV
Power Member
Posts: 8711
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

I think UTF-16 is easy to detect: in UTF-16LE the first 128 characters (the ASCII range) always have a trailing zero byte. So any digit or punctuation mark has a zero byte after it.

But since TC supports searching in UTF-16, you can simply find all files containing e.g. a space (or another character that should appear in most text files) with only the UTF-16 option checked, or simply search for 20 00 in hex.

Even more, text files in other encodings never contain a zero byte, so you can just search for 00 in hex.
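A minimal sketch of this zero-byte heuristic (just an illustration, not code from TC or any plugin; the threshold is arbitrary, and as the replies below point out, it misses files that contain only characters above U+00FF):

Code:
# Illustration of the zero-byte heuristic: UTF-16LE text that is mostly
# ASCII has a 00 high byte in most code units; a BOM is an even stronger hint.
def looks_like_utf16le(path, sample_size=4096):
    with open(path, 'rb') as f:
        data = f.read(sample_size)
    if data.startswith(b'\xff\xfe'):      # UTF-16LE BOM
        return True
    if len(data) < 2 or len(data) % 2:    # too short or odd length
        return False
    high_bytes = data[1::2]               # high byte of each little-endian code unit
    return high_bytes.count(0) / len(high_bytes) > 0.3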

Detection of UTF-8 is much harder.
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

MVV wrote:you can simply find all files containing e.g. a space
Nice guess, but what if you have only one line and only characters > U+00FF, like CJK text files?
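A quick check shows the problem (plain Python, the characters are just an example):

Code:
# A short CJK-only line in UTF-16LE contains no 0x20 0x00 pair and,
# without a BOM, not even a single zero byte.
text = u'\u4f60\u597d'                   # "ni hao"
print(text.encode('utf-16-le').hex())    # 604f7d59 -> no spaces, no zero bytes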
Also you will detect a lot of binary files that way.
MVV wrote:text files in other encodings never contain a zero byte
Nope, there is no law or rule that says so. A txt file can be anything.
UTF-8 and UTF-16 allow binary zeros without invalidating the encoding scheme.
TC plugins: PCREsearch and RegXtract
MVV
Power Member
Posts: 8711
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

milo1012 wrote:Nice guess, but what if you have only one line and only characters > U+00FF, like CJK text files?
Of course some files with only a few characters from the higher Unicode ranges may be missed. Well, slower checking would be required there.
milo1012 wrote:Also you will detect a lot of binary files that way.
Aren't you excluding binary files by extension? Some binary files may be treated as correct Unicode files even if they don't contain really meaningful text, only a random set of Unicode characters from completely different ranges.
milo1012 wrote:Nope, there is no law or rule that says so. A txt file can be anything.
UTF-8 and UTF-16 allow binary zeros without invalidating the encoding scheme.
Text files don't contain null characters (or they are not text files anymore but binary ones), and UTF-8 files don't contain zero bytes either: the first byte of a sequence can't be zero, because text files don't contain null characters, and the bytes that continue a sequence can't be zero either, because they always have some bits set. A null in an ANSI or UTF-8 string is a terminating character, and it is not placed into text files.
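For reference, a minimal structural UTF-8 check looks only at these bit patterns (a simplified sketch, not code from TC or any plugin; it doesn't reject every overlong or surrogate encoding):

Code:
# Simplified structural UTF-8 check: the lead byte determines the sequence
# length, continuation bytes must match 10xxxxxx. Note that a lone 0x00 byte
# passes, because it is the one-byte encoding of U+0000.
def is_structurally_utf8(data):
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n = 0                 # single byte (ASCII range, incl. 0x00)
        elif 0xC2 <= b <= 0xDF:
            n = 1                 # start of a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            n = 2                 # start of a 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            n = 3                 # start of a 4-byte sequence
        else:
            return False          # invalid lead byte
        if n and i + n >= len(data):
            return False          # truncated sequence at end of data
        if any(data[i + j] & 0xC0 != 0x80 for j in range(1, n + 1)):
            return False          # continuation bytes must be 10xxxxxx
        i += n + 1
    return True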
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

MVV wrote:Of course some files with only a few characters from the higher Unicode ranges may be missed. Well, slower checking would be required there.
That's exactly why I don't see the point of suggesting such things in the first place.
An average user doesn't want to bother with hex or internal encoding schemes.
MVV wrote:Text files don't contain null characters (or they are not text files anymore but binary ones)
...
A null in an ANSI or UTF-8 string is a terminating character, and it is not placed into text files
Again: there is no rule that says so!

This is an old programmers' assumption, probably originating somewhere in the 70s/80s on some Unix workstations.
I've seen tons of configuration files for games and other software that use .txt files to store C-string-style (null-terminated) data.
MVV wrote:UTF-8 files don't contain zero bytes either: the first byte of a sequence can't be zero, because text files don't contain null characters
No, see above.
ASCII is a subset of UTF-8; that's why binary zeros must be allowed in the first place.
MVV wrote:the bytes that continue a sequence can't be zero either, because they always have some bits set
Captain Obvious. Of course I'm not talking about mixing zero bytes into the middle of a multi-byte sequence.
But between complete sequences you can mix in zero bytes as much as you want.
Just open such files in Notepad2 and you'll see.
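You can also check it directly (plain Python, nothing plugin-specific):

Code:
# U+0000 is a valid code point and its UTF-8 encoding is a single 0x00 byte,
# so a strict UTF-8 decoder accepts data with embedded zero bytes.
data = b'abc\x00def'
print(repr(data.decode('utf-8')))        # 'abc\x00def' - decodes without error
print('\u0000'.encode('utf-8'))          # b'\x00'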
TC plugins: PCREsearch and RegXtract
MVV
Power Member
Posts: 8711
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

You haven't specified what you mean by text files, BTW.
milo1012 wrote:This is an old programmers' assumption, probably originating somewhere in the 70s/80s on some Unix workstations.
I've seen tons of configuration files for games and other software that use .txt files to store C-string-style (null-terminated) data.
These are not text files but binary ones containing text. No pure UTF-8 text file can contain a null character. And most text editors will kill nulls if you save such a file.
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

MVV wrote:You haven't specified what you mean by text files, BTW.
Well, maybe because "text file" isn't exactly a well-defined term?
We're talking about interpreting raw bytes as characters, as opposed to a complex file format like e.g. office files.
MVV wrote:No pure UTF-8 text file can contain a null character.
How often do I have to repeat myself?
A null character is valid.
UTF-8 is an encoding scheme for Unicode, and control characters (U+0000 is one of them) are completely valid in it.
I won't bother discussing this any further.
MVV wrote:And most text editors will kill nulls if you save such a file
Maybe you should use other editors then.
Show me at least five (not counting the standard Windows Notepad).
TC plugins: PCREsearch and RegXtract