
Is there any WDX can verify the txt file encoding
Moderators: Hacker, petermad, Stefan2, white
There are many txt files with different encodings: ANSI, UTF-8 (with or without BOM), UTF-16. I need to pick them out. Is there any WDX plugin that can do this? I can't find one. I want to make one myself, but I don't know how.
Can anybody give me a clue? Many thanks!~~

Unicode (UTF-16) is quite easy to detect: each English (ASCII) character is stored with a null byte next to it. So if you see at least one null byte in a text file, it is probably UTF-16-encoded.
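A minimal sketch of that null-byte heuristic in Python (the function name and the extra BOM check are my own additions, not part of the original suggestion):

```python
def looks_like_utf16(data: bytes) -> bool:
    """Heuristic guess whether raw bytes are UTF-16 text."""
    # A UTF-16 BOM at the start is a strong signal on its own.
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return True
    # ASCII characters in UTF-16 carry a zero byte; plain ANSI or
    # UTF-8 text normally contains no null bytes at all.
    return b"\x00" in data

print(looks_like_utf16("Hello".encode("utf-16-le")))  # True
print(looks_like_utf16(b"Hello"))                     # False
```

Note this is only a heuristic: a binary file would also trigger it, so a real plugin would combine it with other checks.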
UTF-8 detection requires a smarter algorithm. English characters (and the rest of the first half of the ANSI codepage) are encoded identically in ANSI and UTF-8, so to check the file encoding you really need to know the text's language. You could count characters of that language under both interpretations (in UTF-8, each character from the second half of the ANSI codepage becomes a 2-byte sequence), or look for invalid UTF-8 sequences. The algorithm could also look for actual words of the text's language; that can help determine the encoding.
The algorithm can also find sequences of bytes with codes >127 and try to decode them as UTF-8; if that succeeds, the file may be UTF-8. And you can check the bytes from the second half of the codepage directly, since the UTF-8 standard imposes rules on them:
two-byte coded character: 110xxxxx 10xxxxxx
three-byte coded character: 1110xxxx 10xxxxxx 10xxxxxx
four-byte coded character: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

or, without masks, as decimal byte ranges:

two-byte coded character: 192..223 128..191
three-byte coded character: 224..239 128..191 128..191
four-byte coded character: 240..247 128..191 128..191 128..191

So if the algorithm finds no byte sequence that violates these masks, it may treat the file as UTF-8-encoded.
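The mask check above can be sketched like this in Python (a simplified validator; it applies exactly the byte ranges listed, without the stricter overlong/surrogate rules of the full UTF-8 spec):

```python
def matches_utf8_masks(data: bytes) -> bool:
    """Check every byte sequence against the UTF-8 lead/continuation masks."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                # 0xxxxxxx: single ASCII byte
            n = 0
        elif 0xC0 <= b <= 0xDF:     # 110xxxxx (192..223): two-byte sequence
            n = 1
        elif 0xE0 <= b <= 0xEF:     # 1110xxxx (224..239): three-byte sequence
            n = 2
        elif 0xF0 <= b <= 0xF7:     # 11110xxx (240..247): four-byte sequence
            n = 3
        else:                       # stray continuation byte or invalid lead
            return False
        if i + n >= len(data):      # sequence truncated at end of file
            return False
        # the n following bytes must all be 10xxxxxx (128..191)
        for j in range(1, n + 1):
            if not 0x80 <= data[i + j] <= 0xBF:
                return False
        i += n + 1
    return True

print(matches_utf8_masks("héllo".encode("utf-8")))  # True
print(matches_utf8_masks(b"caf\xe9"))               # False (Latin-1/ANSI byte)
```

Random ANSI text with accented characters will usually fail this check quickly, which is why "decodes as UTF-8" is a reasonably strong hint that the file really is UTF-8.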
To fanzhixin:
For a solution using Lev Freidin's Script Content Plugin (CheckEncoding.vbs), see this posting.