
Is there any WDX can verify the txt file encoding
Moderators: Hacker, petermad, Stefan2, white
There are many txt files with different encodings: ANSI, UTF-8 (with or without BOM), UTF-16. I need to pick them out. Is there any WDX plugin that can do this? I can't find one. I want to make one myself, but I don't know how.
Can anybody give me a clue? Many thanks!~~

Unicode (UTF-16) is quite easy to detect: each English (ASCII) character is stored with a null byte next to it. So if you see at least one null byte in a text file, it is probably UTF-16-encoded.
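A minimal sketch of that null-byte heuristic in Python (the function name and the extra BOM check are my own additions, not part of the original suggestion):

```python
def looks_like_utf16(data: bytes) -> bool:
    """Heuristic guess whether raw bytes are UTF-16 text."""
    # A UTF-16 BOM at the start is a strong signal on its own.
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return True
    # ASCII characters in UTF-16 carry a zero byte; plain ANSI or
    # UTF-8 text normally contains no null bytes at all.
    return b"\x00" in data

print(looks_like_utf16("Hello".encode("utf-16-le")))  # True
print(looks_like_utf16(b"Hello"))                     # False
```

Note this is only a heuristic: a binary file would also trigger it, so a real plugin would combine it with other checks.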
UTF-8 detection requires a smarter algorithm. English characters (and the rest of the first half of the ANSI codepage) are encoded identically in ANSI and UTF-8, so to check the file encoding you really need to know the text's language. You could count characters of that language under both interpretations (in UTF-8, each character from the second half of the ANSI codepage becomes a 2-byte sequence), or look for invalid UTF-8 sequences. The algorithm could also look for actual words of the text's language; that can help determine the encoding.
The algorithm can also find sequences of bytes with codes >127 and try to decode them as UTF-8; if that succeeds, the file may be UTF-8. And you can check the bytes from the second half of the codepage directly, since the UTF-8 standard imposes rules on them:
two-byte coded character: 110xxxxx 10xxxxxx
three-byte coded character: 1110xxxx 10xxxxxx 10xxxxxx
four-byte coded character: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

or, without masks, as decimal byte ranges:

two-byte coded character: 192..223 128..191
three-byte coded character: 224..239 128..191 128..191
four-byte coded character: 240..247 128..191 128..191 128..191

So if the algorithm finds no byte sequence that violates these masks, it may treat the file as UTF-8-encoded.
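The mask check above can be sketched like this in Python (a simplified validator; it applies exactly the byte ranges listed, without the stricter overlong/surrogate rules of the full UTF-8 spec):

```python
def matches_utf8_masks(data: bytes) -> bool:
    """Check every byte sequence against the UTF-8 lead/continuation masks."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                # 0xxxxxxx: single ASCII byte
            n = 0
        elif 0xC0 <= b <= 0xDF:     # 110xxxxx (192..223): two-byte sequence
            n = 1
        elif 0xE0 <= b <= 0xEF:     # 1110xxxx (224..239): three-byte sequence
            n = 2
        elif 0xF0 <= b <= 0xF7:     # 11110xxx (240..247): four-byte sequence
            n = 3
        else:                       # stray continuation byte or invalid lead
            return False
        if i + n >= len(data):      # sequence truncated at end of file
            return False
        # the n following bytes must all be 10xxxxxx (128..191)
        for j in range(1, n + 1):
            if not 0x80 <= data[i + j] <= 0xBF:
                return False
        i += n + 1
    return True

print(matches_utf8_masks("héllo".encode("utf-8")))  # True
print(matches_utf8_masks(b"caf\xe9"))               # False (Latin-1/ANSI byte)
```

Random ANSI text with accented characters will usually fail this check quickly, which is why "decodes as UTF-8" is a reasonably strong hint that the file really is UTF-8.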
To fanzhixin:
For a solution using Lev Freidin's Script Content Plugin (CheckEncoding.vbs), see this posting.