Is there any WDX that can detect the encoding of a txt file?

fanzhixin
Junior Member
Posts: 4
Joined: 2009-08-27, 14:23 UTC

Post by *fanzhixin »

I have many txt files in different encodings: ANSI, UTF-8 (with or without BOM), and UTF-16, and I need to sort them out by encoding. Is there any WDX that can help? I couldn't find one. I would like to write one, but I don't know how. :? Can anybody give me a clue? Many thanks!
MVV
Power Member
Posts: 8711
Joined: 2008-08-03, 12:51 UTC
Location: Russian Federation

Post by *MVV »

Unicode (UTF-16) is quite easy to detect: each English character is followed by a null byte (in little-endian UTF-16, which is what Windows writes; in big-endian the null byte comes first). So if you find at least one null byte in a text file, it is most likely UTF-16-encoded.
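
The null-byte check above could be sketched roughly like this in Python. This is only an illustration, not a WDX plugin; the function name looks_like_utf16 is made up for the example, and the BOM check at the start is an extra assumption that goes beyond the null-byte idea.

```python
import codecs


def looks_like_utf16(data: bytes) -> bool:
    """Heuristic only: guess whether raw file bytes are UTF-16 text."""
    # Assumption beyond the post: a UTF-16 byte order mark settles it at once.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    # The idea from the post: ANSI and UTF-8 text normally contains no
    # null bytes, while UTF-16 text with English characters is full of them.
    return b"\x00" in data
```

To use it, read the file in binary mode (open(path, "rb").read()) and pass the bytes in. A binary file that is not text at all will also contain null bytes, so this only makes sense for files already known to be text.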

UTF-8 detection requires a smarter algorithm. English characters (and all others from the first half of the ANSI codepage, codes 0..127) are encoded the same way in both ANSI and UTF-8. So you need to know the language of the text if you want to check the file encoding: I think you could count the characters of that language under an ANSI and under a UTF-8 interpretation (in UTF-8 each character from the second half of the ANSI codepage is encoded as a sequence of two or more bytes), or look for invalid UTF-8 sequences. The algorithm could also look for real words of the text's language, which may help to determine the encoding.
The algorithm could also look for runs of bytes with codes >127 and try to decode them as UTF-8; if that works, the file may be UTF-8. And you can check the bytes from the second half of the codepage against the rules of the UTF-8 standard:
two-byte coded character: 110xxxxx 10xxxxxx
three-byte coded character: 1110xxxx 10xxxxxx 10xxxxxx
four-byte coded character: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
or (without masks, just byte ranges; strictly, the standard also excludes lead bytes 192, 193 and 245..247)
two-byte coded character: 192..223 128..191
three-byte coded character: 224..239 128..191 128..191
four-byte coded character: 240..247 128..191 128..191 128..191
So, if the algorithm finds no byte sequences that violate these masks, it may treat the file as UTF-8-encoded.
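
As a rough sketch, the mask check could look like this in Python. The names follows_utf8_masks and guess_encoding are hypothetical, and treating a file with no bytes above 127 as plain "ascii" is an assumption of the example. The code applies exactly the ranges listed above, so it is more lenient than a strict decoder, which also rejects overlong forms and a few lead bytes.

```python
def follows_utf8_masks(data: bytes) -> bool:
    """True if every byte > 127 belongs to a sequence matching the masks above."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:            # first half of the codepage: single byte
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:   # 110xxxxx: one continuation byte follows
            extra = 1
        elif 0xE0 <= lead <= 0xEF: # 1110xxxx: two continuation bytes follow
            extra = 2
        elif 0xF0 <= lead <= 0xF7: # 11110xxx: three continuation bytes follow
            extra = 3
        else:                      # stray 10xxxxxx byte or 248..255
            return False
        tail = data[i + 1 : i + 1 + extra]
        # Every continuation byte must be 10xxxxxx (128..191), none missing.
        if len(tail) != extra or any(not 0x80 <= b <= 0xBF for b in tail):
            return False
        i += 1 + extra
    return True


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: 'ascii', 'utf-8' or 'ansi' for non-UTF-16 text."""
    if all(b < 0x80 for b in data):
        return "ascii"  # no bytes > 127, so ANSI and UTF-8 look identical
    return "utf-8" if follows_utf8_masks(data) else "ansi"
```

For example, guess_encoding("héllo".encode("cp1252")) gives "ansi", because the byte 233 looks like a three-byte lead but is followed by ordinary letters. In Python, the built-in strict check data.decode("utf-8") does the same job more thoroughly and raises UnicodeDecodeError on invalid input.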
van Dusen
Power Member
Posts: 684
Joined: 2004-09-16, 19:30 UTC
Location: Sinzig (Rhein), Germany

Post by *van Dusen »

2fanzhixin
For a solution using Lev Freidin's Script Content Plugin (CheckEncoding.vbs), see this posting