"Verify Checksums" error if SFV file is UTF-8 encoded

Please report only one bug per message!

Moderators: white, Hacker, petermad, Stefan2

soma01
Junior Member
Posts: 3
Joined: 2014-05-29, 08:36 UTC


Post by *soma01 »

The "Verify Checksums" command reports a "not found" error if the SFV file is UTF-8 encoded and has no BOM.


With BOM:
history.txt.utf8.sfv:
OK: history-éáőöüűíó.txt

Errors: 0
OK: 1, not found: 0, read error: 0, wrong checksum: 0

Without BOM:
history.txt.utf8.sfv:
Cannot open input file history-éáőöüűíó.txt!

Errors: 1
OK: 0, not found: 1, read error: 0, wrong checksum: 0
I think it should work in both cases.
I have not tested with MD5 and SHA1 files.
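For reference, the two test SFV files described above could be generated like this (a hypothetical Python sketch; the file name is taken from the report, the contents and output file names are assumptions):

```python
import zlib

# File name from the bug report; the target file's contents are assumed here.
name = "history-éáőöüűíó.txt"
content = b"example contents"

# An SFV line is "<name> <crc32 in hex>".
line = f"{name} {zlib.crc32(content):08x}\n"

# Variant 1: UTF-8 with BOM (verifies OK in TC).
with open("with_bom.sfv", "wb") as f:
    f.write(b"\xef\xbb\xbf" + line.encode("utf-8"))

# Variant 2: UTF-8 without BOM (triggers the "not found" error).
with open("without_bom.sfv", "wb") as f:
    f.write(line.encode("utf-8"))
```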
soma01
Junior Member
Posts: 3
Joined: 2014-05-29, 08:36 UTC

Post by *soma01 »

More info about the BOM: see the Wikipedia article on the byte order mark, "UTF-8" section.
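As that article notes, the UTF-8 BOM is the three bytes 0xEF 0xBB 0xBF at the start of the file. A reader could detect and strip it like this (a hypothetical Python sketch, not Total Commander code):

```python
# The UTF-8 encoding of U+FEFF, the byte order mark.
UTF8_BOM = b"\xef\xbb\xbf"

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data with a leading UTF-8 BOM removed, if present."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data
```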
ghisler(Author)
Site Admin
Posts: 48072
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland
Contact:

Post by *ghisler(Author) »

Sorry, UTF-8 SFV files without a BOM are not supported.
Author of Total Commander
https://www.ghisler.com
soma01
Junior Member
Posts: 3
Joined: 2014-05-29, 08:36 UTC

Post by *soma01 »

Do you plan to support it?

Is it difficult to implement?

- Programs on non-Windows systems usually create UTF-8 text files without a BOM.
- SFV files can be created on systems other than Windows (e.g. Linux).
- On Windows, some programs (Notepad++) can save both variants.
- The TC built-in viewer can display such files correctly.
- Wikipedia:
The UTF-8 representation of the BOM is the byte sequence 0xEF,0xBB,0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.

The Unicode Standard permits the BOM in UTF-8, but does not require nor recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8. The BOM may also appear when UTF-8 data is converted from other encodings that use a BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.

The primary motivation for not using a BOM is possible backwards compatibility with software that is not Unicode-aware. Often, a file encoded in UTF-8 is compatible with software that expects ASCII as long as it does not include a BOM. Examples include: a text file that only uses ASCII characters, a programming language that permits non-ASCII characters in certain free-text contexts (such as string literals or comments) but not elsewhere (such as at the start of a file), and a Unix shell that looks for a shebang at the start of a script. Another reason is that some programming environments, such as Java, are Unicode-aware but do not automatically handle the BOM, slightly complicating programming.

Another motivation for not using a BOM is to encourage UTF-8 as the "default" encoding.

However, the argument for using a BOM is that without it, heuristic analysis is required to determine what character encoding a file is using. Historically such analysis, to distinguish various 8-bit encodings, is complicated, error-prone, and sometimes slow. A number of libraries are available to ease the task, such as Mozilla Universal Charset Detector and International Components for Unicode. Programmers mistakenly assume that detection of UTF-8 is equally difficult (it is not, because the vast majority of byte sequences are invalid UTF-8, while the encodings these libraries are trying to distinguish allow all possible byte sequences). Therefore not all Unicode-aware programs perform such an analysis and instead rely on the BOM. In particular, Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad, will not correctly read UTF-8 text unless it has only ASCII characters or starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.

The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it “SHOULD forbid use of U+FEFF as a signature.”
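The point above, that valid UTF-8 is easy to detect because most byte sequences are invalid UTF-8, can be sketched in Python (a hypothetical illustration, not Total Commander code):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: bytes that decode cleanly as UTF-8 almost certainly
    are UTF-8, because random 8-bit data rarely forms valid UTF-8
    multi-byte sequences."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A Hungarian file name like the one in this bug report:
utf8_bytes = "history-éáőöüűíó.txt".encode("utf-8")
latin2_bytes = "history-éáőöüűíó.txt".encode("iso-8859-2")

assert looks_like_utf8(utf8_bytes)       # valid UTF-8
assert not looks_like_utf8(latin2_bytes) # 8-bit Latin-2 is not valid UTF-8
```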
ghisler(Author)
Site Admin
Posts: 48072
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland
Contact:

Post by *ghisler(Author) »

Well, TC could first try ANSI, and if no file is found, UTF-8. Or it could parse the entire file to check whether it is a valid UTF-8 file or not.
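The fallback suggested here could look roughly like this (a hypothetical Python sketch, not TC's actual code; cp1252 stands in for the ANSI codepage, and "parse the entire file" is approximated by a full decode attempt):

```python
# UTF-8 encoding of U+FEFF, the byte order mark.
UTF8_BOM = b"\xef\xbb\xbf"

def decode_sfv(raw: bytes) -> str:
    """Decode SFV file contents: honor a BOM if present; otherwise
    prefer UTF-8 when the whole file is valid UTF-8, and fall back
    to ANSI (cp1252 here as an example codepage)."""
    if raw.startswith(UTF8_BOM):
        return raw[len(UTF8_BOM):].decode("utf-8")
    try:
        # Validates the entire file; fails fast on non-UTF-8 bytes.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252")
```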
Author of Total Commander
https://www.ghisler.com