9.0b9 x64 - wincmd.ini encoding issues

milo1012 · Post by *milo1012 » 2016-08-12, 16:21 UTC

MVV wrote:and these BOMs may tell editors that file has UTF-8

Not just the BOMs, but also the UTF-8 byte sequences following each BOM.

MVV wrote:but you will not see these BOMs because BOM has no visible representation

Not necessarily. Scintilla-based editors somehow alter the font rendering when a BOM is placed in front of non-ASCII characters.
https://abload.de/img/bomtestdjuo0.png
https://abload.de/img/bomtest310kwj.png
I also noticed this in some browsers: the character following a BOM is slightly "set off" (by one or two pixels).

MVV wrote:I think that it is a bad idea nowadays to open files in Windows Notepad by default because of mentioned reasons

That's why TC IMO should clearly state that you need to honor the ANSI file encoding when manually editing the ini file, at least putting it in the help section "ini file Settings" (4.b).

MVV wrote:It is correct that TC uses Windows API that only support ANSI and UTF-16, UTF-8 is not supported, but TC may store some strings in UTF-8 with personal BOMs

I think we already clarified that.

Post by *ghisler(Author) » 2016-08-12, 16:38 UTC

Maybe your editor sees the UTF-8 BOM in the middle of the file, or the UTF-8 encoded characters, and assumes that the entire file is UTF-8 (which it is not). As others have written, the Windows INI file functions do not support UTF-8.

milo1012 · Post by *milo1012 » 2016-08-12, 16:48 UTC

2ghisler
The scenario is actually not far-fetched: on a fresh ini, search for a string that doesn't fit the system code page. Open the ini in an editor and it will detect it as UTF-8. Basically any editor that has encoding detection would see it as UTF-8.
Now users with no knowledge of the intended encoding might edit the ini and save it as UTF-8 with a prefixed BOM, or try to recode it to ANSI.
Therefore, like I said before: wouldn't it be better to at least clearly state in the help file what encoding the ini file needs to have?

mag · Post by *mag » 2016-08-12, 19:44 UTC

There are actually 2 issues here as I see them:

1. Mixing
a) strings with national characters that still fit into the system code page
and
b) strings with characters that don't
results in the INI file containing the first group of strings encoded in ANSI and the second group encoded in UTF-8 (locally: each such string is prefixed with a BOM). That will confuse a lot of text editors. For example, I often use PSPad and it can't handle that (it processes the whole file as UTF-8 and the first group of strings comes out malformed). You will need to find an editor that can.

2. The default editor for INI files is Windows Notepad. Unless that is changed, the tcmd option "Configuration / Change Settings Files Directly" opens the INI file in Notepad (tcmd doesn't respect its configured Editor here). If Notepad then considers the file to be UTF-8 encoded (which is easy to achieve), it adds a UTF-8 BOM at the beginning when saving, and the resulting INI file causes trouble.

Note that both issues would be solved if the INI file were encoded in UTF-16 LE (ideally from the very beginning).
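
To make the two issues concrete, here is a minimal Python sketch. It assumes the layout described in this thread (ANSI values stored as-is, other values stored as UTF-8 with their own BOM prefix) and assumes cp1252 as the system code page. The helper names `decode_value` and `has_leading_bom` are hypothetical, and this is not TC's actual code.

```python
import codecs

UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

def decode_value(raw: bytes, ansi_codepage: str = "cp1252") -> str:
    """Decode one ini value (hypothetical helper).

    Assumption taken from the thread: a value that starts with a UTF-8 BOM
    is UTF-8, anything else is in the system (ANSI) code page.
    """
    if raw.startswith(UTF8_BOM):
        return raw[len(UTF8_BOM):].decode("utf-8")
    return raw.decode(ansi_codepage)

def has_leading_bom(file_bytes: bytes) -> bool:
    """True if the whole file starts with a UTF-8 BOM (issue 2 above)."""
    return file_bytes.startswith(UTF8_BOM)

# Issue 1: one ANSI value and one BOM-prefixed UTF-8 value in the same file.
ansi_value = "Grüße".encode("cp1252")
mixed_value = UTF8_BOM + "привет".encode("utf-8")

# An editor that treats everything as UTF-8 mangles the ANSI value.
misread = ansi_value.decode("utf-8", errors="replace")
```

Decoding per value keeps both groups intact, while decoding the whole file as UTF-8 turns the ANSI bytes into replacement characters.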

Post by *ghisler(Author) » 2016-08-16, 20:07 UTC

Unfortunately Windows does not create UTF-16 ini files by itself when just writing strings with INI functions.

mag · Post by *mag » 2016-08-17, 01:47 UTC

And can you work around that by adding a conversion operation (to UTF-16 LE) to the installer, for example? It could be optional, like changing the ini file location already is.

MVV · Post by *MVV » 2016-08-17, 07:31 UTC

ghisler(Author),
I think you can simply create wincmd.ini yourself in UTF-16 encoding with any contents, for example:

Code:

[Configuration]
test=0

Further API calls will then work with this Unicode file.
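
A minimal sketch of how such a file could be pre-created, in Python for illustration only. The function name `create_utf16_ini` is hypothetical, and the sketch only writes the file; that the INI functions then keep it in Unicode is taken from the post above and is not tested here.

```python
import codecs
import tempfile
from pathlib import Path

INI_TEXT = "[Configuration]\r\ntest=0\r\n"

def create_utf16_ini(path: Path, text: str = INI_TEXT) -> None:
    """Write a new ini file as UTF-16 LE with a leading byte order mark."""
    if path.exists():
        # Never clobber an existing configuration in this sketch.
        raise FileExistsError(path)
    path.write_bytes(codecs.BOM_UTF16_LE + text.encode("utf-16-le"))

# Usage example in a throwaway directory instead of the real TC folder.
with tempfile.TemporaryDirectory() as tmp:
    ini_path = Path(tmp) / "wincmd.ini"
    create_utf16_ini(ini_path)
    written = ini_path.read_bytes()
```

The BOM is written explicitly so the byte order does not depend on the platform default.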

Post by *ghisler(Author) » 2016-08-17, 19:34 UTC

I found a better workaround which also works with already existing ini files (as long as the byte order mark hasn't been added yet): TC now adds the following line:
SetEncoding=äö.do.not.remove

I use äö because the sequence also gives valid double-byte characters in double-byte languages like Chinese. In Cyrillic it would be дц, in Chinese 漩, etc. Notepad sees this and does NOT switch to UTF-8 mode because it's not a valid UTF-8 sequence. That's how it was normally supposed to work: users mainly search and save using their own language, so strings from other codepages should be in the vast minority.
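
The byte-level reasoning can be checked with a short Python sketch. It assumes cp1252 as the Western code page, cp1251 for Cyrillic and GBK for Chinese; the helper `is_valid_utf8` is hypothetical and only approximates what an editor's encoding detection might do.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "äö" in the Western ANSI code page is the two bytes E4 F6.
marker_bytes = "äö".encode("cp1252")
line_bytes = b"SetEncoding=" + marker_bytes + b".do.not.remove"

# E4 announces a three-byte UTF-8 sequence, but F6 is not a continuation
# byte, so the line cannot be valid UTF-8.
line_is_utf8 = is_valid_utf8(line_bytes)

# The same two bytes read in other ANSI code pages.
cyrillic = marker_bytes.decode("cp1251")
chinese = marker_bytes.decode("gbk")  # one double-byte character
```

Because the marker line is pure ANSI and invalid as UTF-8, a detector that validates the whole file has a reason to reject UTF-8.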