Problems w/ saving improper UTF-8 based chars for MRT

AntonyD · Post by *AntonyD » 2025-05-21, 11:07 UTC

While testing another bug, I came across badly garbled file names that I wanted to 'translate' back into the correct names.
These were downloaded archives whose file names had been saved with broken encoding:
the UTF-8 bytes of each character had been interpreted as characters of encoding 1251.
For example, instead of the Russian letter А, the two characters Рђ appeared.
You can reproduce this by saving that letter in a text file with UTF-8 encoding and then deliberately viewing it in Lister in the wrong mode:
turn on "Text Only", which gives the ANSI (Windows) mode with the 1251 encoding,
and you will see this "funny" symbol.
So I wrote out all letters from А to Я (the analogue of A-Z) in their "broken" form and put them into the "Search" field of the MRT (Multi-Rename Tool).
Into the "Rename" field I put the correct equivalents of these letters.
As an example, here are the first 9 pairs of those letters:
Search: Рђ|Р‘|Р’|Р“|Р”|Р•|РЃ|Р–|Р—
Rename: А|Б|В|Г|Д|Е|Ё|Ж|З
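
The pairs above can be regenerated mechanically. This is a minimal Python sketch, assuming that the "broken" form is exactly "UTF-8 bytes read as Windows-1251"; the helper name `to_mojibake` is hypothetical and not part of Total Commander:

```python
# The first nine Cyrillic capitals in the order used in the post.
LETTERS = "АБВГДЕЁЖЗ"


def to_mojibake(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Windows-1251."""
    return text.encode("utf-8").decode("cp1251")


# Build the two MRT fields: broken forms to search for, correct letters to use instead.
search = "|".join(to_mojibake(ch) for ch in LETTERS)
replace = "|".join(LETTERS)

print("Search:", search)
print("Rename:", replace)
```

Note that this works only for letters whose UTF-8 bytes all map to defined characters in encoding 1251; Python raises `UnicodeDecodeError` for the others.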

I tested this sequence on one file, made sure that everything worked perfectly, and then saved it as a template for further use,
which the MRT supports, as we know. I named the template 'bad utf8->good 1251'. After that, I closed the tool and gathered all
the files to be renamed in one panel. I selected them all, invoked the MRT, chose the newly created template and ......
I saw that Total had decided to 'correct' the first line, the data in the "Search" field, and showed those characters in their correct form.
So I saw this:
Search: А|Б|В|Г|Д|Е|Ё|Ж|З
Rename: А|Б|В|Г|Д|Е|Ё|Ж|З

As we understand it, this misrepresents the deliberately 'incorrect' character sequences I had specified.
I checked how the template was saved in the wincmd.ini file. The file has ANSI (1251) encoding, which is the default behavior;
I did nothing to change it. In the file, the template is recorded exactly the way I need it:
[rename]
bad utf8->good 1251_name=[N]
bad utf8->good 1251_ext=[E]
bad utf8->good 1251_search=Рђ|Р‘|Р’|Р“|Р”|Р•|РЃ|Р–|Р—
bad utf8->good 1251_replace=А|Б|В|Г|Д|Е|Ё|Ж|З
bad utf8->good 1251_params=0|1|1|1|0|0|0|0|0

So, as we understand it, when Total reads the string for the Search field, it performs an unnecessary conversion:
it decodes the value as UTF-8 and ends up with the correct Russian letters.
BUT I don't need that! I need to keep these 'funny' symbols as they are!
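
The ambiguity can be reproduced outside Total Commander. This is a minimal Python sketch, assuming the value sits in the file as Windows-1251 bytes (Python's `cp1251` codec). It only illustrates why the bytes are ambiguous and says nothing about how Total Commander is implemented:

```python
# The first three "broken" pairs from the Search field, written with escapes
# so the example does not depend on how this page itself is encoded.
stored = "\u0420\u0452|\u0420\u2018|\u0420\u2019"  # Рђ|Р‘|Р’

# Bytes as they would sit in an ANSI (1251) ini file.
raw = stored.encode("cp1251")

# Reading the bytes back as ANSI returns what the user typed.
as_ansi = raw.decode("cp1251")

# Reading the same bytes as UTF-8 also succeeds, but yields different text.
as_utf8 = raw.decode("utf-8")

print(raw.hex(" "))
print(as_ansi == stored)
print(as_utf8)
```

Both decodings are valid, so a reader that tries UTF-8 first has no way to tell from the bytes alone which text was intended.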

Post by *ghisler(Author) » 2025-05-21, 12:46 UTC

The only way to avoid such automatic UTF-8 code conversions is to save the ini file as UTF-16.
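
For readers who want to try this workaround, here is a minimal Python sketch of such a conversion. It assumes the existing file is pure ANSI in code page 1251, as in this thread, and that it is run while Total Commander is closed; the function name and the `.bak` backup are illustrative choices, not anything Total Commander provides:

```python
from pathlib import Path

UTF16_LE_BOM = b"\xff\xfe"


def convert_ini_to_utf16(path: Path, ansi_codepage: str = "cp1251") -> None:
    """Rewrite an ANSI ini file as UTF-16 LE with a BOM, keeping a backup copy."""
    raw = path.read_bytes()
    if raw.startswith(UTF16_LE_BOM):
        return  # already UTF-16 LE, nothing to do

    # Strict decoding: fail loudly instead of silently replacing bytes
    # that are undefined in the assumed code page.
    text = raw.decode(ansi_codepage)

    # Keep the original bytes next to the file before overwriting it.
    backup = path.with_name(path.name + ".bak")
    backup.write_bytes(raw)

    path.write_bytes(UTF16_LE_BOM + text.encode("utf-16-le"))


# Example call (the path is hypothetical):
# convert_ini_to_utf16(Path(r"C:\Users\me\AppData\Roaming\GHISLER\wincmd.ini"))
```

One caveat of this sketch: it treats every value as ANSI, so any value that had intentionally been stored as UTF-8 inside the ANSI file would be converted incorrectly.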

AntonyD · Post by *AntonyD » 2025-05-21, 16:49 UTC

1) How is a user who has never visited this forum supposed to realize that the cause of his problem is that, for some reason,
the encoding is incorrect ONLY NOW? After all, UP to THIS point in time, he had not had a single problem with the contents of this file!!!
Everything was fine until now... and then suddenly he somehow has to guess that the file must be converted by hand?
SO why isn't it done AUTOMATICALLY, so that the user doesn't get confused at all in this situation?

2) Good people found that in version 9.51rc3 everything worked WITHOUT the need to re-encode anything, but since version 9.51rc4
we have to do this. Could you please just roll back that change and restore the correct behavior, without requiring any
INI file conversion?

Post by *ghisler(Author) » 2025-05-22, 07:43 UTC

Sorry, I will not change that. It would break a lot more than it would fix.

Moderator message from: ghisler(Author) » 2025-05-22, 07:43 UTC

Moved to will not be changed

AntonyD · Post by *AntonyD » 2025-05-22, 07:50 UTC

Can you at least explain why this INI file is not saved in a MORE stable encoding from the start?
Why do we have to do something manually? Why is UTF-16LE not the default?

The first part of my question is more general than a technical request to change something.
HOW are we supposed to guess that the encoding is the problem at this stage of product development?
Could you at least add an explanation to the help for these fields or, even better,
to the tooltips above these fields?

In the second part, I named the point where the behavior broke, 9.51rc4, in plain text.
Don't you want to explain WHAT was changed in that version, and for what purpose, so that it broke
and has stayed broken ever since?

Post by *ghisler(Author) » 2025-05-23, 07:26 UTC

AntonyD wrote: 2025-05-22, 07:50 UTC Why is UTF-16LE not the default?

1. Because when a program writes to an ini file that doesn't exist yet, Windows automatically creates an ANSI version
2. Because of compatibility with Windows 9x/ME, e.g. when using portable TC or multi-boot

AntonyD · Post by *AntonyD » 2025-05-23, 08:22 UTC

When the program runs on some version of Windows, Total can always determine that version for certain.
So, in terms of program logic, it can always decide whether to convert
the INI file to a more stable format simply by "asking itself"
which version of Windows it was launched on.
Depending on the answer, Total can convert to UTF-16LE automatically.
THERE is NO reason to make the user think about it.

Well, the other two parts were left without comment for some reason,
although they are perhaps the most important in this discussion...

The first part of my question is more general than a technical request to change something.
HOW are we supposed to guess that the encoding is the problem at this stage of product development?
Could you at least add an explanation to the help for these fields or, even better,
to the tooltips above these fields?

In the second part, I named the point where the behavior broke, 9.51rc4, in plain text.
Don't you want to explain WHAT was changed in that version, and for what purpose, so that it broke
and has stayed broken ever since?

Post by *white » 2025-05-24, 14:05 UTC

AntonyD asked me twice for a response in this thread. I have looked into it now and here is my response.

I was able to confirm that the behavior changed in TC 9.51rc4. I think it is because of this change:

HISTORY.TXT wrote: 04.03.20 Fixed: New option in regional settings "Beta: Use Unicode UTF-8 for worldwide language support": Read both UTF-8 and ANSI values from wincmd.ini (32/64)

This is my interpretation:
By default, TC uses ANSI for storing wincmd.ini. Values are stored in wincmd.ini as ANSI or, when necessary, as UTF-8 with BOM. But the Windows beta option that sets the system's codepage to 65001 causes a problem: values in wincmd.ini (ANSI format) may have been stored as UTF-8 without BOM. I assume the change in TC 9.51rc4 is that TC tries to read each value as UTF-8, with ANSI as the fallback. That way it wouldn't matter if the user, for example, temporarily enabled the Windows beta option and later disabled it.

The problem with this is that valid ANSI can be valid UTF-8 at the same time. TC could take the wrong value, as is the case presented in this thread. Perhaps a solution to this is to make a change when storing the value. After TC stores the value to wincmd.ini, it could retrieve the value the way it would when loading the value (attempting to interpret it as UTF-8), and check if the retrieved value differs from the value that was supposed to be stored. If so, TC could store the value with BOM instead.
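
The store-then-verify idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `load_value` models the reading behavior as interpreted above (BOM first, then UTF-8, then ANSI), the ANSI code page is taken to be 1251, and both function names are hypothetical. It is not TC's actual code:

```python
import codecs

ANSI = "cp1251"         # assumption: the ANSI code page of the system in this thread
BOM = codecs.BOM_UTF8   # the three bytes EF BB BF


def load_value(raw: bytes) -> str:
    """Model of the assumed reader: UTF-8 with BOM, then plain UTF-8, then ANSI."""
    if raw.startswith(BOM):
        return raw[len(BOM):].decode("utf-8")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(ANSI)


def store_value(text: str) -> bytes:
    """Prefer ANSI, but switch to UTF-8 with BOM if ANSI would not read back intact."""
    try:
        raw = text.encode(ANSI)
    except UnicodeEncodeError:
        # Not representable in ANSI at all: UTF-8 with BOM is the only option.
        return BOM + text.encode("utf-8")
    if load_value(raw) != text:
        # The ANSI bytes happen to be valid UTF-8 and would be misread.
        return BOM + text.encode("utf-8")
    return raw


# The "broken" search string from this thread is stored with a BOM and survives:
broken = "\u0420\u0452|\u0420\u2018"  # Рђ|Р‘
assert load_value(store_value(broken)) == broken
```

Ordinary Russian text still ends up as plain ANSI under this scheme, because its 1251 bytes are not valid UTF-8 and the reader falls back to ANSI on its own.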

AntonyD · Post by *AntonyD » 2025-05-24, 18:30 UTC

2white
Thank you!

Post by *ghisler(Author) » 2025-05-25, 09:48 UTC

Yes, that's exactly the problem: Users may have enabled "Beta: Use Unicode UTF-8 for worldwide language support", saved some things to the wincmd.ini, and then disabled that option. Therefore, when TC sees valid UTF-8 in a key, it converts it from UTF-8 to UTF-16 and not from ANSI to UTF-16. I don't currently see any way to better handle this.

Post by *white » 2025-05-25, 11:21 UTC

ghisler(Author) wrote: 2025-05-25, 09:48 UTC I don't currently see any way to better handle this.

What about my suggestion to maintain the current reading behavior for wincmd.ini, but adjust the writing process? Specifically, to switch to UTF-8 with BOM for values where the ANSI-encoded string would be misinterpreted as valid UTF-8 under the currently used OS.

AntonyD · Post by *AntonyD » 2025-05-25, 12:34 UTC

2ghisler(Author)
you could introduce a simple "RAW data" checkbox: IF the user enters such chars and
ticks this checkbox, he knows for 100% sure that these chars will be saved in your INI files
precisely AS-IS.

Then you can use whatever methods you want, as long as the next time the data is read from this file, the user
sees EXACTLY the SAME characters in the same fields as he entered.

And why do you still have no comment on my following paragraphs? )))

The first part of my question is more general than a technical request to change something.
HOW are we supposed to guess that the encoding is the problem at this stage of product development?
Could you at least add an explanation to the help for these fields or, even better,
to the tooltips above these fields?

In the second part, I named the point where the behavior broke, 9.51rc4, in plain text.
Don't you want to explain WHAT was changed in that version, and for what purpose, so that it broke
and has stayed broken ever since?

the main things are marked

Or did our colleague white guess the answer to the second part correctly?
