Error in file format detection in CompareByContent

petermad
Power Member
Posts: 14807
Joined: 2003-02-05, 20:24 UTC
Location: Denmark

Post by *petermad »

To reproduce:
Compare the history.txt file from TC 9.22 with any other text file, for example keyboard.txt: history.txt is detected as UTF-8. Then open the same file in Lister: there it is not detected as UTF-8.

Up until TC 9.0a rc2 this history.txt file is detected as ANSI, so the "bug" occurs the first time in TC 9.0a rc3.

If I delete the first 152 lines in history.txt - the last deleted line being:
17.06.18 Fixed: When using IgnoreDirErrors=1, switching to non-existent parent directory didn't work by double clicking on [..] entry (32/64)
then history.txt is detected as ANSI

But it is not that line alone that does it, because if I only delete that line, the file is still detected as UTF-8, as it also is if I only delete the first 151 lines.
Last edited by petermad on 2019-03-27, 22:39 UTC, edited 1 time in total.
License #524 (1994)
Danish Total Commander Translator
TC 11.03 32+64bit on Win XP 32bit & Win 7, 8.1 & 10 (22H2) 64bit, 'Everything' 1.5.0.1371a
TC 3.50 on Android 6 & 13
Try: TC Extended Menus | TC Languagebar | TC Dark Help | PHSM-Calendar
ghisler(Author)
Site Admin
Posts: 48083
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Re: Error in file format detection in CompareByContents

Post by *ghisler(Author) »

The actual reason is the following string:
<META content="text/html; charset=utf-8; encoding=utf-8" http-equiv="Content-Type">

This is one of the methods used to detect UTF-8 files.
Author of Total Commander
https://www.ghisler.com

Re: Error in file format detection in CompareByContents

Post by *petermad »

Deleting <META content="text/html; charset=utf-8; encoding=utf-8" http-equiv="Content-Type"> from the file solves the problem

Interesting, though, that deleting a certain number of lines BEFORE the line with the META tag also makes TC recognize the file as ANSI. How can that be?
Usher
Power Member
Posts: 1675
Joined: 2011-03-11, 10:11 UTC

Re: Error in file format detection in CompareByContents

Post by *Usher »

Maybe the deleted lines contain other HTML tags?
Andrzej P. Wozniak
Polish subforum moderator

Re: Error in file format detection in CompareByContents

Post by *petermad »

Usher wrote: "Maybe the deleted lines contain other HTML tags?"
As I wrote, if I only delete the first 151 lines the file is still detected as UTF-8; it is only when I delete line 152 as well that it is detected as ANSI. And line 152:
17.06.18 Fixed: When using IgnoreDirErrors=1, switching to non-existent parent directory didn't work by double clicking on [..] entry (32/64)
does not contain HTML tags.

And the line:
08.12.16 Fixed: Lister, html view: <META content="text/html; charset=utf-8; encoding=utf-8" http-equiv="Content-Type"> was not recognized as UTF-8 encoding header (32/64)
is line 770, so it is still present after I have deleted the first 152 lines. So why is the file then detected as ANSI, despite the presence of the former line 770?


To express it in another way, if the line:
08.12.16 Fixed: Lister, html view: <META content="text/html; charset=utf-8; encoding=utf-8" http-equiv="Content-Type"> was not recognized as UTF-8 encoding header (32/64)
is line number 770, the file is detected as UTF-8, but if it is line number 618, the file is detected as ANSI - that is what I find weird.

The first 152 lines take up 19705 bytes - maybe the number of bytes before the META line matters.

Re: Error in file format detection in CompareByContents

Post by *Usher »

How did you get UTF-8 settings? I tried with a clean ini and it always shows the ANSI<->ANSI button. Which editor do you use?

Re: Error in file format detection in CompareByContents

Post by *petermad »

2Usher
Just use the history.txt as it comes with TC 9.22, with no editing, and compare it to, for example, TC's keyboard.txt - see http://madsenworld.dk/tcmd/comparecontent.png - this is with a fresh ini file.

Re: Error in file format detection in CompareByContents

Post by *ghisler(Author) »

Total Commander checks the first read block (128 kBytes) of each file. The file is detected as UTF-8 if that block
- contains valid UTF-8 codes
or
- contains valid UTF-8 encoding headers
but does NOT contain characters encoded in ANSI which are not valid UTF-8 codes.
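
The rule described above can be sketched in Python as follows. This is only an illustration of the description, not Total Commander's actual code: the function name guess_encoding, the header pattern HEADER_RE, and the reading of "128 kBytes" as 131072 bytes are all assumptions.

```python
import codecs
import re

# Assumption: "128 kBytes" means 128 * 1024 bytes.
BLOCK_SIZE = 128 * 1024

# Hypothetical pattern for a UTF-8 encoding header such as the META tag
# quoted earlier in this thread; the real list of headers is not known.
HEADER_RE = re.compile(rb"charset\s*=\s*[\"']?utf-8", re.IGNORECASE)


def guess_encoding(data: bytes) -> str:
    """Guess 'UTF-8' or 'ANSI' from the first block of a file only."""
    block = data[:BLOCK_SIZE]

    # An incremental decoder with final=False tolerates a multi-byte
    # sequence that is cut off at the end of the block, but still raises
    # on bytes that can never be valid UTF-8.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(block, final=False)
    except UnicodeDecodeError:
        return "ANSI"

    has_utf8_codes = any(byte >= 0x80 for byte in block)
    has_header = HEADER_RE.search(block) is not None
    return "UTF-8" if (has_utf8_codes or has_header) else "ANSI"
```

With this sketch, a file that has the META header and no non-UTF-8 bytes inside the first block is reported as UTF-8, even if ANSI text follows later in the file.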

The history file contains the following ANSI text beyond the 128kBytes limit:
äö.do.not.remove

So when you remove some lines, this text gets into the search range for the encoding detection function.
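
To check where such a text sits in a given file, one can look for the first byte that is not valid UTF-8 and compare its offset with the block size. The following is a rough sketch using a synthetic stand-in for history.txt; the helper names and the filler sizes are made up for illustration, and 128 kBytes is again assumed to mean 131072 bytes.

```python
# Assumption: "128 kBytes" means 128 * 1024 bytes.
BLOCK_SIZE = 128 * 1024


def first_non_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def inside_first_block(data: bytes) -> bool:
    """True if non-UTF-8 bytes fall inside the first read block."""
    offset = first_non_utf8_offset(data)
    return offset is not None and offset < BLOCK_SIZE


# The ANSI (Windows-1252) bytes of the text quoted above.
ansi_text = b"\xe4\xf6.do.not.remove"

# Synthetic file: 140000 bytes of ASCII filler, then the ANSI text.
sample = b"x" * 140_000 + ansi_text

# Removing 10000 bytes from the start moves the ANSI text to offset 130000.
trimmed = sample[10_000:]

print(inside_first_block(sample))   # False: the text lies beyond the block
print(inside_first_block(trimmed))  # True: the text is now inside the block
```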

Re: Error in file format detection in CompareByContents

Post by *petermad »

ghisler(Author) wrote: "So when you remove some lines, this text gets into the search range for the encoding detection function."
I suspected something like that. So no bug, more like a coincidence - you can move this to "TC Behaviour which will not be changed"

Re: Error in file format detection in CompareByContents

Post by *Usher »

2petermad
I did it and all went OK as I said before. But you are using 64-bit TC, mine is 32-bit.
Edit: Didn't test "äö.do.not.remove" yet.

Re: Error in file format detection in CompareByContents

Post by *petermad »

2Usher
Usher wrote: "But you are using 64-bit TC, mine is 32-bit"
Same result with 32bit TC: http://madsenworld.dk/tcmd/comparecontent2.png

Are you sure you are testing with the right version of the history.txt file? You won't see the problem if your history.txt is from Total Commander 9.20 release candidate 1 or older. The history.txt from TC 9.22 is 782460 bytes; the file has to be larger than around 762752 bytes to be detected as UTF-8. The history.txt from TC 9.20 rc1 is 761886 bytes, so there the file is detected as ANSI.
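
These sizes appear to fit the 128 kByte explanation given earlier in the thread. A rough check in Python, using only numbers quoted in this thread; it assumes that new entries are added at the top of history.txt (so the ANSI text keeps a fixed distance from the end of the file) and that 128 kBytes means 131072 bytes. The variable names are made up for illustration.

```python
BLOCK_SIZE = 128 * 1024        # 131072, assumed meaning of "128 kBytes"

size_tc_922 = 782_460          # history.txt from TC 9.22
size_tc_920_rc1 = 761_886      # history.txt from TC 9.20 rc1
approx_threshold = 762_752     # approximate ("around") size limit quoted above
first_152_lines = 19_705       # bytes in the first 152 lines, as reported earlier

# How far the TC 9.22 file is above the approximate threshold.
excess = size_tc_922 - approx_threshold

# Under the assumptions above, the ANSI text would sit roughly this far
# from the end of the file.
distance_from_end = approx_threshold - BLOCK_SIZE

print(excess, first_152_lines)              # prints: 19708 19705
print(distance_from_end)                    # prints: 631680
print(size_tc_920_rc1 < approx_threshold)   # prints: True
```

The excess of 19708 bytes is within a few bytes of the 19705 bytes taken up by the first 152 lines, which would explain why deleting exactly those lines flips the detection.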

Re: Error in file format detection in CompareByContents

Post by *Usher »

@petermad
I'm using TC 9.22 final, your screenshot is from some beta/RC…

Just use Windows XP ;-P
MS has changed Unicode detection in newer Windows, as Notepad in Windows XP has problems with a text as simple as "Bush hid the facts":
https://en.wikipedia.org/wiki/Bush_hid_the_facts

Re: Error in file format detection in CompareByContents

Post by *petermad »

2Usher
Usher wrote: "I'm using TC 9.22 final, your screenshot is from some beta/RC"
It doesn't matter - I get the same in TC 9.22: http://madsenworld.dk/tcmd/comparecontent3.png and http://madsenworld.dk/tcmd/comparecontent4.png

Usher wrote: "Just use Windows XP"
Same thing there: http://madsenworld.dk/tcmd/comparecontent5.png ;-) AFAIK TC uses its own charset detection in CompareByContent

As I wrote earlier, this behavior goes all the way back to TC 9.0a rc3: http://madsenworld.dk/tcmd/comparecontent6.png - before that the current (v. 9.22) history.txt is detected as ANSI: http://madsenworld.dk/tcmd/comparecontent7.png

Could you perhaps provide a similar screenshot of what you see on your system (clean ini file and comparing the history.txt from TC 9.22)?

Re: Error in file format detection in CompareByContent

Post by *Usher »

2petermad
I suspect that the detection depends on Windows settings. If you use the same codepage as in the detection string (Windows-1252), TC will show UTF-8, but I use Windows-1250.
Or maybe that's something with other Windows language settings (for RTL, CJK, non-Unicode apps etc.). It looks like in my case Compare by Content always shows ANSI, even for wincmd.ini in Unicode UTF-16 + BOM.
I will do more tests in Windows XP, 7 and 10 using Windows-1250 codepage.

Re: Error in file format detection in CompareByContent

Post by *petermad »

Usher wrote: "I suspect that the detection depends on Windows settings. If you use the same codepage as in the detection string (Windows-1252), TC will show UTF-8, but I use Windows-1250.
Or maybe that's something with other Windows language settings (for RTL, CJK, non-Unicode apps etc.)"
It seems like it.

I tried to set the "Language for programs that does not support Unicode" to Polish, but it didn't change anything for me. (Windows 7)

I also tried to change the "Script" in the Font setting of CompareByContent to something different (Central European - I didn't get the option of Eastern European), but I still get the same result. (Windows 7 & 10)