Bug in Cyrillic UTF-8

Bug reports will be moved here when the described bug has been fixed

LonerD
Senior Member
Posts: 381
Joined: 2010-06-19, 20:18 UTC
Location: Makeyevka, Russia

Post by *LonerD »

TC x32 8.00b16
Win7SP1 x64 Eng

Image: http://i30.fastpic.ru/big/2012/0116/c7/9255b4fbac02fe498020cf99c65488c7.png

The first document in the screenshot is saved in ANSI and is displayed properly; the second is saved in UTF-8 and is not.
Last edited by LonerD on 2012-01-20, 16:43 UTC, edited 1 time in total.
"I used to feel guilty in Cambridge that I spent all day playing games, while I was supposed to be doing mathematics. Then, when I discovered surreal numbers, I realized that playing games IS math." John Horton Conway
ghisler(Author)
Site Admin
Posts: 48097
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

Does the UTF-8 file have a byte order mark (BOM)?

You can check it by viewing the file with F3. Then press '1' to see the plain text.
Author of Total Commander
https://www.ghisler.com

Post by *LonerD »

When I open the file in Lister:
1 - Plain Text - wrong symbols
7 - UTF-8 - normal letters
sqa_wizard
Power Member
Posts: 3864
Joined: 2003-02-06, 11:41 UTC
Location: Germany

Post by *sqa_wizard »

Well, to ask in more detail:

1. View the file with F3
2. Press '1' to see the plain text
3. Have a look at the first 3 characters

Do they look like this?

Code:

ï»¿

If yes, this is the BOM (the bytes EF BB BF), which clearly marks the file as UTF-8.
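For anyone who would rather check this in code than in Lister, here is a minimal sketch in Pascal. The name HasUtf8Bom is made up for illustration and has nothing to do with Total Commander's internals; it only tests whether a buffer starts with the three bytes EF BB BF.

```pascal
program BomCheckDemo;
{$ifdef FPC}{$mode delphi}{$else}{$APPTYPE CONSOLE}{$endif}

{Hypothetical helper: true if the first three bytes of the buffer are
 EF BB BF, the UTF-8 byte order mark. len is the number of valid bytes.}
function HasUtf8Bom(buf: PAnsiChar; len: integer): boolean;
begin
  result := (len >= 3) and
            (byte(buf[0]) = $EF) and
            (byte(buf[1]) = $BB) and
            (byte(buf[2]) = $BF);
end;

const
  {EF BB BF followed by the letter 'A'}
  WithBom: array[0..3] of byte = ($EF, $BB, $BF, $41);
  {One Cyrillic letter in UTF-8 (D0 9F), no BOM in front}
  NoBom: array[0..1] of byte = ($D0, $9F);

begin
  writeln(HasUtf8Bom(PAnsiChar(@WithBom), SizeOf(WithBom)));  {prints TRUE}
  writeln(HasUtf8Bom(PAnsiChar(@NoBom), SizeOf(NoBom)));      {prints FALSE}
end.
```

A file without these three bytes can still be valid UTF-8; the BOM only makes the encoding unambiguous.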
#5767 Personal license

Post by *ghisler(Author) »

You no longer need to check it: I will add support for both types (with and without BOM) in the next beta.

Post by *LonerD »

ghisler(Author)
No changes in beta 17?
Quote: "Do they look like this?"
In beta 17 it looks like this:
http://i28.fastpic.ru/big/2012/0120/39/7051875e18bae325fcee2366b3d02239.png

Post by *ghisler(Author) »

Quote: "No changes in beta 17?"
Actually, UTF-8 is now supported by the internal text-to-thumbnail converter! I guess that you have some Lister plugin installed which does the conversion instead of TC, and does it wrong.

Please try this: go to the menu Configuration - Options - Thumbnails and turn off all methods except the last one (text preview). Then switch thumbnail view on. You may need to right-click the thumbnail and choose to reload it.
Kevlar
Junior Member
Posts: 32
Joined: 2009-03-30, 13:11 UTC

Post by *Kevlar »

On my PC, beta 16 rendered thumbnails of a txt file with Cyrillic UTF-8 content incorrectly. Beta 17 renders the thumbnail of the same file correctly, and beta 17a does too.

Post by *LonerD »

Quote: "in beta17 it look:"
Oh, it's my mistake.

The issue has been resolved.
In beta 17 and 17a everything is shown correctly.
Thanks.
Alextp
Power Member
Posts: 2321
Joined: 2004-08-16, 22:35 UTC
Location: Russian Federation

Post by *Alextp »

To Ghisler:

Christian, could you donate the UTF detection code to my open-source SynWrite?

Post by *ghisler(Author) »

Sure, it's quite simple: see function IsBufferUtf8 below. PartialAllowed must be set to true if the buffer is smaller than the file, because the buffer may then end in the middle of a multi-byte character.

Code:

{Number of continuation bytes that follow each possible first byte}
const bytesFromUTF8:array[char] of byte = (
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  // 32
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  // 64
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  // 96
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  //128
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  //160
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  //192
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  //224
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5); //256

function GetUtf8CharWidth(firstchar:char):integer;
begin
  result:=bytesFromUTF8[firstchar]+1;
end;

function IsFirstUTF8Char(thechar:char):boolean;
{A first byte (ASCII or the lead byte of a multi-byte sequence) does NOT have 10 as its two most significant bits.}
begin
  result:=(byte(thechar) and (128+64))<>128;
end;

function IsSecondaryUTF8Char(thechar:char):boolean;
{The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.}
begin
  result:=(byte(thechar) and (128+64))=128;
end;

function IsBufferUtf8(buf:pchar;PartialAllowed:boolean):boolean;
{Buffer contains only valid UTF-8 characters, no secondary alone, no primary without the correct nr of secondary}
var p:pchar;
    utf8bytes:integer;
    hadutf8bytes:boolean;
begin
  p:=buf;
  hadutf8bytes:=false;
  result:=false;
  utf8bytes:=0;
  while p[0]<>#0 do begin
    if utf8bytes>0 then begin  {Expecting secondary char}
      hadutf8bytes:=true;
      if not IsSecondaryUTF8Char(p[0]) then exit;  {Fail!}
      dec(utf8bytes);
    end else if IsFirstUTF8Char(p[0]) then
      utf8bytes:=GetUtf8CharWidth(p[0])-1
    else if IsSecondaryUTF8Char(p[0]) then
      exit;  {Fail!}
    inc(p);
  end;
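  {UTF-8 only if at least one multi-byte sequence was seen; a sequence cut off at the end is accepted only with PartialAllowed}
  result:=hadutf8bytes and (PartialAllowed or (utf8bytes=0));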
end;
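One note for anyone compiling this with a Unicode Delphi version (2009 or later), where char is a 16-bit WideChar: the table and the pchar parameter above assume single-byte characters. Below is a rough, self-contained sketch of the same check with AnsiChar/PAnsiChar spelled out and the lookup table replaced by range tests, so that it compiles on its own. The names TrailBytes and LooksLikeUtf8 are made up for this illustration; it is a restatement under those assumptions, not the code used in Total Commander.

```pascal
program Utf8CheckDemo;
{$ifdef FPC}{$mode delphi}{$else}{$APPTYPE CONSOLE}{$endif}

{Hypothetical helper: number of continuation bytes announced by a first
 byte, derived from its high bits (same values as the lookup table).}
function TrailBytes(b: byte): integer;
begin
  if b < $C0 then result := 0        {ASCII or a stray continuation byte}
  else if b < $E0 then result := 1
  else if b < $F0 then result := 2
  else if b < $F8 then result := 3
  else if b < $FC then result := 4
  else result := 5;
end;

{Hypothetical name. buf must be zero-terminated, as in the posted code.}
function LooksLikeUtf8(buf: PAnsiChar; PartialAllowed: boolean): boolean;
var
  p: PAnsiChar;
  pending: integer;        {continuation bytes still expected}
  sawMultiByte: boolean;
  b: byte;
begin
  result := false;
  p := buf;
  pending := 0;
  sawMultiByte := false;
  while p^ <> #0 do begin
    b := byte(p^);
    if (b and $C0) = $80 then begin  {continuation byte, bits 10xxxxxx}
      if pending = 0 then exit;      {not expected here: fail}
      dec(pending);
      sawMultiByte := true;
    end else begin                   {ASCII or lead byte}
      if pending > 0 then exit;      {previous sequence incomplete: fail}
      pending := TrailBytes(b);
    end;
    inc(p);
  end;
  result := sawMultiByte and (PartialAllowed or (pending = 0));
end;

const
  {'Privet' in Cyrillic letters as UTF-8, zero-terminated}
  Whole: array[0..12] of byte =
    ($D0,$9F,$D1,$80,$D0,$B8,$D0,$B2,$D0,$B5,$D1,$82,0);
  {Same text with the last byte missing, as if the buffer ended mid-character}
  Cut: array[0..11] of byte =
    ($D0,$9F,$D1,$80,$D0,$B8,$D0,$B2,$D0,$B5,$D1,0);
  {Same word in Windows-1251 (ANSI), zero-terminated}
  Ansi: array[0..6] of byte = ($CF,$F0,$E8,$E2,$E5,$F2,0);

begin
  writeln(LooksLikeUtf8(PAnsiChar(@Whole), false));  {prints TRUE}
  writeln(LooksLikeUtf8(PAnsiChar(@Ansi), false));   {prints FALSE}
  writeln(LooksLikeUtf8(PAnsiChar(@Cut), false));    {prints FALSE}
  writeln(LooksLikeUtf8(PAnsiChar(@Cut), true));     {prints TRUE}
end.
```

The last two lines show the effect of PartialAllowed: the buffer that stops in the middle of the final character is rejected without it and accepted with it. Like the posted function, this sketch stops at the first zero byte and does not reject overlong or 5- and 6-byte forms.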