Bug in Cyrillic UTF-8

LonerD · Post by *LonerD » 2012-01-16, 14:31 UTC

TC x32 8.00b16
Win7SP1 x64 Eng

Image: http://i30.fastpic.ru/big/2012/0116/c7/9255b4fbac02fe498020cf99c65488c7.png

The first document in the screenshot, saved as ANSI, is displayed properly; the second, saved as UTF-8, is not.

Post by *ghisler(Author) » 2012-01-16, 15:52 UTC

Does the UTF-8 file have a byte order mark (BOM)?

You can check it by viewing the file with F3. Then press '1' to see the plain text.

LonerD · Post by *LonerD » 2012-01-17, 01:25 UTC

When I open the file in Lister:
1 - Plain Text - wrong symbols
7 - UTF-8 - normal letters

sqa_wizard · Post by *sqa_wizard » 2012-01-17, 20:09 UTC

Well, to ask in more detail:

1. View the file with F3
2. Press '1' to see the plain text
3. Have a look at the first 3 characters

Do they look like this?

ï»¿

If yes, that is the BOM, which clearly marks the file as UTF-8.
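The same check can be done on the raw bytes instead of by eye. The UTF-8 BOM is the byte sequence EF BB BF, which a Windows-1252 plain-text view shows as the three characters above. The following is only a minimal sketch: HasUtf8Bom is a hypothetical helper name, not part of TC, and it assumes the first bytes of the file have already been read into a byte array.

```pascal
program BomCheck;
{$ifdef FPC}{$mode delphi}{$endif}

{ Hypothetical helper, not TC code: returns true if the buffer starts
  with EF BB BF, the UTF-8 encoding of U+FEFF (the byte order mark). }
function HasUtf8Bom(const buf: array of byte): boolean;
begin
  result := (Length(buf) >= 3) and
            (buf[0] = $EF) and (buf[1] = $BB) and (buf[2] = $BF);
end;

begin
  { BOM followed by D0 9F, the UTF-8 bytes of the Cyrillic letter "П" }
  writeln(HasUtf8Bom([$EF, $BB, $BF, $D0, $9F]));  { prints TRUE }
  { The same letter without a BOM }
  writeln(HasUtf8Bom([$D0, $9F]));                 { prints FALSE }
end.
```

A file without these three bytes can still be UTF-8; it just does not announce it, so the content itself has to be examined.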

Post by *ghisler(Author) » 2012-01-18, 15:41 UTC

You no longer need to check it: I will add support for both types (with or without BOM) in the next beta.

LonerD · Post by *LonerD » 2012-01-20, 17:03 UTC

ghisler(Author),
No changes in beta17?

sqa_wizard wrote: Do they look like this?

In beta17 it looks like this:
http://i28.fastpic.ru/big/2012/0120/39/7051875e18bae325fcee2366b3d02239.png

Post by *ghisler(Author) » 2012-01-22, 08:54 UTC

LonerD wrote: No changes in beta17?

Actually, UTF-8 is now supported by the internal text-to-thumbnail converter! I guess that you have some Lister plugin installed which does the conversion instead of TC, and does it wrong.

Please try this: Go to menu Configuration - Options - Thumbnails, and turn off all methods except for the last (text preview). Then switch thumbs view on. You may need to right click on the thumb and choose to re-load it.

Kevlar · Post by *Kevlar » 2012-01-22, 12:39 UTC

On my PC, beta 16 rendered the thumbnail of a txt file with Cyrillic UTF-8 content wrongly. Beta 17 rendered the thumbnail of the same file correctly. Beta 17a works fine too.

LonerD · Post by *LonerD » 2012-01-25, 11:30 UTC

LonerD wrote: In beta17 it looks like this:

Oh, that was my mistake.

The issue has been resolved.
In beta 17 and 17a everything is shown correctly.
Thanks.

Alextp · Post by *Alextp » 2012-01-25, 11:57 UTC

2Ghisler

Christian, could you donate your UTF-8 detection code to my open-source SynWrite?

Post by *ghisler(Author) » 2012-01-25, 16:13 UTC

Sure, it's quite simple, see function IsBufferUtf8 below. PartialAllowed must be set to true if the buffer is smaller than the file, because the buffer may then end in the middle of a multi-byte character.

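{Number of continuation bytes that follow each possible first byte. The 256-entry table assumes that char is a single byte (AnsiChar).}
const bytesFromUTF8:array[char] of byte = (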
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  // 32
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  // 64
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  // 96
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  //128
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  //160
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,  //192
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,  //224
  2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5); //256

function GetUtf8CharWidth(firstchar:char):integer;
begin
  result:=bytesFromUTF8[firstchar]+1;
end;

function IsFirstUTF8Char(thechar:char):boolean;
{A first byte is any byte whose two most significant bits are not 10; this includes plain ASCII.}
begin
  result:=(byte(thechar) and (128+64))<>128;
end;

function IsSecondaryUTF8Char(thechar:char):boolean;
{The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.}
begin
  result:=(byte(thechar) and (128+64))=128;
end;

function IsBufferUtf8(buf:pchar;PartialAllowed:boolean):boolean;
{Returns true if the null-terminated buffer contains at least one multi-byte sequence and only well-formed ones: no secondary byte on its own, no first byte without the correct number of secondary bytes}
var p:pchar;
    utf8bytes:integer;
    hadutf8bytes:boolean;
begin
  p:=buf;
  hadutf8bytes:=false;
  result:=false;
  utf8bytes:=0;
  while p[0]<>#0 do begin
    if utf8bytes>0 then begin  {Expecting secondary char}
      hadutf8bytes:=true;
      if not IsSecondaryUTF8Char(p[0]) then exit;  {Fail!}
      dec(utf8bytes);
    end else if IsFirstUTF8Char(p[0]) then
      utf8bytes:=GetUtf8CharWidth(p[0])-1
    else if IsSecondaryUTF8Char(p[0]) then
      exit;  {Fail!}
    inc(p);
  end;
  result:=hadutf8bytes and (PartialAllowed or (utf8bytes=0));
end;
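
To see how the check behaves on concrete input, here is a condensed, self-contained restatement of the same logic as a test program. It is only a sketch: LooksLikeUtf8, TrailCount and the sample constants are hypothetical names introduced for this example, the table lookup is replaced by range comparisons that yield the same values, and PAnsiChar is used on the assumption that the buffer holds single-byte characters.

```pascal
program Utf8CheckDemo;
{$ifdef FPC}{$mode delphi}{$endif}

{ Number of continuation bytes announced by a first byte;
  same values as the bytesFromUTF8 table. }
function TrailCount(lead: byte): integer;
begin
  if lead < $C0 then result := 0
  else if lead < $E0 then result := 1
  else if lead < $F0 then result := 2
  else if lead < $F8 then result := 3
  else if lead < $FC then result := 4
  else result := 5;
end;

{ Hypothetical restatement of IsBufferUtf8 for a null-terminated buffer. }
function LooksLikeUtf8(buf: PAnsiChar; PartialAllowed: boolean): boolean;
var pending: integer;
    sawMultiByte: boolean;
    b: byte;
begin
  result := false;
  pending := 0;
  sawMultiByte := false;
  while buf^ <> #0 do begin
    b := byte(buf^);
    if pending > 0 then begin
      if (b and $C0) <> $80 then exit;  { continuation byte expected }
      sawMultiByte := true;
      dec(pending);
    end else if (b and $C0) = $80 then
      exit                               { stray continuation byte }
    else
      pending := TrailCount(b);
    inc(buf);
  end;
  result := sawMultiByte and (PartialAllowed or (pending = 0));
end;

const
  { UTF-8 "П" (D0 9F), then a cut-off first byte D1, then the terminator }
  Truncated: array[0..3] of byte = ($D0, $9F, $D1, 0);
  { The word "Привет" in Windows-1251 (ANSI), then the terminator }
  Ansi1251: array[0..6] of byte = ($CF, $F0, $E8, $E2, $E5, $F2, 0);
  { Plain ASCII "abc", then the terminator }
  PlainAscii: array[0..3] of byte = ($61, $62, $63, 0);

begin
  writeln(LooksLikeUtf8(PAnsiChar(@Truncated[0]), true));    { TRUE }
  writeln(LooksLikeUtf8(PAnsiChar(@Truncated[0]), false));   { FALSE }
  writeln(LooksLikeUtf8(PAnsiChar(@Ansi1251[0]), false));    { FALSE }
  writeln(LooksLikeUtf8(PAnsiChar(@PlainAscii[0]), false));  { FALSE }
end.
```

Three properties of this approach are worth knowing. Pure ASCII text yields false, since it contains no multi-byte sequence and reads the same either way. Scanning stops at the first #0 byte, so the buffer must be null-terminated. The check is a heuristic rather than a strict validator: it accepts the old 5- and 6-byte forms and overlong first bytes such as C0 and C1, which RFC 3629 does not allow.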