Verify checksum: bad performance
- ghisler(Author)
- Site Admin
- Posts: 50541
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
For test purposes, I'm using MarcinW's suggested method to get the file size now, but only in the 64-bit version. That's the "running dots" you see. But the 8.0 version still used the FindFirstFile method!
Maybe you just saw the operating system caching the directory?
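For readers comparing the two approaches, here is a hedged sketch. MarcinW's exact suggested method is not quoted in this excerpt, so GetFileAttributesEx is only an assumed stand-in for it; FindFirstFile is the call named above:
Code: Select all
uses Windows;

{ The old approach: FindFirstFile creates a search handle and does
  wildcard matching just to read one file's size. }
function SizeViaFindFirstFile(const FileName: string): Int64;
var
  FindData: TWin32FindData;
  H: THandle;
begin
  Result := -1;
  H := FindFirstFile(PChar(FileName), FindData);
  if H <> INVALID_HANDLE_VALUE then
  begin
    Result := (Int64(FindData.nFileSizeHigh) shl 32) or FindData.nFileSizeLow;
    Windows.FindClose(H);   { close the search handle }
  end;
end;

{ A lighter-weight alternative: queries the attributes of one file directly.
  (Assumption: MarcinW's actual method may differ.) }
function SizeViaGetFileAttributesEx(const FileName: string): Int64;
var
  Data: TWin32FileAttributeData;
begin
  Result := -1;
  if GetFileAttributesEx(PChar(FileName), GetFileExInfoStandard, @Data) then
    Result := (Int64(Data.nFileSizeHigh) shl 32) or Data.nFileSizeLow;
end;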
Author of Total Commander
https://www.ghisler.com
2ghisler(Author)
I know what the "running dots" actually mean; I wrote it in my first post. I'm just using the term because it's simple and obvious (it describes exactly what one sees).
I retested all versions on a different computer with Windows 8 x64 (the previous test was on Windows 7 x64). This time, before each try, I restarted the computer to get rid of caching. But the results are basically the same:
Code: Select all
TC version   "running dots" [s]   whole verification [s]
8.0 x32      435                  955
8.01 x32     420                  905
8.0 x64        9                  130
8.01 x64       7                  115
Windows 10 Pro x64, Windows 11 Pro x64
I see the same thing. In 32-bit TC, the dots are definitely not running. They are crawling at best.
I tried it without restarting, just to measure TC's raw dot-counting power. ;) So this time I did not measure the actual checksum computing, just the "running dots" part. Everything was cached nicely in memory, so that moody harddrive couldn't influence anything. I repeated it a few times and took the best results (it took a few runs to convince Windows to really keep everything cached).
For an sfv file with 50,000 lines it was 100/5 seconds (32-bit/64-bit). With fewer lines it was relatively uninteresting for the 64-bit version (I mean, how much faster can it get when it already starts at a fast 5 seconds), but for the 32-bit version it was different:
50,000 lines took 100 seconds, 40,000 lines 63 seconds, 30,000 lines 42 seconds, 20,000 lines 17 seconds and 10,000 lines only 4 seconds. Another interesting thing: at the beginning the dots appear slowly, but it gets faster and faster, and at the end they fly almost as fast as in the 64-bit version. And the whole time the bottleneck is the CPU; TC is squeezing the maximum out of one core.
There is no difference between 8.0 and 8.01rc1. I can go back as far as 8.0beta1 and it's always the same: 64-bit fast, 32-bit horribly slow (even 7.56a is the same).
- ghisler(Author)
- Site Admin
- Posts: 50541
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Strange - TC 8.01 now does the counting and the refreshing in different threads. I see a difference of about 45 seconds vs. 30 seconds between TC 8 and TC 8.01...
Author of Total Commander
https://www.ghisler.com
Try the following simple test. Unpack the archive to get sfv files with 10, 20, 40 and 80 thousand lines. All lines reference a single file (TC not checking for duplicates works great for this purpose). Start the verification and watch the clock. With 64-bit TC, the amount of time needed for the "dot-counting" phase is linear: about 5 seconds for test 80, half that for 40, etc. But with 32-bit TC I need 2, 6, 22 and 87 seconds for tests 10, 20, 40 and 80. A little measuring error here and there, but the pattern is clear: twice the lines in the sfv = four times the time needed for "dot-counting".
The only difference I see between 8.0 and 8.01rc1 is in the second phase, the actual computing of checksums. With a random selection of 10 thousand files (so not just the one file used for the previous test), it takes 22 seconds in 32-bit 8.0, and 18 seconds in 64-bit 8.0 and in both 32- and 64-bit 8.01rc1. So in fact I'm seeing something a little different from umbra; his results show much bigger differences. But perhaps it depends on the composition of the files, their sizes, etc., and also on disk caching.
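For anyone who wants to reproduce this without the attachment below: such sfv files are trivial to generate. A minimal sketch (the file name and CRC value are placeholders; any existing file and its real CRC32 would do):
Code: Select all
{ Writes an sfv file with the given number of lines, all referencing the
  same file - matching the test setup described above. }
procedure WriteTestSfv(const FileName: string; Lines: Integer);
var
  F: TextFile;
  I: Integer;
begin
  AssignFile(F, FileName);
  Rewrite(F);
  try
    for I := 1 to Lines do
      WriteLn(F, 'dummy.bin 12345678');   { <name> <crc32> per sfv line }
  finally
    CloseFile(F);
  end;
end;

{ Usage: WriteTestSfv('test80.sfv', 80000); }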
Code: Select all
MIME-Version: 1.0
Content-Type: application/octet-stream; name="sfvtest.7z"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="sfvtest.7z"
N3q8ryccAAO0mnemmAIAAAAAAAAjAAAAAAAAAOc9KZH/7uABiV0AOhlKzhwvU/qUvzY0/TBh+Mqx
IwIRkH+GJwMCFSAW5/tecPa/pctiws5dpws+eZhO1lOPNtakQNJngHalJgFNr1OWGTlGA/mo8XBN
NihOJaBwiqRCFJ+hFDa5J1FR5F6MQU/MmzHxWmuxi2ac3CJKyog/XzH7yE/oEz2344C6I6j4BovZ
9X/WDRs/fXergxQhbtHVevM+FFNidqxZ5OVbBQj5x9qt/PtSK3TNHlsgQvndUz34KWQJO4DLKmzf
tTvwxL0uX6oPPktmQpATDv8Qk/hxeFn4C83/lShGD6n8fN77mjAuVsCPhfODgcBlxCVT+PWRNjEF
pbDub8FwTUcM0ZERqq1gHbrOsScYXFmG6WZSWL7I6H+Exve8yvNJhufbyEXSGYIAIj0APq0mwEAA
j7UZWlU54b2iPlZlfkj00HRWg5wEbsy92eiKrZct/y7rmh26zrEnGFxZhulmUli+6XasWeTlWwUI
+cfarfz7Uit0zR5bIEL53VM9+ClkCTuAyyps37U6zcS0i432AHIA7HNTp/2+rnwxGp+3jTFucJ6n
I1/sKMuF0ZWYin4qkfIndfcZwAaYTZj92K/VkA/EJVP49ZE2MQWlsO5vwXBNRwzRkRGqrWAdus6x
JxhcWYbpZlJYvul2rFnk5VsFCPnH2q38+1IrdM0eWx2UrdgRce0AAAAAgTMHrg/VQwVfecZm2o3z
7dDhUqjCyNLqjLux7MjS8pRm7k4480rb6Zimdz7DpDNmWCA3SZuUnY4RMKA1Lfmh3761Zx2WagL8
BprHqrzLB4GGrfWOwf0o0PqBWkAtftwCtWU+Fl/LKgqVw8lF8A/b4PO31GHqaZW6JTLrd4/Coeu8
FULSPB0vUli+OwAAFwaCCQEJgI8ABwsBAAEjAwEBBV0AEAAADIDbCgHK2HkaAAA=
- ghisler(Author)
- Site Admin
- Posts: 50541
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Please try RC2 - I think I found the problem: it's the function strscan! It is very slow when used with big buffers, even if the character is only a few bytes from the start position! I'm now limiting the search range to 2000 characters.
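TC's actual strscan routine isn't public (it may be Delphi's StrScan or a custom one), so the following is only an illustrative sketch of the fix described above: a bounded scan that touches at most MaxLen characters per call, instead of potentially measuring or walking the whole remaining buffer.
Code: Select all
{ Hypothetical bounded scan: returns a pointer to the first occurrence of
  Chr within the first MaxLen characters of Str, or nil if it is not found
  or the terminating #0 is reached first. }
function StrScanLimited(Str: PChar; Chr: Char; MaxLen: Integer): PChar;
var
  I: Integer;
begin
  Result := nil;
  for I := 0 to MaxLen - 1 do
  begin
    if Str[I] = Chr then
    begin
      Result := Str + I;
      Exit;
    end;
    if Str[I] = #0 then
      Exit;   { end of buffer before the separator }
  end;
end;
If each of N sfv lines previously triggered work proportional to the whole buffer, the total was on the order of N times the buffer size, which fits the "twice the lines = four times the time" pattern reported above; a fixed 2000-character cap makes the total linear in the number of lines.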
Author of Total Commander
https://www.ghisler.com
It's much better, although there still might be something left.
1) Phase 1, dot-counting:
Huge improvement, it's now linear in TC32 too. But it still takes almost twice the time in TC32 (8 seconds) compared to TC64 (5 seconds with test80.sfv from the previous post). Do you see it too?
For real-world use with uncached data it probably won't be a problem; the process should be slowed down enough by the harddrive that the difference won't be noticeable. But it would be interesting to know where it comes from.
2) Phase 2, checksum verification:
The "update result list only every 100ms" mentioned in history.txt is a big improvement for cached data. What previously took 18 seconds now takes only about 2 in TC64 and 3 in TC32. Again, there is a difference for some reason. With more data (also completely cached) it's 11 seconds for TC64 and 16 seconds for TC32. But a large part of that is the actual checksum computing (in the case of crc32 it runs at the same speed in both TC32 and TC64), so the difference must come from something else done in the process. Maybe something similar to what slows down phase 1, whatever that might be. But it probably won't make much difference with uncached data either.
3) Future improvements, hashing algorithm speed:
As I pointed out earlier in this thread (this post), TC's hashing does not seem to be the fastest. E.g. for SHA-1, TC can only process 84/67 MB/s (32/64-bit) on my computer. On average that's enough for me, I have rather slow disks. But with something faster, TC would be slowing me down considerably. The same hardware can process 310 MB/s using system functions. Some clever solution for this would be nice.
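The post doesn't say which "system functions" reached 310 MB/s; the Windows CryptoAPI is one likely candidate, so the following Delphi sketch should be read as an assumption about the comparison, not as what was actually benchmarked:
Code: Select all
uses Windows, SysUtils;

const
  PROV_RSA_AES        = 24;          { provider type that offers SHA-1 }
  CRYPT_VERIFYCONTEXT = $F0000000;   { no key container needed }
  CALG_SHA1           = $00008004;
  HP_HASHVAL          = 2;

function CryptAcquireContext(var hProv: THandle; Container, Provider: PAnsiChar;
  ProvType, Flags: DWORD): BOOL; stdcall;
  external 'advapi32.dll' name 'CryptAcquireContextA';
function CryptCreateHash(hProv: THandle; AlgId: DWORD; hKey: THandle;
  Flags: DWORD; var hHash: THandle): BOOL; stdcall; external 'advapi32.dll';
function CryptHashData(hHash: THandle; Data: PByte; DataLen, Flags: DWORD): BOOL;
  stdcall; external 'advapi32.dll';
function CryptGetHashParam(hHash: THandle; Param: DWORD; Data: PByte;
  var DataLen: DWORD; Flags: DWORD): BOOL; stdcall; external 'advapi32.dll';
function CryptDestroyHash(hHash: THandle): BOOL; stdcall; external 'advapi32.dll';
function CryptReleaseContext(hProv: THandle; Flags: DWORD): BOOL; stdcall;
  external 'advapi32.dll';

{ Hashes one memory block; for a file, CryptHashData would be called once
  per read block inside the read loop. }
function Sha1OfBuffer(Buf: PByte; Len: DWORD): string;
var
  hProv, hHash: THandle;
  Digest: array[0..19] of Byte;   { SHA-1 digest is 20 bytes }
  DigLen: DWORD;
  I: Integer;
begin
  Result := '';
  if not CryptAcquireContext(hProv, nil, nil, PROV_RSA_AES, CRYPT_VERIFYCONTEXT) then
    Exit;
  try
    if CryptCreateHash(hProv, CALG_SHA1, 0, 0, hHash) then
    try
      DigLen := SizeOf(Digest);
      if CryptHashData(hHash, Buf, Len, 0) and
         CryptGetHashParam(hHash, HP_HASHVAL, PByte(@Digest), DigLen, 0) then
        for I := 0 to Integer(DigLen) - 1 do
          Result := Result + IntToHex(Digest[I], 2);
    finally
      CryptDestroyHash(hHash);
    end;
  finally
    CryptReleaseContext(hProv, 0);
  end;
end;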
Here is an updated table of my previous test (~50,000 source files, ~300 MiB):
The performance improvement looks very interesting. With many small files, TC x32 is now more than 50 times faster.
Code: Select all
TC version     "running dots" [s]   whole verification [s]
8.0 x32        435                  955
8.0 x64          9                  130
8.01 rc1 x32   420                  905
8.01 rc1 x64     7                  115
8.01 rc2 x32     3                   17
8.01 rc2 x64     3                   16

Windows 10 Pro x64, Windows 11 Pro x64
- ghisler(Author)
- Site Admin
- Posts: 50541
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Thanks for your tests! The main problem with the counting was the string search function, which took very long with huge CRC files. I never noticed it because I never used such a large crc file - I mostly tested with fewer, larger files rather than many small ones.
The other big problem was the scrolling to the END of the list each time a line was added. Now, with the update every 100 ms, it adds many lines at once (with refresh turned off) and then jumps to the end of all the lines. This way you don't see every single file scrolling by, but you can jump through the errors at the end. TC also no longer limits the length of the result list, even when there are more than 65535 files.
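A minimal sketch of this batching pattern, assuming a VCL TListBox and a hypothetical AddResultLine helper (TC's actual control and code are not public):
Code: Select all
{ Assumes: uses Windows, Classes, StdCtrls. Pending is created once at
  startup (Pending := TStringList.Create; LastFlush := GetTickCount);
  a final flush is still needed when the verification loop finishes. }
var
  Pending: TStringList;
  LastFlush: Cardinal;

procedure AddResultLine(List: TListBox; const Line: string);
begin
  Pending.Add(Line);
  if GetTickCount - LastFlush >= 100 then   { flush at most every 100 ms }
  begin
    List.Items.BeginUpdate;                 { suspend redraw while adding }
    try
      List.Items.AddStrings(Pending);       { add the whole batch at once }
      Pending.Clear;
    finally
      List.Items.EndUpdate;
    end;
    List.TopIndex := List.Items.Count - 1;  { jump to the end once, not per line }
    LastFlush := GetTickCount;
  end;
end;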
Author of Total Commander
https://www.ghisler.com
With the latest changes, something went wrong in 64-bit RC2:
Code: Select all
test1.sfv:
OK: 008.bin
009.bin
OK: 010.bin
Wrong CRC: 011.bin
012.bin
Errors: 2
OK: 3, not found: 0, read error: 0, wrong checksum: 2
Previous versions and 32-bit RC2 show it correctly:
Code: Select all
test1.sfv:
OK: 008.bin
OK: 009.bin
OK: 010.bin
Wrong CRC: 011.bin
Wrong CRC: 012.bin
Errors: 2
OK: 3, not found: 0, read error: 0, wrong checksum: 2
Test files: http://web.hisoftware.cz/sob/tmp/tc-crc-bug-files.7z
Hello,
I've got one tip to speed up operations with reference-counted variables (dynamic strings, dynamic arrays - not available in Delphi 2 - and records containing dynamic strings or dynamic arrays). It may not be the case here, but it may be useful in general.
Short version: we should always use the "const" keyword in the argument definition when declaring a procedure/function that takes reference-counted arguments which don't need to be changed. Example: we should declare procedure Test(const Str : string) instead of procedure Test(Str : string) when we don't need to modify Str.
Long version: consider the following example:
Code: Select all
procedure Test1(Str : string);
begin
  Length(Str);
end;

procedure Test2(const Str : string);
begin
  Length(Str);
end;

procedure TForm1.Button1Click(Sender: TObject);
var
  I : Integer;
  Time : Cardinal;
  Time1 : Cardinal;
  Time2 : Cardinal;
  TestStr : string;
begin
  TestStr:='abc';
  Time:=GetTickCount;
  for I:=0 to 10000000 do
    Test1(TestStr);
  Time1:=GetTickCount-Time;
  Time:=GetTickCount;
  for I:=0 to 10000000 do
    Test2(TestStr);
  Time2:=GetTickCount-Time;
  Caption:='Time1='+IntToStr(Time1)+'ms, Time2='+IntToStr(Time2)+'ms';
end;
The results on the test computer are 1050 ms for Test1 and 150 ms for Test2. Why such a big difference? Let's see Test2 and Test1 in assembler:
Code: Select all
Test2 proc near
call @System@@LStrLen
retn
Test2 endp
Code: Select all
Test1 proc near
var_AutogeneratedString= dword ptr -4
push ebp
mov ebp, esp
push ecx
mov [ebp+var_AutogeneratedString], eax
mov eax, [ebp+var_AutogeneratedString]
call @System@@LStrAddRef
xor eax, eax
autogenerated_try:
push ebp
push offset autogenerated_finally2
push dword ptr fs:[eax]
mov fs:[eax], esp
mov eax, [ebp+var_AutogeneratedString]
call @System@@LStrLen
xor eax, eax
pop edx
pop ecx
pop ecx
mov fs:[eax], edx
push offset EndProc
autogenerated_finally1:
lea eax, [ebp+var_AutogeneratedString]
call @System@@LStrClr
retn
; ---------------------------------------------------------------------------
autogenerated_finally2:
jmp @System@@HandleFinally
; ---------------------------------------------------------------------------
jmp short autogenerated_finally1
; ---------------------------------------------------------------------------
EndProc:
pop ecx
pop ebp
retn
Test1 endp
As we can see, the compiler generated an invisible, internal copy of the argument string in Test1. This string is assigned at the beginning (LStrAddRef), then there is an LStrLen (= Length) call on this copy, and the copy is freed at the end, inside the autogenerated, invisible finally section (LStrClr).
For Test2, we instructed the compiler that we are not going to modify the string (in effect, the compiler won't allow the string to be modified inside Test2), so the compiler calls LStrLen directly on the argument string, without creating any copies.
It is even more important to use the "const" keyword for reference-counted records:
Code: Select all
type
  TTestRec = record
    X1 : Byte;
    X2 : string;
  end;

procedure Test1(TestRec : TTestRec);
begin
  Length(TestRec.X2);
end;

procedure Test2(const TestRec : TTestRec);
begin
  Length(TestRec.X2);
end;
In this case, the compiler generates AddRefRecord instead of LStrAddRef and FinalizeRecord instead of LStrClr for Test1. Adding/clearing the reference for records is much slower than for strings, and our test takes 2030 ms on the test computer.
Regards!
I also confirm that Sob's bug is solved.
2MarcinW
I always thought that using 'const' in such situations was just good manners, making the code more obvious and less error-prone. A good compiler should be able to recognize that you are not changing a variable - especially in your simple test. It's interesting that this is not the case here.
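One likely reason the compiler cannot simply drop the copy: a by-value string parameter behaves like a mutable local, so the callee must own its own reference. A hedged illustration (this snippet is not from the thread):
Code: Select all
procedure Test1(Str : string);
begin
  { Legal: Str acts as a local variable here. The caller's string must not
    change, so the compiler takes a reference on entry, and copy-on-write
    performs the actual duplication at the moment of this assignment. }
  Str := Str + 'x';
end;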
Windows 10 Pro x64, Windows 11 Pro x64