Verify checksum: bad performance
- ghisler(Author)
- Site Admin
- Posts: 50541
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
For test purposes, I'm using MarcinW's suggested method to get the file size now, but only in the 64-bit version. That's the "running dots" you see. But the 8.0 version still used the FindFirstFile method!
Maybe you just saw the operating system caching the directory?
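For readers comparing the two approaches, here is a hedged sketch. MarcinW's exact suggested method is not quoted in this excerpt, so GetFileAttributesEx is only an assumed stand-in for it; FindFirstFile is the call named above:
Code: Select all
uses Windows;

{ The old approach: FindFirstFile creates a search handle and does
  wildcard matching just to read one file's size. }
function SizeViaFindFirstFile(const FileName: string): Int64;
var
  FindData: TWin32FindData;
  H: THandle;
begin
  Result := -1;
  H := FindFirstFile(PChar(FileName), FindData);
  if H <> INVALID_HANDLE_VALUE then
  begin
    Result := (Int64(FindData.nFileSizeHigh) shl 32) or FindData.nFileSizeLow;
    Windows.FindClose(H);   { close the search handle }
  end;
end;

{ A lighter-weight alternative: queries the attributes of one file directly.
  (Assumption: MarcinW's actual method may differ.) }
function SizeViaGetFileAttributesEx(const FileName: string): Int64;
var
  Data: TWin32FileAttributeData;
begin
  Result := -1;
  if GetFileAttributesEx(PChar(FileName), GetFileExInfoStandard, @Data) then
    Result := (Int64(Data.nFileSizeHigh) shl 32) or Data.nFileSizeLow;
end;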
Author of Total Commander
https://www.ghisler.com
2ghisler(Author)
I know what the "running dots" actually mean; I wrote it in my first post. I'm just using the term because it's simple and obvious (it describes exactly what one sees).
I retested all versions on a different computer with Windows 8 x64 (the previous test was on Windows 7 x64). This time, before each try, I restarted the computer to get rid of caching. But the results are basically the same:
Code: Select all
TC version   "running dots" [s]   whole verification [s]
8.0 x32      435                  955
8.01 x32     420                  905
8.0 x64        9                  130
8.01 x64       7                  115
Windows 10 Pro x64, Windows 11 Pro x64
I see the same thing. In 32-bit TC, the dots are definitely not running. They are crawling at best.
I tried it without restarting, just to measure TC's raw dot-counting power. ;) So this time I did not measure the actual checksum computing, just the "running dots" part. Everything was cached nicely in memory, so that moody harddrive couldn't influence anything. I repeated it a few times and took the best results (it took a few runs to convince Windows to really keep everything cached).
For an sfv file with 50,000 lines it was 100/5 seconds (32-bit/64-bit). With fewer lines it was relatively uninteresting for the 64-bit version (I mean, how much faster can it get when it already starts at a fast 5 seconds), but for the 32-bit version it was different:
50,000 lines took 100 seconds, 40,000 lines 63 seconds, 30,000 lines 42 seconds, 20,000 lines 17 seconds and 10,000 lines only 4 seconds. Another interesting thing: at the beginning the dots appear slowly, but it gets faster and faster, and at the end they fly almost as fast as in the 64-bit version. And the whole time the bottleneck is the CPU; TC is squeezing the maximum out of one core.
There is no difference between 8.0 and 8.01rc1. I can go back as far as 8.0beta1 and it's always the same: 64-bit fast, 32-bit horribly slow (even 7.56a is the same).
- ghisler(Author)
- Site Admin
- Posts: 50541
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Strange - TC 8.01 now does the counting and the refreshing in different threads. I see a difference of about 45 seconds vs. 30 seconds between TC 8 and TC 8.01...
Author of Total Commander
https://www.ghisler.com
Try the following simple test. Unpack the archive to get sfv files with 10, 20, 40 and 80 thousand lines. All lines reference a single file (TC not checking for duplicates works great for this purpose). Start the verification and watch the clock. With 64-bit TC, the amount of time needed for the "dot-counting" phase is linear: about 5 seconds for test 80, half that for 40, etc. But with 32-bit TC I need 2, 6, 22 and 87 seconds for tests 10, 20, 40 and 80. A little measuring error here and there, but the pattern is clear: twice the lines in the sfv = four times the time needed for "dot-counting".
The only difference I see between 8.0 and 8.01rc1 is in the second phase, the actual computing of checksums. With a random selection of 10 thousand files (so not just the one file used for the previous test), it takes 22 seconds in 32-bit 8.0, and 18 seconds in 64-bit 8.0 and in both 32- and 64-bit 8.01rc1. So in fact I'm seeing something a little different from umbra; his results show much bigger differences. But perhaps it depends on the composition of the files, their sizes, etc., and also on disk caching.
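For anyone who wants to reproduce this without the attachment below: such sfv files are trivial to generate. A minimal sketch (the file name and CRC value are placeholders; any existing file and its real CRC32 would do):
Code: Select all
{ Writes an sfv file with the given number of lines, all referencing the
  same file - matching the test setup described above. }
procedure WriteTestSfv(const FileName: string; Lines: Integer);
var
  F: TextFile;
  I: Integer;
begin
  AssignFile(F, FileName);
  Rewrite(F);
  try
    for I := 1 to Lines do
      WriteLn(F, 'dummy.bin 12345678');   { <name> <crc32> per sfv line }
  finally
    CloseFile(F);
  end;
end;

{ Usage: WriteTestSfv('test80.sfv', 80000); }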
Code: Select all
MIME-Version: 1.0
Content-Type: application/octet-stream; name="sfvtest.7z"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="sfvtest.7z"
N3q8ryccAAO0mnemmAIAAAAAAAAjAAAAAAAAAOc9KZH/7uABiV0AOhlKzhwvU/qUvzY0/TBh+Mqx
IwIRkH+GJwMCFSAW5/tecPa/pctiws5dpws+eZhO1lOPNtakQNJngHalJgFNr1OWGTlGA/mo8XBN
NihOJaBwiqRCFJ+hFDa5J1FR5F6MQU/MmzHxWmuxi2ac3CJKyog/XzH7yE/oEz2344C6I6j4BovZ
9X/WDRs/fXergxQhbtHVevM+FFNidqxZ5OVbBQj5x9qt/PtSK3TNHlsgQvndUz34KWQJO4DLKmzf
tTvwxL0uX6oPPktmQpATDv8Qk/hxeFn4C83/lShGD6n8fN77mjAuVsCPhfODgcBlxCVT+PWRNjEF
pbDub8FwTUcM0ZERqq1gHbrOsScYXFmG6WZSWL7I6H+Exve8yvNJhufbyEXSGYIAIj0APq0mwEAA
j7UZWlU54b2iPlZlfkj00HRWg5wEbsy92eiKrZct/y7rmh26zrEnGFxZhulmUli+6XasWeTlWwUI
+cfarfz7Uit0zR5bIEL53VM9+ClkCTuAyyps37U6zcS0i432AHIA7HNTp/2+rnwxGp+3jTFucJ6n
I1/sKMuF0ZWYin4qkfIndfcZwAaYTZj92K/VkA/EJVP49ZE2MQWlsO5vwXBNRwzRkRGqrWAdus6x
JxhcWYbpZlJYvul2rFnk5VsFCPnH2q38+1IrdM0eWx2UrdgRce0AAAAAgTMHrg/VQwVfecZm2o3z
7dDhUqjCyNLqjLux7MjS8pRm7k4480rb6Zimdz7DpDNmWCA3SZuUnY4RMKA1Lfmh3761Zx2WagL8
BprHqrzLB4GGrfWOwf0o0PqBWkAtftwCtWU+Fl/LKgqVw8lF8A/b4PO31GHqaZW6JTLrd4/Coeu8
FULSPB0vUli+OwAAFwaCCQEJgI8ABwsBAAEjAwEBBV0AEAAADIDbCgHK2HkaAAA=
- ghisler(Author)
- Site Admin
- Posts: 50541
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Please try RC2 - I think I found the problem: it's the function strscan! It is very slow when used with big buffers, even if the character is only a few bytes from the start position! I'm now limiting the search range to 2000 characters.
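TC's actual strscan routine isn't public (it may be Delphi's StrScan or a custom one), so the following is only an illustrative sketch of the fix described above: a bounded scan that touches at most MaxLen characters per call, instead of potentially measuring or walking the whole remaining buffer.
Code: Select all
{ Hypothetical bounded scan: returns a pointer to the first occurrence of
  Chr within the first MaxLen characters of Str, or nil if it is not found
  or the terminating #0 is reached first. }
function StrScanLimited(Str: PChar; Chr: Char; MaxLen: Integer): PChar;
var
  I: Integer;
begin
  Result := nil;
  for I := 0 to MaxLen - 1 do
  begin
    if Str[I] = Chr then
    begin
      Result := Str + I;
      Exit;
    end;
    if Str[I] = #0 then
      Exit;   { end of buffer before the separator }
  end;
end;
If each of N sfv lines previously triggered work proportional to the whole buffer, the total was on the order of N times the buffer size, which fits the "twice the lines = four times the time" pattern reported above; a fixed 2000-character cap makes the total linear in the number of lines.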
Author of Total Commander
https://www.ghisler.com
It's much better, although there still might be something left.
1) Phase 1, dot-counting:
Huge improvement, it's now linear in TC32 too. But it still takes almost twice the time in TC32 (8 seconds) compared to TC64 (5 seconds with test80.sfv from the previous post). Do you see it too?
For real-world use with uncached data it probably won't be a problem; the process should be slowed down enough by the harddrive that the difference won't be noticeable. But it would be interesting to know where it comes from.
2) Phase 2, checksum verification:
The "update result list only every 100ms" mentioned in history.txt is a big improvement for cached data. What previously took 18 seconds now takes only about 2 in TC64 and 3 in TC32. Again, there is a difference for some reason. With more data (also completely cached) it's 11 seconds for TC64 and 16 seconds for TC32. But a large part of that is the actual checksum computing (in the case of crc32 it runs at the same speed in both TC32 and TC64), so the difference must come from something else done in the process. Maybe something similar to what slows down phase 1, whatever that might be. But it probably won't make much difference with uncached data either.
3) Future improvements, hashing algorithm speed:
As I pointed out earlier in this thread (this post), TC's hashing does not seem to be the fastest. E.g. for SHA-1, TC can only process 84/67 MB/s (32/64-bit) on my computer. On average that's enough for me, I have rather slow disks. But with something faster, TC would be slowing me down considerably. The same hardware can process 310 MB/s using system functions. Some clever solution for this would be nice.
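The post doesn't say which "system functions" reached 310 MB/s; the Windows CryptoAPI is one likely candidate, so the following Delphi sketch should be read as an assumption about the comparison, not as what was actually benchmarked:
Code: Select all
uses Windows, SysUtils;

const
  PROV_RSA_AES        = 24;          { provider type that offers SHA-1 }
  CRYPT_VERIFYCONTEXT = $F0000000;   { no key container needed }
  CALG_SHA1           = $00008004;
  HP_HASHVAL          = 2;

function CryptAcquireContext(var hProv: THandle; Container, Provider: PAnsiChar;
  ProvType, Flags: DWORD): BOOL; stdcall;
  external 'advapi32.dll' name 'CryptAcquireContextA';
function CryptCreateHash(hProv: THandle; AlgId: DWORD; hKey: THandle;
  Flags: DWORD; var hHash: THandle): BOOL; stdcall; external 'advapi32.dll';
function CryptHashData(hHash: THandle; Data: PByte; DataLen, Flags: DWORD): BOOL;
  stdcall; external 'advapi32.dll';
function CryptGetHashParam(hHash: THandle; Param: DWORD; Data: PByte;
  var DataLen: DWORD; Flags: DWORD): BOOL; stdcall; external 'advapi32.dll';
function CryptDestroyHash(hHash: THandle): BOOL; stdcall; external 'advapi32.dll';
function CryptReleaseContext(hProv: THandle; Flags: DWORD): BOOL; stdcall;
  external 'advapi32.dll';

{ Hashes one memory block; for a file, CryptHashData would be called once
  per read block inside the read loop. }
function Sha1OfBuffer(Buf: PByte; Len: DWORD): string;
var
  hProv, hHash: THandle;
  Digest: array[0..19] of Byte;   { SHA-1 digest is 20 bytes }
  DigLen: DWORD;
  I: Integer;
begin
  Result := '';
  if not CryptAcquireContext(hProv, nil, nil, PROV_RSA_AES, CRYPT_VERIFYCONTEXT) then
    Exit;
  try
    if CryptCreateHash(hProv, CALG_SHA1, 0, 0, hHash) then
    try
      DigLen := SizeOf(Digest);
      if CryptHashData(hHash, Buf, Len, 0) and
         CryptGetHashParam(hHash, HP_HASHVAL, PByte(@Digest), DigLen, 0) then
        for I := 0 to Integer(DigLen) - 1 do
          Result := Result + IntToHex(Digest[I], 2);
    finally
      CryptDestroyHash(hHash);
    end;
  finally
    CryptReleaseContext(hProv, 0);
  end;
end;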
Here is an updated table of my previous test (~50,000 source files, ~300 MiB):
The performance improvement looks very interesting. With many small files, TC x32 is now more than 50 times faster.
Code: Select all
TC version     "running dots" [s]   whole verification [s]
8.0 x32        435                  955
8.0 x64          9                  130
8.01 rc1 x32   420                  905
8.01 rc1 x64     7                  115
8.01 rc2 x32     3                   17
8.01 rc2 x64     3                   16

Windows 10 Pro x64, Windows 11 Pro x64
- ghisler(Author)
- Site Admin
- Posts: 50541
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Thanks for your tests! The main problem with the counting was the string search function, which took very long with huge CRC files. I never noticed it because I never used such a large crc file - I mostly tested with fewer, larger files rather than many small ones.
The other big problem was the scrolling to the END of the list each time a line was added. Now, with the update every 100 ms, it adds many lines at once (with refresh turned off) and then jumps to the end of all the lines. This way you don't see every single file scrolling by, but you can jump through the errors at the end. TC also no longer limits the length of the result list, even when there are more than 65535 files.
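A minimal sketch of this batching pattern, assuming a VCL TListBox and a hypothetical AddResultLine helper (TC's actual control and code are not public):
Code: Select all
{ Assumes: uses Windows, Classes, StdCtrls. Pending is created once at
  startup (Pending := TStringList.Create; LastFlush := GetTickCount);
  a final flush is still needed when the verification loop finishes. }
var
  Pending: TStringList;
  LastFlush: Cardinal;

procedure AddResultLine(List: TListBox; const Line: string);
begin
  Pending.Add(Line);
  if GetTickCount - LastFlush >= 100 then   { flush at most every 100 ms }
  begin
    List.Items.BeginUpdate;                 { suspend redraw while adding }
    try
      List.Items.AddStrings(Pending);       { add the whole batch at once }
      Pending.Clear;
    finally
      List.Items.EndUpdate;
    end;
    List.TopIndex := List.Items.Count - 1;  { jump to the end once, not per line }
    LastFlush := GetTickCount;
  end;
end;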
Author of Total Commander
https://www.ghisler.com
With the latest changes, something went wrong in 64-bit RC2:
Code: Select all
test1.sfv:
OK: 008.bin
009.bin
OK: 010.bin
Wrong CRC: 011.bin
012.bin
Errors: 2
OK: 3, not found: 0, read error: 0, wrong checksum: 2
Previous versions and 32-bit RC2 show it correctly:
Code: Select all
test1.sfv:
OK: 008.bin
OK: 009.bin
OK: 010.bin
Wrong CRC: 011.bin
Wrong CRC: 012.bin
Errors: 2
OK: 3, not found: 0, read error: 0, wrong checksum: 2
Test files: http://web.hisoftware.cz/sob/tmp/tc-crc-bug-files.7z
Hello,
I've got one tip to speed up operations with reference-counted variables (dynamic strings, dynamic arrays - not available in Delphi 2 - and records containing dynamic strings or dynamic arrays). It may not be the case here, but it may be useful in general.
Short version: we should always use the "const" keyword in the argument definition when declaring a procedure/function that takes reference-counted arguments which don't need to be changed. Example: we should declare procedure Test(const Str : string) instead of procedure Test(Str : string) when we don't need to modify Str.
Long version: consider the following example:
Code: Select all
procedure Test1(Str : string);
begin
  Length(Str);
end;

procedure Test2(const Str : string);
begin
  Length(Str);
end;

procedure TForm1.Button1Click(Sender: TObject);
var
  I : Integer;
  Time : Cardinal;
  Time1 : Cardinal;
  Time2 : Cardinal;
  TestStr : string;
begin
  TestStr:='abc';
  Time:=GetTickCount;
  for I:=0 to 10000000 do
    Test1(TestStr);
  Time1:=GetTickCount-Time;
  Time:=GetTickCount;
  for I:=0 to 10000000 do
    Test2(TestStr);
  Time2:=GetTickCount-Time;
  Caption:='Time1='+IntToStr(Time1)+'ms, Time2='+IntToStr(Time2)+'ms';
end;
The results on the test computer are 1050 ms for Test1 and 150 ms for Test2. Why such a big difference? Let's see Test2 and Test1 in assembler:
Code: Select all
Test2 proc near
call @System@@LStrLen
retn
Test2 endp
Code: Select all
Test1 proc near
var_AutogeneratedString= dword ptr -4
push ebp
mov ebp, esp
push ecx
mov [ebp+var_AutogeneratedString], eax
mov eax, [ebp+var_AutogeneratedString]
call @System@@LStrAddRef
xor eax, eax
autogenerated_try:
push ebp
push offset autogenerated_finally2
push dword ptr fs:[eax]
mov fs:[eax], esp
mov eax, [ebp+var_AutogeneratedString]
call @System@@LStrLen
xor eax, eax
pop edx
pop ecx
pop ecx
mov fs:[eax], edx
push offset EndProc
autogenerated_finally1:
lea eax, [ebp+var_AutogeneratedString]
call @System@@LStrClr
retn
; ---------------------------------------------------------------------------
autogenerated_finally2:
jmp @System@@HandleFinally
; ---------------------------------------------------------------------------
jmp short autogenerated_finally1
; ---------------------------------------------------------------------------
EndProc:
pop ecx
pop ebp
retn
Test1 endp
As we can see, the compiler generated an invisible, internal copy of the argument string in Test1. This string is assigned at the beginning (LStrAddRef), then there is an LStrLen (= Length) call on this copy, and the copy is freed at the end, inside the autogenerated, invisible finally section (LStrClr).
For Test2, we instructed the compiler that we are not going to modify the string (in effect, the compiler won't allow the string to be modified inside Test2), so the compiler calls LStrLen directly on the argument string, without creating any copies.
It is even more important to use the "const" keyword for reference-counted records:
Code: Select all
type
  TTestRec = record
    X1 : Byte;
    X2 : string;
  end;

procedure Test1(TestRec : TTestRec);
begin
  Length(TestRec.X2);
end;

procedure Test2(const TestRec : TTestRec);
begin
  Length(TestRec.X2);
end;
In this case, the compiler generates AddRefRecord instead of LStrAddRef and FinalizeRecord instead of LStrClr for Test1. Adding/clearing the reference for records is much slower than for strings, and our test takes 2030 ms on the test computer.
Regards!
I also confirm that Sob's bug is solved.
2MarcinW
I always thought that using 'const' in such situations was just good manners, making the code more obvious and less error-prone. A good compiler should be able to recognize that you are not changing a variable - especially in your simple test. It's interesting that this is not the case here.
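One likely reason the compiler cannot simply drop the copy: a by-value string parameter behaves like a mutable local, so the callee must own its own reference. A hedged illustration (this snippet is not from the thread):
Code: Select all
procedure Test1(Str : string);
begin
  { Legal: Str acts as a local variable here. The caller's string must not
    change, so the compiler takes a reference on entry, and copy-on-write
    performs the actual duplication at the moment of this assignment. }
  Str := Str + 'x';
end;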
Windows 10 Pro x64, Windows 11 Pro x64