Verify checksum: bad performance

Bug reports will be moved here when the described bug has been fixed

Moderators: white, Hacker, petermad, Stefan2

Thunderchild
Junior Member
Posts: 41
Joined: 2006-03-15, 21:50 UTC
Location: Ilmenau, Germany

Post by *Thunderchild »

Hello there

I found out about 8.0 and thought I'd give it a try.

Prologue
I am in the process of migrating my X-Plane installation from the internal hard disk to an external one. With the whole world scenery, that's around 74 GB in 68,000 files. To make sure everything is there and intact, I want to double-check using MD5 (the external HDD case's socket is a bit loose, and after countless hours of copying and defragmenting, there's potential for corrupt files).

Observations
In Linux, creating MD5 sums of the entire external drive partition took around 55 minutes (single-threaded). That's also roughly the time it takes to read all files (25 MB/s maximum over USB2). Because I encountered errors along the way (cable lost connection/defragmentation complication), I wanted to cross-check in Windows, because I use NTFS for that drive.

So I took the result of the Linux check, made its content Windows-compatible (replaced / with \, but kept Linux-style newlines) and fed it to Total Commander so that it could verify the sums. The dialogue came up and was filled with dots for around 1.5 hours (just one line; it gets filled, emptied and filled again). Eventually, it started listing all files with the OK message, at roughly 4 files per second. So I estimate it would have taken another 5 hours to complete (I aborted after half an hour or so).
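As a rough sanity check on the figures above, here is a back-of-the-envelope sketch. It assumes decimal units (1 GB = 1000 MB) and a constant rate, neither of which is stated in the post:

```python
# Back-of-the-envelope check of the numbers quoted above.
# Assumptions: decimal units (1 GB = 1000 MB) and constant throughput.

def read_time_minutes(total_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to read total_gb at rate_mb_per_s."""
    return total_gb * 1000 / rate_mb_per_s / 60


def listing_time_hours(file_count: int, files_per_s: float) -> float:
    """Hours needed to process file_count files at files_per_s."""
    return file_count / files_per_s / 3600


print(f"{read_time_minutes(74, 25):.1f} min")    # reading 74 GB at 25 MB/s
print(f"{listing_time_hours(68_000, 4):.1f} h")  # 68,000 files at 4 files/s
```

This prints 49.3 min and 4.7 h, which is in line with the 55 minutes measured on Linux and the estimate of roughly 5 hours for the listing phase.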

Bug?
So why is this process so slow? Is it reading and checking all files twice? If anything, I would have expected better performance, because NTFS is native to Windows.


Cheers

Edit: I'm now approaching it the other way around -- I create a new checksum file with TC and will then compare that file with the Linux result file. So far the speed is more what I expected (40% after about 20 minutes).
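Comparing two checksum files that differ only in path separators and line endings could look roughly like this. It is a minimal sketch, assuming md5sum-style lines ("digest", then a space, then a space or *, then the path); the file name in the usage example is made up, and the comparison is case-sensitive for paths, which NTFS is not by default:

```python
# Sketch: compare two md5sum-style checksum lists regardless of path separator.
# Assumption: each line looks like "<hex digest> <space or *><path>".

def parse_checksums(text: str) -> dict:
    """Map normalised path -> lowercase digest."""
    result = {}
    for line in text.splitlines():          # handles both \n and \r\n
        line = line.strip()
        if not line:
            continue
        digest, _, path = line.partition(" ")
        path = path.lstrip(" *").replace("\\", "/")
        if path.startswith("./"):           # prefix typically left by find
            path = path[2:]
        result[path] = digest.lower()
    return result


def compare(a: dict, b: dict) -> tuple:
    """Return (only in a, only in b, present in both but different)."""
    missing = sorted(a.keys() - b.keys())
    extra = sorted(b.keys() - a.keys())
    differing = sorted(p for p in a.keys() & b.keys() if a[p] != b[p])
    return missing, extra, differing


# Hypothetical one-line inputs, one per platform:
linux = "d41d8cd98f00b204e9800998ecf8427e  ./Custom Scenery/a.dsf\n"
windows = "D41D8CD98F00B204E9800998ECF8427E *Custom Scenery\\a.dsf\r\n"
print(compare(parse_checksums(linux), parse_checksums(windows)))  # ([], [], [])
```

Because only the two text files are read, this comparison does not touch the 74 GB of data again.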
Gruß | Greetings | Quapla'
"All I ask is a tall ship and a star to steer her by."
umbra
Power Member
Posts: 872
Joined: 2012-01-14, 20:41 UTC

Post by *umbra »

Verifying checksums is very slow indeed.

I just created a checksum for a folder full of source files. Roughly 50 000 files, 300 MiB. Creating an MD5 checksum took less than 5 seconds. However, checking those files took 15 minutes. Almost half of that time was consumed by the "running dots" part.

By observing TC through Process Explorer, I think I might know what's causing the slowdown (please correct me if I'm wrong).

#1
During the "running dots" part, TC loads the CRC file into memory, creates some memory structures for it, and then goes through all the listed files and calculates the space they occupy. With many files (Thunderchild said 68 000) and especially on slow drives (like Thunderchild's external hard disk) this may take a very long time.

#2
The second thing that slows the process significantly is the constant updating of the list of processed files. With relatively small files, updating that list can easily take longer than the checksum calculation itself. I'm talking about a normal HDD; now imagine an SSD. Updating that list at a fixed interval might speed things up considerably (similar to what has been done with progress bars in the current version).

edit:
ad #2: Is it really necessary to automatically show all processed files? Showing only the bad files and the currently processed one should be enough. There should be a separate button to show a complete log. That way things would be faster, the progress would be much easier to follow by eye, and, in the end, one could copy the entire log to the clipboard instead of only a single line, as now.
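The two ideas in this post (collect only the bad files, and refresh the display at a fixed interval) can be sketched as follows. This is not TC's code; verify_all, check and report are hypothetical names, and the clock is passed in only so the behaviour can be tested without waiting:

```python
import time


def verify_all(entries, check, report, interval=0.1, clock=time.monotonic):
    """Verify entries and return the names that failed.

    entries  -- list of file names (hypothetical input)
    check    -- callable(name) -> bool, True if the checksum matches
    report   -- callable(done, total, current_name), stands in for the UI update
    interval -- minimum time between two UI updates, in clock units
    """
    failed = []
    total = len(entries)
    last = None
    for i, name in enumerate(entries, start=1):
        if not check(name):
            failed.append(name)        # only bad files are collected
        now = clock()
        # Update on the first file, on the last file, and otherwise
        # at most once per interval.
        if last is None or now - last >= interval or i == total:
            report(i, total, name)
            last = now
    return failed
```

With many small files, report is called a handful of times per second instead of once per file, while the list of failures stays complete.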
Windows 7 Pro x64, Windows 10 Pro x64

Post by *Thunderchild »

Perhaps, in order to make it more coherent with other TC dialogues:
show a progress bar (as is done when creating an MD5 file) during verification that shows current file/number of files, and show the current file's name underneath. It would then look much like the Copy dialogue. A simple label could show the number of failed verifications.

Ironically, the total size of files is not shown during verification, if I remember correctly. So why calculate it in the first place? OTOH, the total file size is useful information if the files to be checked are huge in total, but vary greatly in size.

Post by *umbra »

Thunderchild wrote:So why calculate it in the first place?
I don't know if it's actually calculated. But I think so, because the progress shown in the dialog's title seems to be pretty accurate. Anyway, let's wait for the author's answer - who knows, maybe I'm completely wrong.
Sob
Power Member
Posts: 941
Joined: 2005-01-19, 17:33 UTC

Post by *Sob »

It's not just counting the files (or whatever that is) that is slow, but also the hash calculation itself.

Some time ago, as a small programming exercise, I wrote a little program (*1) that quickly compares two files by one or more hashes. It uses system hashing functions (*2) and compared to TC, it's surprisingly fast.

In a test (*3) with a ~4 GB file completely cached in memory (so there's no slowdown from disk reads), the results are as follows (times in seconds):

Code:

                 md5    sha-1  sha-256
TC 7.56a         24     42     63
TC 8.0 (32)      25     44     66
TC 8.0 (64)      25     55     69
ChkFiles (32)    10     12     40
ChkFiles (64)    10     12     30
I guess the system functions are well optimized for different CPUs and instruction sets, so it's probably not exactly a fair fight. On the other hand, if TC managed to be also this fast, I wouldn't say no to it. ;)

--
(*1) http://web.hisoftware.cz/sob/download/ChkFiles1.7z (binaries and source)
(*2) except for CRC32 which in turn is slower than in TC
(*3) by running "ChkFiles.exe /hash=<number> slackware-13.0-install-dvd.iso empty.txt" (number: 2=md5, 4=sha-1, 8=sha-256; empty.txt is, as the name suggests, an empty file) and for TC just by clicking an existing .md5/.sha file and watching the clock
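A comparable measurement with the data already in memory can be sketched like this. It is neither ChkFiles nor TC's implementation: it uses Python's hashlib (an assumption made for illustration), so absolute numbers are not comparable to the table above; only the method is the same (hash an in-memory buffer, so disk speed plays no role):

```python
import hashlib
import time


def time_hashes(data: bytes, algorithms=("md5", "sha1", "sha256"), chunk=1 << 20):
    """Return {algorithm: (seconds, hexdigest)} for hashing data in chunks."""
    results = {}
    view = memoryview(data)                 # slicing a memoryview does not copy
    for name in algorithms:
        h = hashlib.new(name)
        start = time.perf_counter()
        for offset in range(0, len(view), chunk):
            h.update(view[offset:offset + chunk])
        results[name] = (time.perf_counter() - start, h.hexdigest())
    return results


if __name__ == "__main__":
    payload = bytes(64 * 1024 * 1024)       # 64 MiB of zeros, already in memory
    for name, (seconds, digest) in time_hashes(payload).items():
        print(f"{name:8s} {seconds:6.2f} s  {digest[:16]}...")
```

Feeding the data in 1 MiB chunks mirrors how a file would normally be read and hashed piece by piece.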

Post by *umbra »

Sob wrote:number: 2=md5, 4=sha-1, 8=sha-256
I guess you meant "number: 4=md5, 8=sha-1, 16=sha-256". :) And yes, your tool really seems to be a few times faster than TC's method.
ghisler(Author)
Site Admin
Posts: 48232
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

Let me guess - you have a background virus scanner on Windows, but not on Linux?
Author of Total Commander
https://www.ghisler.com

Post by *umbra »

ghisler(Author) wrote:Let me guess - you have a background virus scanner on Windows, but not on Linux?
That may be true, but from what I've seen (see my first post), TC is not slow when it's creating checksums, only when it's checking them. In some cases, it is several orders of magnitude slower. And this is what the OP was complaining about.
MarcinW
Power Member
Posts: 852
Joined: 2012-01-23, 15:58 UTC
Location: Poland

Post by *MarcinW »

I suppose that part of the cause is the listbox itself. Listboxes are known to be very slow when adding many lines: each new line takes longer to add than the previous one.

Moreover, updating a progress bar or a window caption for each file may be the most time-consuming part of the task when processing many small files.

So I use the GetTickCount function and update the progress information at 100 ms intervals:

Code:

UpdateCounter:=0;

for I:=0 to FilesToProcess-1 do
begin

  ...

  if (GetTickCount-UpdateCounter >= 100) or (UpdateCounter = 0) or (I = FilesToProcess-1) or AbortedByUser then
  begin

    // update window caption, progress bar etc.

    UpdateCounter:=GetTickCount;
  end;
end;

To speed up the listbox, we can disable redrawing while adding lines and redraw only every few lines (every 32nd line in Test 2):

Code:

// Test 1: about 700ms on test computer

procedure TForm1.Button1Click(Sender: TObject);
var
  I : Integer;
  Time : Cardinal;
begin
  ListBox1.Clear;

  Time:=GetTickCount;

  for I:=0 to 10000 do
    ListBox1.Items.Add(IntToStr(I));

  Caption:='Time='+IntToStr(GetTickCount-Time)+'ms';
end;

Code:

// Test 2: about 320ms on test computer

procedure TForm1.Button2Click(Sender: TObject);
var
  I : Integer;
  Time : Cardinal;
begin
  ListBox1.Clear;

  Time:=GetTickCount;

  ListBox1.Perform(WM_SETREDRAW,0,0);
  try

    for I:=0 to 10000 do
    begin
      ListBox1.Items.Add(IntToStr(I));

      if I mod 32 = 0 then
      begin
        ListBox1.Perform(WM_SETREDRAW,1,0);
        RedrawWindow(ListBox1.Handle,nil,0,RDW_INVALIDATE or RDW_ALLCHILDREN or RDW_UPDATENOW);
        ListBox1.Perform(WM_SETREDRAW,0,0);
      end;
    end;

  finally
    ListBox1.Perform(WM_SETREDRAW,1,0);
    RedrawWindow(ListBox1.Handle,nil,0,RDW_INVALIDATE or RDW_ALLCHILDREN or RDW_UPDATENOW);
  end;

  Caption:='Time='+IntToStr(GetTickCount-Time)+'ms';
end;

Post by *ghisler(Author) »

I have already tried it, but it doesn't help much - it goes down from 48 to 34 seconds for a particular check.

The problem is that when calculating the checksums, I just call a FindFirstFile/FindNextFile loop, and then call the CRC function for each file. When verifying the checksums, I have to use the list in the .sfv/.md5/.sha file, so I have to call FindFirstFile for each individual file to find out its size.
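The difference between the two access patterns can be illustrated with a small sketch. It is an analogy only, not TC's code: Python's os.scandir stands in for the FindFirstFile/FindNextFile loop and os.stat for the per-file lookup, and both function names are made up for this example:

```python
import os


def sizes_by_enumeration(directory: str) -> dict:
    """One pass over the directory: name -> size for every regular file."""
    sizes = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
    return sizes


def sizes_by_lookup(directory: str, names) -> dict:
    """One separate lookup per listed name; missing files map to None."""
    sizes = {}
    for name in names:
        try:
            sizes[name] = os.stat(os.path.join(directory, name)).st_size
        except OSError:
            sizes[name] = None
    return sizes
```

One possible design, under the assumption that most listed files share a few directories, would be to enumerate each directory once and then look up the listed names in the resulting dictionary instead of querying the file system once per file.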

Post by *MarcinW »

I suppose that calling FindFirstFile/FindNextFile is not the main cause of the delay. Improving the performance of the listbox seems to be much more important.

However, the fastest way to obtain a file size is not to call FindFirstFile/FindNextFile, but to open the file in read-only mode (at least on FAT and NTFS filesystems). To obtain a 64-bit file size, I created a FileSize64 function (it uses the Int64 type, which is not available in Delphi 2, so treat it as an example):

Code:

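function FileSize64(Handle : THandle) : Int64;
begin
  // GetFileSize returns the low 32 bits and stores the high 32 bits via the pointer
  Int64Rec(Result).Lo:=GetFileSize(Handle,@Int64Rec(Result).Hi);
  // INVALID_FILE_SIZE can also be a valid low dword, so confirm the failure with GetLastError
  if (Int64Rec(Result).Lo = INVALID_FILE_SIZE) and (GetLastError <> ERROR_SUCCESS) then
    Result:=-1;
end;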

Code:

// Test 1: about FAT32:2440ms / NTFS:600ms on test computer

procedure TForm1.Button1Click(Sender: TObject);
var
  I : Integer;
  Time : Cardinal;
  Size : Int64;
  SearchRec : TSearchRec;
begin
  with TStringList.Create do
  try
    LoadFromFile('FILES.TXT'); // 2400 file names

    Time:=GetTickCount;

    for I:=0 to Count-1 do
    if FindFirst(Strings[I],faAnyFile,SearchRec) = 0 then
    begin
      Int64Rec(Size).Hi:=SearchRec.FindData.nFileSizeHigh;
      Int64Rec(Size).Lo:=SearchRec.FindData.nFileSizeLow;
      FindClose(SearchRec);
    end;

    Caption:='Time='+IntToStr(GetTickCount-Time)+'ms';
  finally
    Free;
  end;
end;

Code:

// Test 2: about FAT32:280ms / NTFS:330ms on test computer

procedure TForm1.Button2Click(Sender: TObject);
var
  I : Integer;
  Time : Cardinal;
  Size : Int64;
  F : File;
begin
  with TStringList.Create do
  try
    LoadFromFile('FILES.TXT'); // 2400 file names

    Time:=GetTickCount;

    for I:=0 to Count-1 do
    try
      AssignFile(F,Strings[I]);
      FileMode:=0; // IMPORTANT !!!
      Reset(F,1);
      try
        Size:=FileSize64(TFileRec(F).Handle);
      finally
        CloseFile(F);
      end;
    except
    end;

    Caption:='Time='+IntToStr(GetTickCount-Time)+'ms';
  finally
    Free;
  end;
end;

Code:

// Test 3: about FAT32:280ms / NTFS:330ms on test computer

procedure TForm1.Button3Click(Sender: TObject);
var
  I : Integer;
  Time : Cardinal;
  Size : Int64;
  H : THandle;
begin
  with TStringList.Create do
  try
    LoadFromFile('FILES.TXT'); // 2400 file names

    Time:=GetTickCount;

    for I:=0 to Count-1 do
    begin
      H:=CreateFileA(PAnsiChar(Strings[I]),0,0,nil,OPEN_EXISTING,0,0);
      if H <> INVALID_HANDLE_VALUE then
      try
        Size:=FileSize64(H);
      finally
        CloseHandle(H);
      end;
    end;

    Caption:='Time='+IntToStr(GetTickCount-Time)+'ms';
  finally
    Free;
  end;
end;

Post by *ghisler(Author) »

Opening the files just to get their size may be OK without a virus scanner, but not otherwise: the file will be scanned when it is opened, but not when just using FindFirstFile.

Post by *MarcinW »

Hmm, yes... However, in Test 3 CreateFile is called with dwDesiredAccess = 0 (query access only). In this case virus scanners should not slow down the open operation.

Post by *ghisler(Author) »

Interesting, I didn't think that the makers of virus scanners would be that clever. I will change it, and check the reactions...

Post by *umbra »

This topic is now in "TC bugs which should be fixed now", so I ran a few tests with TC8.01 rc1.

First, there is practically no difference between TC8.0 and TC8.01 rc1. The new version seems to be a bit faster, but the difference is so small, it might be a measurement error.

Second, and much more interesting: verification of many files (see my first post) is significantly faster in the 64-bit version than in the 32-bit one. This applies to both versions, stable and rc. I don't know why I didn't notice this before. In the mentioned example, the initial part ("running dots") takes only 10 seconds and the whole verification is done in 2.5 minutes, compared to 7 and 15 minutes respectively with the 32-bit version.