Suggestion for improved checksum file handling

Here you can propose new features, make suggestions etc.


MichaelK
Junior Member
Posts: 23
Joined: 2004-07-08, 07:25 UTC
Location: Stromberg, Rlp


Post by *MichaelK »

I have used TC for many years, but only after installing a new NAS system a few weeks ago did I discover how valuable the checksum feature is. I set up a RAID5 system with 4 x 10 TB disks, which gives me a total capacity of 30 TB. About half of this space is already filled with multimedia files, i.e. videos, audio and images. Even though I enjoy the RAID5 protection and the snapshot feature of the Btrfs file system, a simple mistake or a nasty virus might delete or encrypt my files unnoticed. To catch such mishaps, I think it is a good idea to verify the checksums at regular intervals. But generating or verifying checksums for dozens of terabytes can take many days. That is acceptable for the initial generation of the checksums and for the occasional full verification, but not for an update covering just a few files.

Therefore I suggest the following features:

1.) Implement a feature to automatically update a checksum file. Such an update would calculate checksums only for new files, i.e. those that were not present at the last run. This would be much faster than recalculating checksums for all files.

2.) As I occasionally delete files from the archive or move them elsewhere, it would be useful if such lost files were optionally removed from the checksum file automatically.

3.) It would be handy if the checksum file were sorted by filename. I found that the order of files may differ between folders with the same content on different volumes. Sorted lists would make comparing two volumes much easier. But that is really just nice to have, since it can easily be achieved with a good text editor.

4.) Once or twice during checksum generation, a network error occurred. Unfortunately, the checksum generation was terminated without a chance to press a "try again" button. That is rather annoying when it happens after 30 hours of work. It would be far less of a problem if the proposed "update checksum file" feature (item 1 above) were available.
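To illustrate, suggestions 1 to 3 could be combined into one update pass. The following is a minimal Python sketch, not TC functionality; it assumes the common checksum file line format `<hex hash> *<filename>`, and the names `sha1_of` and `update_sha` are made up for illustration:

```python
import hashlib
from pathlib import Path

def sha1_of(path: Path) -> str:
    """Compute the SHA-1 hash of a file, reading it in chunks."""
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def update_sha(folder: Path, sha_file: Path, prune: bool = False) -> None:
    """Hash only files missing from sha_file (suggestion 1), optionally
    drop entries whose file is gone (suggestion 2), and write the
    result sorted by filename (suggestion 3)."""
    entries = {}
    if sha_file.exists():
        for line in sha_file.read_text(encoding="utf-8").splitlines():
            if line.strip():
                digest, name = line.split(" *", 1)
                entries[name] = digest
    present = {p.name for p in folder.iterdir()
               if p.is_file() and p != sha_file}
    for name in present - entries.keys():   # new files only
        entries[name] = sha1_of(folder / name)
    if prune:                               # lost files
        entries = {n: d for n, d in entries.items() if n in present}
    lines = [f"{d} *{n}" for n, d in sorted(entries.items())]
    sha_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Because only files absent from the list are hashed, a run after adding a few files touches only those files, which is the point of suggestion 1.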

Thank you for considering these suggestions.
ghisler(Author)
Site Admin
Posts: 48021
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

You might consider disabling SMB on your NAS and using WebDAV or FTP for backups. Crypto trojans also encrypt files on devices accessible via SMB, but they would need to know the password for WebDAV or FTP. I'm using curl to upload backups to a NAS automatically, and the Total Commander WebDAV plugin for manual access.
Author of Total Commander
https://www.ghisler.com
Dalai
Power Member
Posts: 9364
Joined: 2005-01-28, 22:17 UTC
Location: Meiningen (Südthüringen)

Post by *Dalai »

Yeah, well, a RAID is NOT a backup, and it's not a replacement for one. To avoid losing files - regardless of the cause (e.g. ransomware or user error) - files need to exist on at least two different drives/media.

Regarding suggestions 1 and 2: How do you suggest TC determines the checksum files? Which file name should it use? A fixed one or a (random) user-specified one? What about file name collisions? And what about the hash algorithm? Do you expect TC to analyze the existing checksum files before generating new hashes?

Regarding suggestion 3: Yes, this annoyed me as well, so I wrote a little tool for myself that generates hashes automatically, sorting the file list in the process. This yields stable checksum files that can be compared with TC's compare-by-contents.
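Once the checksum files are sorted, comparing two volumes reduces to comparing two filename-to-hash maps. A rough Python sketch (the helper names `parse_sha` and `compare_sha` are invented here, and the `<hash> *<filename>` line format is assumed):

```python
def parse_sha(text: str) -> dict:
    """Parse '<hash> *<filename>' lines into a filename -> hash map."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            digest, name = line.split(" *", 1)
            entries[name] = digest
    return entries

def compare_sha(text_a: str, text_b: str):
    """Return (only_in_a, only_in_b, mismatched) between two
    checksum files, e.g. one from the NAS and one from a backup disk."""
    a, b = parse_sha(text_a), parse_sha(text_b)
    only_a = sorted(a.keys() - b.keys())
    only_b = sorted(b.keys() - a.keys())
    differ = sorted(n for n in a.keys() & b.keys() if a[n] != b[n])
    return only_a, only_b, differ
```

This kind of comparison works regardless of the order in which the two volumes were hashed, which is why a sorted (or parsed) list makes cross-volume checks so much easier.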

Regards
Dalai
#101164 Personal licence
Ryzen 5 2600, 16 GiB RAM, ASUS Prime X370-A, Win7 x64

Plugins: Services2, Startups, CertificateInfo, SignatureInfo, LineBreakInfo - Download-Mirror
MichaelK
Junior Member
Posts: 23
Joined: 2004-07-08, 07:25 UTC
Location: Stromberg, Rlp

Post by *MichaelK »

Thank you for your comments. To protect against trojans, it is probably easier to write-protect the folder and remove the protection only when updating the files. That is what I did with my previous NAS, which did not yet have RAID and Btrfs. Like disabling SMB and using FTP, however, that does not protect against my own mistakes, and it would not make use of the extremely useful snapshot feature.

As Dalai says, RAID is not a backup. Therefore I keep copies of the most important folders on separate hard disks, which I plug into a docking station every once in a while and synchronize with the NAS folders. I generate checksums on these drives as well and compare the checksum files with those on the NAS. That is why I ask for sorted checksum files. But again, I can easily sort the files with UltraEdit, which is why I call sorting "nice to have".

I'm not sure what you mean by the "name of the checksum file". Unless I tick any boxes, it has the same name as the folder I check, with the extension .sha. So if I generate checksums for all files in the folder "video", the checksum file is called video.sha. There is no file name collision, because any collision would already occur in the folder itself. I would not expect TC to analyze the existing checksums when updating this file. That would be a separate call to the "verify" function, which I would probably run only twice a year, because it takes maybe two days.

So again, only the first item is really important for me, as I have no simple workaround for it. Removing files from the list is possible with a text editor, and they are skipped by the verify function anyway. Sorting is nice to have, and the last item is not so important if the checksum file update works, because one can then easily restart the checksum process after a network error.

I hope that clarifies what I have in mind. Thanks again for your kind support.

Regards,

Michael
Dalai
Power Member
Posts: 9364
Joined: 2005-01-28, 22:17 UTC
Location: Meiningen (Südthüringen)

Post by *Dalai »

MichaelK wrote:I'm not sure what you mean by the "name of the checksum file". Unless I tick any boxes, it has the same name as the folder I check, with the extension .sha. So if I generate checksums for all files in the folder "video", the checksum file is called video.sha.
Correct so far. But now your suggestion comes into play, which would make TC automatically use the same name (video.sha) when generating hashes for the new files. How should TC behave when this file already exists (for whatever reason)? Use the file anyway? That is potentially "dangerous" when this file is essential to the data saved in this directory. OK, one could make such a feature optional, but such collisions could still happen.

Unless I missed something, I see many more variables in this scenario than the suggestion implies.

BTW: Did you know about the Verify feature/checkbox in TC's copy dialog? Sure, it doesn't generate checksum files for later verification, but at least it rules out errors during the transfer.

Regards
Dalai
MichaelK
Junior Member
Posts: 23
Joined: 2004-07-08, 07:25 UTC
Location: Stromberg, Rlp

Post by *MichaelK »

I feel we are misunderstanding each other. What I propose is that TC scans the files in the folder "video" and reads the file video.sha. If it finds files in the folder that are not listed in video.sha, it should generate their checksums and append them to video.sha. Optionally it may remove lines from video.sha when the corresponding file is no longer found in the folder, but that is nice to have.

Maybe your point is what happens to changed files. The simple answer is: nothing, they are ignored by checksum generation. I have a big archive of audio, video and images - a garbage dump, if you like. I only add files (once or twice a week) or remove them (rarely). An independent verification run, say twice a year, would reveal that a file was changed, and then it is up to the user to decide what to do. That is exactly the purpose of this procedure: a verification run shall flag any changed files. If one had many files whose content changes regularly, a checksum would not be useful anyway.
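The verification run described here could be sketched like this - a hypothetical Python helper, not TC code, again assuming the `<hash> *<filename>` line format and SHA-1:

```python
import hashlib
from pathlib import Path

def verify_sha(folder: Path, sha_file: Path):
    """Re-hash every file listed in sha_file and flag changed or
    missing files; files not listed are ignored, as described above."""
    changed, missing = [], []
    for line in sha_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        digest, name = line.split(" *", 1)
        path = folder / name
        if not path.exists():
            missing.append(name)
            continue
        if hashlib.sha1(path.read_bytes()).hexdigest() != digest:
            changed.append(name)
    return changed, missing
```

The decision about what to do with a changed or missing file is deliberately left to the user, exactly as in the proposal.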

To summarize my proposal: it is meant for consistency checks of archives.