Horst.Epp wrote: 2023-12-24, 21:46 UTC
fcorbelli wrote: 2023-12-24, 18:51 UTC
Windows Defender is... evil... aehm...
Code: Select all
C:\zpaqfranz>sha256deep64 zpaqfranz.exe
ce8ab930d3778ad4bb677ba2077a263f21f241347163a7f0fef1379a4d0c2f22 C:\zpaqfranz\zpaqfranz.exe
I like Defender; it gives fewer problems than other antivirus tools.
For many years I was responsible for the antivirus solutions of some large companies.
The checksum of the zpaqfranz.exe is OK.
I don't trust antivirus software very much, preferring to focus on other anti-virus and anti-ransomware mechanisms.
The various antiviruses are the very first thing I disable after a Windows installation, along with UAC, non-admin execution, etc.
Yep, I am VERY old school
I am trying to understand: what is the benefit of storing the list in ADS streams?
Speed
zpaq stores the file list inside i-blocks, with added and removed files (removed files have date==0)
You can see this (on an unencrypted, single-part zpaq) with
Code: Select all
zpaqfranz dump thefile.zpaq
zpaqfranz dump thefile.zpaq -verbose
zpaq's journaled archives are stored in blocks (aka: chunks), one after the other.
Here is an example:
https://encode.su/threads/456-zpaq-updates?p=81361&viewfull=1#post81361
During an l (list), every block is read, decoded and (sometimes) skipped over (aka: a seek past the data blocks holding the "real" compressed data), for every version.
Therefore, if you have a lot of blocks and a lot of versions, and worse still on a spinning drive, listing the files requires a lot of work.
Sometimes it even takes minutes to list a .zpaq
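To make the cost concrete, here is a toy model of the access pattern just described. It is NOT zpaq's real on-disk format (the block names and counts are my own illustration): each version appends one i-block (file-list metadata) plus many d-blocks (compressed data), and a standard list must touch every block of every version.

```python
# Toy model of why a standard list is slow (illustrative only,
# not zpaq's real format).
def make_archive(versions, dblocks_per_version):
    archive = []
    for v in range(versions):
        archive.append(("i", v))                               # metadata block
        archive.extend(("d", v) for _ in range(dblocks_per_version))
    return archive

def standard_list(archive):
    decoded = seeks = 0
    for kind, _ in archive:
        if kind == "i":
            decoded += 1   # i-blocks must be decoded to rebuild the file list
        else:
            seeks += 1     # d-blocks still cost a header read + a seek
    return decoded, seeks

decoded, seeks = standard_list(make_archive(100, 500))
# 100 versions => 100 i-blocks decoded, 50,000 d-blocks seeked past
```

The work grows with blocks times versions, which is exactly why a many-version archive on a spinning drive can take minutes to list.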
https://encode.su/threads/4168-Virus-like-data-storage-(!)
The -ads switch will
(1) fill up the dt map (aka: the already-present files) during an a (add)
(2) do the add as always
(3) seek to the newest i-blocks inside the archive
(4) decode the new dt map (up to the last version) [side note: why? to test here the future, next-to-be-released hash-based DTMap]
(5) compress one line at a time with LZ4 (yes, LZ4)
(6) store the result inside a zpaqlist ADS
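Steps (5)-(6) can be sketched roughly like this. The record layout (a 4-byte little-endian length prefix followed by one compressed line) is my assumption, not zpaqfranz's actual format; zpaqfranz uses LZ4 and writes into an NTFS ADS, while here stdlib zlib and an in-memory buffer stand in for both.

```python
import io
import struct
import zlib

# Hypothetical per-line record: 4-byte length + compressed line.
# zlib stands in for LZ4; BytesIO stands in for the NTFS ADS.
def write_line_stream(lines, out):
    for line in lines:
        blob = zlib.compress(line.encode("utf-8"))
        out.write(struct.pack("<I", len(blob)))
        out.write(blob)

buf = io.BytesIO()
write_line_stream(["z:/biz/a.txt", "z:/biz/b.txt"], buf)
```

Compressing one line at a time (instead of the whole list at once) is what later allows the reader to stream the list back without holding it all in RAM.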
When you ask to list the content of the zpaq, if an ADS is present, the file will be LZ4-decompressed (one line at a time), then printed.
LZ4 means "very, very fast, even on a single-threaded, older CPU", and "line by line" means "do not create a giant in-RAM vector of the file list" => you can list a ridiculously big zpaq even on an older laptop.
=> You get about the same data as a regular list, but way, way faster.
Sometimes even 100x faster (for huge archives).
Decompressing an LZ4 stream takes about 1-2 seconds (of course depending on file size and computer speed), so the list is almost immediate for (about) any file-list size, because only the very last version is stored (on multipart archives, one per part).
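The "line by line, no giant in-RAM vector" reader can be sketched as a generator. Same hypothetical record layout as a stand-in for the real LZ4 stream (4-byte length prefix + one compressed line, zlib replacing LZ4): memory stays flat no matter how many files the archive holds.

```python
import io
import struct
import zlib

# Build a stand-in compressed stream (zlib replaces LZ4,
# record layout is my own illustration).
def compress_lines(lines):
    buf = io.BytesIO()
    for line in lines:
        blob = zlib.compress(line.encode("utf-8"))
        buf.write(struct.pack("<I", len(blob)))
        buf.write(blob)
    return buf.getvalue()

def iter_lines(stream):
    """Yield one decompressed line at a time: constant memory,
    regardless of how many entries the file list holds."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (n,) = struct.unpack("<I", header)
        yield zlib.decompress(stream.read(n)).decode("utf-8")

data = compress_lines("file%06d.txt" % i for i in range(10000))
count = sum(1 for _ in iter_lines(io.BytesIO(data)))   # 10000 lines streamed
```

Each line can be printed as soon as it is decoded, so even an old laptop never needs to materialize the whole list.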
You can do a quick test downloading this
http://www.francocorbelli.it/zpaqfranz/zbiz.zpaq
Suppose it is saved as h:\zbiz.zpaq (on an NTFS drive); then run (of course, with a tail somewhere in the path)
Code: Select all
zpaqfranz l h:\zbiz.zpaq -out normal.txt |tail
This will create a normal.txt with the file list, using the "standard" zpaq read_archive (aka: recomputing everything from scratch).
Now we create the ADS.
Then we list again, this time with LZ4.
OK, now suppose you add something to the archive (as always), just one file (or whatever you like):
Code: Select all
zpaqfranz a h:\zbiz.zpaq normal.txt -ads
Now, if you list "the standard way", you get a full decoding every time => very slow.
With the newer ADS support, the list comes back in no time.
It is not yet the final version, but I hope it makes the usefulness clear (-all does not work, etc.; it is just a nightly build).
Obviously, if you work with tiny archives the difference is negligible, but I need to handle large amounts of data, and zpaqfranz is just my 'toolbox'.
Final note: size matters. Even just creating a file list of a few million files can require multiple gigabytes of RAM. Few things are worse than not even being able to see the contents of an archive because you are using an emergency laptop in the field, very different from the super-powerful office PC with 128GB, or even a 768GB Xeon server. The behaviour of zpaq (and zpaqfranz) is to read each block, create a map in memory, and then display it.
As you can see, this behaviour (which still persists, and is why I am rewriting read_archive as read_archive2) allocates large amounts of memory.
In the future, at least in my intentions, I will arrive at a list function that works per row, and not per map, because "this" is very "bad" (=> a RAM eater).
Basically, it overwrites the data of each individual filename across the various versions added.
This ensures that, when finished, the data is the most up to date.
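The overwrite-by-replay behaviour can be sketched like this (the entry fields and names are illustrative, not zpaq's actual structures): each version's record for a filename overwrites the previous one, and date == 0 marks a removal.

```python
# Hedged sketch of the map-based replay: later versions overwrite
# earlier ones; date == 0 means the file was removed.
def replay(versions):
    dt = {}
    for version in versions:
        for fn, date in version:
            dt[fn] = date
    return {fn: d for fn, d in dt.items() if d != 0}   # still-present files

history = [
    [("a.txt", 100), ("b.txt", 100)],   # version 1: both files added
    [("a.txt", 200)],                   # version 2: a.txt updated
    [("b.txt", 0)],                     # version 3: b.txt removed
]
final = replay(history)   # only a.txt survives, with the newest date
```

This is exactly why the final map holds the most up-to-date state, and also why it must hold one entry per filename in RAM.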
However, fn is really the complete filename with path, which can easily be 100 or 200 bytes long (e.g. fn=z:/biz/biz01/01/mingw32/include/c++/13.2.0/ext/pb_ds/detail/rc_binomial_heap_/rc_binomial_heap_.hpp), whereas I was thinking of keeping just 8 bytes (the 64-bit XXHASH code). Faster and frugal (but prone to hash collisions).
=>It still needs a lot of work.
ADS is therefore the first step: fast AND frugal in memory once the stream exists (i.e. when the server has created it during the backup update), but NOT very frugal in the CREATION of the file list.
One last note: these are OT considerations with respect to the zpaq plugin; if they are not interesting, the moderators can always delete them.
On the other hand, they also apply to zpaq 707 and 715
zpaqfranz's author here. Yes, my name is Franco Corbelli => hence "zpaqfranz" :)