Incorrect unpacking of GZIP files

DeFlar · Post by *DeFlar » 2017-12-14, 11:26 UTC

Found in TC 8.51-x64, then tried in 9.12-x64, but the behaviour did not change. Windows 10 Pro. (Unpacker is set to the default pkunzip.exe)

Create a GZ archive of a .txt file containing one UTF-8 string (e.g. "foo\n") with a BOM.
So the content is: {BOM}foo{CRLF}
Then append the same string to the archive.
Now the content is: {BOM}foo{CRLF}{BOM}foo{CRLF} (a BOM is inserted every time content is added to the archive),
but Total Commander extracts the file with the content {BOM}foo{CRLF}, i.e. it ignores the last append with the same content.
7-Zip extracts the file correctly, with the content {BOM}foo{CRLF}{BOM}foo{CRLF}

If I append again, but with a different string (e.g. "bar\n"), Total Commander extracts the file correctly with all content: {BOM}foo{CRLF}{BOM}foo{CRLF}{BOM}bar{CRLF}. Appending that last string to the archive once more produces the same problem again.

gdpr deleted 6 · Post by *gdpr deleted 6 » 2017-12-14, 18:20 UTC

UPDATE: With clarification from DeFlar, I am now able to confirm the behavior! Additionally, there is an issue (limitation?) with displaying the uncompressed file size of Gzip archives that is related to the problem reported by DeFlar.

C̶a̶n̶n̶o̶t̶ ̶c̶o̶n̶f̶i̶r̶m̶ ̶w̶i̶t̶h̶ ̶T̶C̶6̶4̶ ̶9̶.̶1̶2̶ ̶o̶n̶ ̶W̶i̶n̶ ̶7̶ ̶P̶r̶o̶.̶

However, DeFlar, your description is somewhat ambiguous.
Please clarify: What precisely do you mean by "append the same string to the archive" and "bom is inserted every time adding content to archive"? Those statements do not make much sense with regard to Gzip archives.

Note that a Gzip archive contains only one single file. You cannot add anything else to a Gzip archive. So, what exactly is the file you packed into a Gzip archive and which caused the problem?

Perhaps add the problematic Gzip file here to the bug. You could Base64-encode the Gzip file with TC (main menu -> Files -> Encode File (MIME, UUE, XXE)...), and then paste the resulting Base64 text blob here in this thread...

FYI, this is what I did to try to reproduce the bug.

1. Created a UTF-8 BOM text file t.txt with the following content: Foo\r\n

2. Ran copy /b t.txt + t.txt twice.txt to create a text file matching the bug description ({BOM}Foo{CRLF}{BOM}Foo{CRLF})

3. Packed twice.txt into a gz archive (with TC).

4. With a hex viewer, I checked the archive for correctness. Result: Archive is good.

5. Checked TC's behavior. Result: Could not reproduce problematic behavior as described by DeFlar.

DeFlar · Post by *DeFlar » 2017-12-15, 13:19 UTC

Hi, thanks for your reply!

Here is the Base64 of the file: H4sIAEOzMloA/3u/e39afj4vFwCoLPuVCAAAAB+LCABxszJaAP97v3t/Wn4+LxcAqCz7lQgAAAA=

The content of the file is: {bom}foo\r\n{bom}foo\r\n
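For anyone who wants to check the blob without TC, here is a minimal Python sketch (Python is used only for illustration; neither poster used it). It decodes the Base64 text from this post and decompresses it with the standard library's gzip.decompress, which reads every member in the buffer and concatenates the results:

```python
import base64
import gzip

# Base64 text copied from the post above.
BLOB = "H4sIAEOzMloA/3u/e39afj4vFwCoLPuVCAAAAB+LCABxszJaAP97v3t/Wn4+LxcAqCz7lQgAAAA="

raw = base64.b64decode(BLOB)

# gzip.decompress() handles concatenated members and checks each CRC32.
text = gzip.decompress(raw)

print(len(raw))  # 56
print(text)      # b'\xef\xbb\xbffoo\r\n\xef\xbb\xbffoo\r\n'
```

The output contains the BOM and "foo\r\n" twice, which matches what 7-Zip and GZipStream are reported to produce.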

Now I will try to explain in more detail.
(Sorry, I am not good at English and I don't know much about how GZIP should work. I am a C# programmer, and I ran into the problem while examining how GZipStream in .NET works.)

1. I created a file named "file.txt.gz" with the content "foo\r\n" using GZipStream in C#.
2. Then I modified "file.txt.gz" without unpacking it: I just appended new content, which is the same string "foo\r\n". In C# you can easily do this by opening the stream for append. Maybe this is the wrong operation for a Gzip archive, but nothing forbids it.

Other archivers can uncompress such archives properly, and C# GZipStream can also uncompress such a file normally, so I concluded that TC does something wrong.
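The append scenario can be reproduced outside .NET. The following Python sketch is an illustration only, not DeFlar's C# code: the helper append_line and the temporary file are hypothetical, and the "utf-8-sig" codec stands in for whatever writer added the BOM. Opening a .gz file in append mode starts a new gzip member at the end of the file:

```python
import gzip
import os
import tempfile

def append_line(path, line):
    """Hypothetical helper: add `line` to the archive as a new gzip member."""
    # "ab" appends a fresh member (header, data, trailer) to the file.
    # "utf-8-sig" prepends a BOM to every encoded chunk.
    with gzip.open(path, "ab") as gz:
        gz.write(line.encode("utf-8-sig"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt.gz")
    append_line(path, "foo\r\n")
    append_line(path, "foo\r\n")

    # Reading back through gzip.open() walks all members in order.
    with gzip.open(path, "rb") as gz:
        content = gz.read()

print(content)  # b'\xef\xbb\xbffoo\r\n\xef\xbb\xbffoo\r\n'
```

Each call encodes its own chunk, so each member carries its own BOM; the gzip format itself adds nothing between members.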

PS. It is weird to me that a BOM is inserted every time content is appended to the gzip file. I think this is probably done to separate the compressed parts of the archive.

PPS. Please tell me whether the file I provided as a Base64 string can be unpacked normally!

gdpr deleted 6 · Post by *gdpr deleted 6 » 2017-12-15, 16:01 UTC

I can now confirm the problem.

Not only is the extraction result wrong, TC will also show a wrong size for the file in the Gzip archive when inspecting the archive provided by DeFlar. See the end of my post about this issue with displaying file sizes. (Perhaps it is fair to not call it a bug but rather a limitation of TC?)

This is the hex view of the Gzip archive given by DeFlar in his last post:

00000000 1F 8B 08 00 43 B3 32 5A 00 FF 7B BF 7B 7F 5A 7E
00000010 3E 2F 17 00 A8 2C FB 95 08 00 00 00 1F 8B 08 00
00000020 71 B3 32 5A 00 FF 7B BF 7B 7F 5A 7E 3E 2F 17 00
00000030 A8 2C FB 95 08 00 00 00

Note that this Gzip archive contains two "members" (compressed data sets; each starts with the magic bytes 1F 8B, at offsets 0x00 and 0x1C in the hex view above). This is a feature of the Gzip file format, as clearly stated by the Gzip file format specification (http://www.zlib.org/rfc-gzip.html#scope):

A gzip file consists of a series of "members" (compressed data sets). The format of each member is specified in the following section. The members simply appear one after another in the file, with no additional information before, between, or after them.

Unless the "members" provide an optional FNAME field, the decompressed data of each member should be concatenated (similar to the effect of successively outputting all the decompressed data on stdout). Since the Gzip file format specification does not spell this out explicitly, I took the liberty of using the behavior of the Gzip utility (v 1.3.12, Windows port) as the reference.

However, TC ignores the 2nd (last?) "member" and only decompresses the first one. As this behavior differs from the GNU Gzip utility, I would consider it a bug or limitation in TC. Funnily enough, both GNU Gzip and TC fail to show the correct uncompressed file size (both show only the ISIZE field of the last "member" at the very end of the Gzip archive).
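The size mismatch is easy to see on DeFlar's sample. This short Python sketch (an illustration, not how TC or GNU Gzip are implemented) reads the little-endian ISIZE value from the last four bytes of the archive and compares it with the length of the fully decompressed data:

```python
import base64
import gzip
import struct

# DeFlar's two-member archive, as posted earlier in the thread.
BLOB = "H4sIAEOzMloA/3u/e39afj4vFwCoLPuVCAAAAB+LCABxszJaAP97v3t/Wn4+LxcAqCz7lQgAAAA="
raw = base64.b64decode(BLOB)

# ISIZE is the final 32-bit little-endian field of a member, so the last
# four bytes of the file only describe the last member.
(trailing_isize,) = struct.unpack("<I", raw[-4:])

# The real uncompressed size requires decompressing every member.
actual_size = len(gzip.decompress(raw))

print(trailing_isize, actual_size)  # 8 16
```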

(Side note: Each member in a Gzip file can provide an optional FNAME field -- which would imply that a Gzip archive could indeed store multiple files. Although www.gzip.org mentions that this can't be done "directly" with Gzip. Man, that file format is really poorly specified... almost like CSV.)

Now to the issue regarding the display of file sizes I alluded to at the beginning of my post.

Remember that TC can show the content of an archive. That includes showing the file size of a file in an archive.

However, with Gzip archives, determining and showing the correct uncompressed size of a file in a Gzip archive can become a very costly endeavor if said Gzip file is large-ish and made of multiple "members" (compressed data sets).

Since a compressed data set in a Gzip archive does not specify its own compressed length, it would be necessary to "crawl" through all but the last data set (i.e., decompressing it on the fly) to find the next compressed data set. Even determining whether a Gzip archive has more than one compressed data set would require "crawling" through the first compressed data set.
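To make the cost concrete, here is a rough Python sketch of such a "crawl". The function name member_sizes and the chunk size are made up for this example, and it is not a description of how TC works. It streams the file through zlib, discards the output, and uses the decompressor's unused_data to find where the next member begins:

```python
import gzip
import io
import zlib

def member_sizes(stream, chunk_size=64 * 1024):
    """Yield the uncompressed size of each gzip member in a binary stream."""
    decomp = zlib.decompressobj(wbits=31)  # 16 + 15: expect gzip framing
    size = 0
    buf = stream.read(chunk_size)
    while buf:
        # Only the length is kept; the decompressed bytes are thrown away.
        size += len(decomp.decompress(buf))
        if decomp.eof:
            yield size
            # Bytes after this member's trailer belong to the next member.
            buf = decomp.unused_data
            decomp = zlib.decompressobj(wbits=31)
            size = 0
            if not buf:
                buf = stream.read(chunk_size)
        else:
            buf = stream.read(chunk_size)

# Two members glued together, as in an appended archive.
data = gzip.compress(b"foo\r\n") + gzip.compress(b"bar\r\n")
print(list(member_sizes(io.BytesIO(data))))  # [5, 5]
```

Even in this form every compressed byte has to be read and inflated, which is exactly the cost described above. The sketch ignores a truncated final member and will raise zlib.error on trailing garbage.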

Now, imagine having a Gzip file of maybe 200MB, and imagine TC having to read (and decompress in memory) most of this 200MB file just to determine the actual size of the uncompressed file. This would be terrible. No idea how this could be resolved in an elegant manner. (Perhaps not showing file sizes - or "unknown" file size - for files in Gzip archives, if that would be possible, or perhaps leaving it as it is would be more desirable...???)

gdpr deleted 6 · Post by *gdpr deleted 6 » 2017-12-15, 16:06 UTC

DeFlar wrote: PS. It is weird to me that a BOM is inserted every time content is appended to the gzip file. I think this is probably done to separate the compressed parts of the archive.

That is not related to Gzip or the TC bug you discovered here, but rather the way you created (appended) the second compressed data set. In other words, you (or the program you used) provided the second text block together with the UTF-8 BOM to the Gzip compressor.

DeFlar wrote: PPS. Please tell me whether the file I provided as a Base64 string can be unpacked normally!

No problem. I could decode your Base64 blob just fine.

Post by *ghisler(Author) » 2017-12-18, 15:45 UTC

That's strange, TC has supported multi-stream GZ files for many years. I guess that it doesn't correctly recognize the header in your specific case. I will check it in the debugger.

Post by *ghisler(Author) » 2018-05-04, 07:36 UTC

This should be fixed in TC 9.20 beta 1, please test it!

It happens when the 2 parts have the exact same content, and therefore the same checksum.
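This can be checked against DeFlar's sample archive. The Python sketch below assumes the two members are the two 28-byte halves shown in the hex view and reads the CRC32 and ISIZE trailer fields of each. It only confirms that the checksums match; it says nothing about how TC's detection worked internally.

```python
import base64
import struct
import zlib

# DeFlar's two-member archive, as posted earlier in the thread.
BLOB = "H4sIAEOzMloA/3u/e39afj4vFwCoLPuVCAAAAB+LCABxszJaAP97v3t/Wn4+LxcAqCz7lQgAAAA="
raw = base64.b64decode(BLOB)

# Assumption: both members have the same length (28 bytes each).
half = len(raw) // 2
first, second = raw[:half], raw[half:]

# Each member ends with CRC32 and ISIZE, both 32-bit little-endian.
crc1, isize1 = struct.unpack("<II", first[-8:])
crc2, isize2 = struct.unpack("<II", second[-8:])

# The payload of each member: BOM + "foo\r\n".
payload = "foo\r\n".encode("utf-8-sig")

print(hex(crc1), hex(crc2))          # 0x95fb2ca8 0x95fb2ca8
print(zlib.crc32(payload) == crc1)   # True
print(first[4:8] != second[4:8])     # True: only the MTIME header field differs
```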