[idea] [pending] split files at EOF markers

chrizoo · Post by *chrizoo » 2008-12-06, 23:16 UTC

Hi. Currently you can specify a fixed file size to split files in [menu: file>>split files].

Suggestion: in addition to splitting by file size, TC could offer to look for end-of-file (EOF) and beginning-of-file markers and split the file accordingly.
(This would for example come in handy for embedded image thumbnails, cover art in mp3 tags, etc. )

karlchen · Post by *karlchen » 2008-12-08, 15:48 UTC

Hi, chrizoo.

I am not quite sure what you have in mind when using the term "EOF marker".

As far as I recall only pure ASCII files once upon a time, light years ago, used ^Z (ASCII 26) in order to mark the end of the file.
I never see this ^Z anywhere any longer.

And what is a "begin of file marker"?

Karl

chrizoo · Post by *chrizoo » 2008-12-09, 00:36 UTC

I'm not really an expert in this domain unfortunately. But I have a command line tool, that claims to do the following:

... looks through image files for a second SOI (start of image) marker. If it finds one, it copies that to a new file until it reaches an EOI marker.

That way I can, for example, extract all hidden thumbnails in an image. This could come in quite handy for other files, too. For example, there are often images in tagged music files (such as cover art in mp3 files).
I also once had a case where I had to use a recovery program for a colleague's machine, and some of the recovered files had some trash before the beginning and after the end of the file.
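
A minimal sketch of what such a tool might do, assuming it simply scans for the JPEG marker bytes (SOI is FF D8, EOI is FF D9). The function name `extract_embedded_jpeg` is hypothetical and this is not the actual tool's code. A plain byte search like this can also hit the same byte pairs inside metadata, so it is only an approximation:

```python
from typing import Optional

SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def extract_embedded_jpeg(data: bytes) -> Optional[bytes]:
    """Return the bytes from the second SOI up to and including the next EOI.

    Returns None if there is no second SOI or no EOI after it.
    """
    first = data.find(SOI)
    if first == -1:
        return None
    # Look for a second SOI after the one that starts the outer image.
    second = data.find(SOI, first + len(SOI))
    if second == -1:
        return None
    end = data.find(EOI, second + len(SOI))
    if end == -1:
        return None
    return data[second:end + len(EOI)]
```

Writing the returned bytes to a new file would give the embedded thumbnail, provided the scan did not stop at a false match.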

I don't know about ^Z etc. but I thought there might be one or more special character combinations to mark the beginning or end of file. And in this case TC could just split the file accordingly. For example:

[data chunk 1 -- BOF -- data chunk 2 -- EOF--BOF-- data chunk 3 -- EOF -- data chunk 4]

splitting to (different files):

[data chunk1]
[data chunk2]
[data chunk3]
[data chunk4]
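
The split above could be sketched like this, assuming the user supplies the marker byte sequences. The markers `<BOF>` and `<EOF>` in the usage note and the function name `split_at_markers` are made up for illustration; no universal markers of this kind are assumed to exist:

```python
import re
from typing import List, Sequence


def split_at_markers(data: bytes, markers: Sequence[bytes]) -> List[bytes]:
    """Split data at every occurrence of any marker and drop empty pieces.

    The markers themselves are not included in the returned chunks.
    """
    # Try longer markers first so a marker that is a prefix of another
    # one does not shadow it.
    ordered = sorted(markers, key=len, reverse=True)
    pattern = b"|".join(re.escape(m) for m in ordered)
    return [chunk for chunk in re.split(pattern, data) if chunk]
```

With `data = b"chunk1<BOF>chunk2<EOF><BOF>chunk3<EOF>chunk4"` and the two hypothetical markers, this returns the four chunks in order. If a marker sequence happens to occur inside the payload, the file is split there as well, so the result is only as reliable as the markers are unique.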

There were a couple of other cases, where I thought this would be quite useful.

fenix_productions · Post by *fenix_productions » 2008-12-09, 01:12 UTC

2chrizoo
You can't do that in a simple way.

Some file / image types define their end in the header area. If there is something after it, viewers will not see that data and will drop the additional parts when saving. You can't split it easily because that would require a database of all known headers. Imagine how long processing files for splitting would take. You should also know that not every file has its size defined in a header (e.g. text files don't).

Your command line tool probably supports some types and works with them, but it will never work for all. It simply reads the file's header and extracts all data according to it. If there is something more, it will be placed in an additional file.
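
As a rough illustration of header-driven extraction, here is a sketch for one format only. BMP is picked as an assumption because its header stores the total file size as a 4-byte little-endian integer at offset 2, right after the "BM" signature. The function name `split_bmp_and_trailer` is hypothetical, and some writers are known to fill this field incorrectly, so a real tool would need more checks:

```python
import struct
from typing import Tuple


def split_bmp_and_trailer(data: bytes) -> Tuple[bytes, bytes]:
    """Split data into (bitmap, trailing bytes) using the size in the BMP header."""
    if len(data) < 6 or data[:2] != b"BM":
        raise ValueError("not a BMP file")
    # Bytes 2..5 hold the declared file size, little-endian unsigned 32-bit.
    (declared_size,) = struct.unpack_from("<I", data, 2)
    if declared_size < 6 or declared_size > len(data):
        raise ValueError("declared size does not fit the data")
    return data[:declared_size], data[declared_size:]
```

Anything after the declared size ends up in the second element, which is the part that would go to the additional file. Every other format needs its own rule, which is the database problem described above.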

EOF can be something completely different from end of image. So your sample should look like:

[BOF -- HEAD -- data chunk 1 -- EODS -- HEAD -- data chunk 2 -- EODS -- HEAD -- data chunk 3 -- EODS -- EOF]

where:
BOF - beginning of file,
HEAD - structure header,
EODS - end of data structure (supposed end, according to header),
EOF - end of file.

Some files have:
[BOF -- HEAD -- data chunk 1 -- EODS -- EOF]

If you have the file looking like:

[BOF -- HEAD -- data chunk 1 -- EODS -- something -- EOF]

it will be cut to a valid form when re-saved in many tools (but only if the header defines the data structure size).

Another way of predicting is to look for the signatures that known file types begin with (e.g. MZ for executables, BM for bitmaps, PK for archives) and assume that the file can be split wherever such a signature is found.

The two most obvious disadvantages come to mind:
1. a file can be corrupted by splitting where no split is needed (a signature match does not prove that a new file starts there; e.g. "MZ" appears 7 times in totalcmd.exe, yet it is a single executable),
2. a database of common format signatures is still needed.
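
The signature approach and its first disadvantage can be sketched as follows. The three-entry table only covers the signatures named above and the function name `find_signature_offsets` is hypothetical; a usable tool would need a far larger database:

```python
from typing import Dict, List

# Tiny stand-in for the "database" of known signatures.
SIGNATURES = {
    b"MZ": "executable",
    b"BM": "bitmap",
    b"PK": "archive",
}


def find_signature_offsets(data: bytes) -> Dict[str, List[int]]:
    """Return every offset at which each known signature occurs in data."""
    hits: Dict[str, List[int]] = {}
    for magic, label in SIGNATURES.items():
        offsets = []
        pos = data.find(magic)
        while pos != -1:
            offsets.append(pos)
            pos = data.find(magic, pos + 1)
        hits[label] = offsets
    return hits
```

Every offset in the result is only a candidate split point. Two-byte signatures occur by chance in ordinary data, so more than one hit for "executable" does not mean more than one executable is present.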

chrizoo · Post by *chrizoo » 2008-12-09, 01:29 UTC

I understand all of that, but you are talking about the individual structure of files (headers, chunks, etc.), whereas I thought that all files (regardless of their file format) have BOF and EOF byte(s?) and these were the same for any type of file. For example:

BOF -- mp3 file -- EOF
BOF -- jpeg file -- EOF

I thought there should be one clearly defined byte sequence for BOF and EOF (or at least not more than 3 or 4 allowed sequences). But it might not be the case at all. I just got the impression that it would be like this, when reading the above text in italics.

So there isn't necessarily a BOF and EOF marker in a file and (e.g.) NTFS only knows where a file starts and ends because it stores the beginning position and the exact file length ?

fenix_productions · Post by *fenix_productions » 2008-12-09, 02:03 UTC

chrizoo wrote:I just got the impression that it would be like this, when reading the above text in italics.

That part is about EOI (end of image, which belongs to the data structure), not EOF.

EOF is a symbolic value returned by the operating system which tells you that no more data belongs to the file. It is not related to structure info but to the file size limit. Theoretically you can't write something after that. You can write after the data structure but not after EOF. At least that is how I see it, but I might be wrong.
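
A small sketch of how this looks from a program's point of view, assuming Python's standard file API. In binary mode `read()` returns an empty result once the size recorded by the file system is reached, and a ^Z byte (0x1A) inside the data does not end the file. The names `read_all_in_blocks` and `demo` are made up for this example:

```python
import os
import tempfile
from typing import Tuple


def read_all_in_blocks(path: str, block_size: int = 4096) -> bytes:
    """Read a file block by block until the OS reports end of file."""
    chunks = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:  # empty result: no more data belongs to the file
                break
            chunks.append(block)
    return b"".join(chunks)


def demo() -> Tuple[bytes, int]:
    """Write a payload containing ^Z, then read it back and ask for its size."""
    payload = b"before\x1aafter"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.bin")
        with open(path, "wb") as f:
            f.write(payload)
        return read_all_in_blocks(path, block_size=4), os.path.getsize(path)
```

The whole payload comes back, including the bytes after 0x1A, and the reported size matches its length. The end of the file is determined by the stored size, not by any byte in the content.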

chrizoo wrote:So there isn't necessarily a BOF and EOF marker in a file and (e.g.) NTFS only knows where a file starts and ends because it stores the beginning position and the exact file length ?

NTFS stores in the MFT information about:

wiki wrote:"all files on the volume, including file names, timestamps, stream names, and lists of cluster numbers where data streams reside, indexes, security identifiers, and file attributes like "read only", "compressed", "encrypted", etc."

IIRC the size is calculated by the operating system from the MFT info.

chrizoo · Post by *chrizoo » 2008-12-09, 04:17 UTC

So to put it simply (and not get lost in the details):

If you take any given file (regardless of its content), there is no such thing as a BOF and EOF byte sequence INDEPENDENT from its file format?

I would have thought there is FOR EXAMPLE a three-byte BOF marker 89 50 4E (hex) and an EOF marker 42 60 82 that would unambiguously identify the start/end of ANY file. (values here completely arbitrary)

fenix_productions · Post by *fenix_productions » 2008-12-09, 08:38 UTC

chrizoo wrote:If you take any given file (regardless of its content), there is no such thing as a BOF and EOF byte sequence INDEPENDENT from its file format?

Indeed.

Hacker · Post by *Hacker » 2008-12-09, 14:26 UTC

chrizoo,

If you take any given file (regardless of its content), there is no such thing as a BOF and EOF byte sequence INDEPENDENT from its file format?

It's just like with a normal paper page. Where it starts and where it ends is not defined by a piece of text "start reading here" and "stop reading there".

Roman

derBeobachter · Post by *derBeobachter » 2008-12-19, 17:57 UTC

Well, I think the idea of a 'split here' marker is a good one. Why not let the user specify which marker character(s) to use? (FF is another good candidate.)