[idea] [pending] split files at EOF markers

chrizoo
Senior Member
Posts: 349
Joined: 2008-03-12, 02:42 UTC

Post by *chrizoo »

Hi. Currently, [menu: file>>split files] only lets you split files at a fixed file size.

Suggestion: in addition to fixed sizes, TC could offer to look for end-of-file (EOF) and beginning-of-file markers and split the file accordingly.
(This would come in handy, for example, for embedded image thumbnails, cover art in mp3 tags, etc.)
karlchen
Power Member
Posts: 4603
Joined: 2003-02-06, 22:23 UTC
Location: Germany

Post by *karlchen »

Hi, chrizoo.

I am not quite sure what you have in mind when using the term "EOF marker".

As far as I recall, only pure ASCII files used ^Z (ASCII 26) to mark the end of the file, and that was ages ago.
I never see this ^Z anywhere any more.

And what is a "begin of file marker"?

Karl
chrizoo
Senior Member
Posts: 349
Joined: 2008-03-12, 02:42 UTC

Post by *chrizoo »

I'm not really an expert in this domain, unfortunately. But I have a command line tool that claims to do the following:

... looks through image files for a second SOI (start of image) marker. If it finds one, it copies that to a new file until it reaches an EOI marker.

That way I can, for example, extract all hidden thumbnails in an image. This could come in quite handy for other files, too. For example, tagged music files often contain images (such as cover art in mp3 files).
I also once had a case where I had to use a recovery program on a colleague's machine, and some of the recovered files had some trash before the beginning and after the end of the file.
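
For what it's worth, the SOI/EOI idea quoted above could be sketched roughly like this in Python. This is only a naive illustration, not how the tool itself works: it assumes the JPEG markers FF D8 (SOI) and FF D9 (EOI), and the function name `extract_embedded_jpegs` is made up for this example. A plain byte search can also hit those two bytes by coincidence inside metadata, so a real tool would have to parse the segment structure.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def extract_embedded_jpegs(data: bytes) -> list:
    """Return every byte range running from a second (or later) SOI to the next EOI."""
    found = []
    # Start searching at offset 2 so the SOI of the outer image is skipped.
    pos = data.find(SOI, 2)
    while pos != -1:
        end = data.find(EOI, pos + 2)
        if end == -1:
            break  # no closing marker left, nothing more to extract
        found.append(data[pos:end + 2])
        pos = data.find(SOI, end + 2)
    return found
```

Each returned item could then be written to its own file, which is the "split at markers" behaviour suggested here, limited to one known format.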

I don't know about ^Z etc., but I thought there might be one or more special character combinations that mark the beginning or end of a file. In that case TC could just split the file accordingly. For example:

[data chunk 1 -- BOF -- data chunk 2 -- EOF -- BOF -- data chunk 3 -- EOF -- data chunk 4]

splitting to (different files):

[data chunk 1]
[data chunk 2]
[data chunk 3]
[data chunk 4]


There were a couple of other cases where I thought this would be quite useful.
fenix_productions
Power Member
Posts: 1979
Joined: 2005-08-07, 13:23 UTC
Location: Poland

Post by *fenix_productions »

2chrizoo
You can't do that in a simple way.

Some file / image types define their end in the header area. If there is anything after that point, viewers will not see that data and will cut off the extra part when saving. You can't split such files easily because that would require a database of all known headers. Imagine how long processing files for splitting would take. You should also know that not every file has its size defined in a header (e.g. text files don't).

Your command line tool probably supports some types and works with them, but it will never work for all. It simply reads the file's header and extracts all data according to it. If there is something more, it is placed in an additional file.

EOF can be something completely different from end of image. So your sample should look like:

[BOF -- HEAD -- data chunk 1 -- EODS -- HEAD -- data chunk 2 -- EODS -- HEAD -- data chunk 3 -- EODS -- EOF]

where:
BOF - beginning of file,
HEAD - structure header,
EODS - end of data structure (supposed end, according to header),
EOF - end of file.

Some files have:
[BOF -- HEAD -- data chunk 1 -- EODS -- EOF]

If you have a file that looks like:

[BOF -- HEAD -- data chunk 1 -- EODS -- something -- EOF]

it will be cut to a valid form after re-saving it in many tools (but only if the header defines the data structure size).
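
As a rough illustration of the header-defined-size case, here is a minimal Python sketch using BMP, whose file header stores the total file size as a 4-byte little-endian value at offset 2. The function name `split_bmp_trailing` is hypothetical, and the sketch assumes the size field is trustworthy, which is not guaranteed for every BMP writer.

```python
import struct


def split_bmp_trailing(data: bytes):
    """Split a BMP byte string into (declared image, whatever follows it)."""
    if len(data) < 6 or data[:2] != b"BM":
        raise ValueError("not a BMP file")
    # Bytes 2..5 of the BMP file header hold the total file size (little-endian).
    (declared_size,) = struct.unpack_from("<I", data, 2)
    if declared_size < 6 or declared_size > len(data):
        raise ValueError("size field does not fit the data")
    # Everything past the declared size is the "something" before EOF.
    return data[:declared_size], data[declared_size:]
```

This only works because the format says where its data ends. For a format without such a field (plain text, for example) there is nothing comparable to read.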

Another way of predicting split points is to look for the beginnings of known file types (e.g. MZ for executables, BM for bitmaps, PK for archives) and assume that the file can be split wherever such a signature is found.

The two most obvious disadvantages that come to mind:
1. a file can be corrupted by splitting it where no split is needed, even though a signature was found (e.g. "MZ" appears in totalcmd.exe 7 times, but it is a single executable),
2. a database of common format signatures is still needed.
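
A small Python sketch of that signature scan, just to show why false positives are a problem. The three-entry `SIGNATURES` table and the function name `find_signature_offsets` are assumptions made for this example; a real database would be far larger.

```python
# Tiny stand-in for a signature database (illustrative only).
SIGNATURES = {
    b"MZ": "executable",
    b"BM": "bitmap",
    b"PK": "archive",
}


def find_signature_offsets(data: bytes) -> dict:
    """Map each signature label to every offset where its bytes occur."""
    hits = {}
    for magic, label in SIGNATURES.items():
        offsets = []
        pos = data.find(magic)
        while pos != -1:
            offsets.append(pos)
            pos = data.find(magic, pos + 1)
        hits[label] = offsets
    return hits
```

Two-byte signatures like these turn up by chance in ordinary data, so every offset after the first is only a candidate split point, not proof that another file starts there.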
"When we created the poke, we thought it would be cool to have a feature without any specific purpose." Facebook...

#128099
chrizoo
Senior Member
Posts: 349
Joined: 2008-03-12, 02:42 UTC

Post by *chrizoo »

I understand all of that, but you are talking about the individual structure of files (headers, chunks, etc.), whereas I thought that all files (regardless of their file format) have BOF and EOF byte(s?) and these were the same for any type of file. For example:

BOF -- mp3 file -- EOF
BOF -- jpeg file -- EOF

I thought there should be one clearly defined byte sequence for BOF and EOF (or at least not more than 3 or 4 allowed sequences). But it might not be the case at all. I just got the impression that it would be like this, when reading the above text in italics.

So there isn't necessarily a BOF and EOF marker in a file and (e.g.) NTFS only knows where a file starts and ends because it stores the beginning position and the exact file length ?
fenix_productions
Power Member
Posts: 1979
Joined: 2005-08-07, 13:23 UTC
Location: Poland

Post by *fenix_productions »

chrizoo wrote: I just got the impression that it would be like this, when reading the above text in italics.
That part is about EOI (end of image, i.e. the end of a data structure), not EOF.

EOF is a symbolic value returned by the operating system to signal that no more data belongs to the file. It is not related to structure info but to the file's size. Theoretically, you can't write anything after it: you can write after a data structure, but not after EOF. At least that is how I see it, but I might be wrong ;)
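
To make the "EOF is a condition, not a byte" point concrete, here is a minimal Python sketch. The helper names `read_until_eof` and `demo` are made up for this example. It writes a file that contains the old ^Z byte (0x1A) in the middle and shows that, in binary mode, reading only stops when the read call returns nothing, while the length comes from the file system's bookkeeping.

```python
import os
import tempfile


def read_until_eof(path: str, chunk_size: int = 4) -> bytes:
    """Read a file chunk by chunk until the OS reports end of file."""
    parts = []
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:  # empty result: no more bytes belong to the file
                break
            parts.append(chunk)
    return b"".join(parts)


def demo() -> tuple:
    """Return (size reported by the file system, bytes actually read)."""
    # The payload has ^Z (0x1A) in the middle, followed by more data.
    payload = b"abc\x1adef"
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    try:
        return os.path.getsize(path), read_until_eof(path)
    finally:
        os.remove(path)
```

The 0x1A byte is returned like any other byte, so nothing stored in the file content marks where the file ends.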
chrizoo wrote: So there isn't necessarily a BOF and EOF marker in a file and (e.g.) NTFS only knows where a file starts and ends because it stores the beginning position and the exact file length?
NTFS stores information in the MFT about:
wiki wrote: "all files on the volume, including file names, timestamps, stream names, and lists of cluster numbers where data streams reside, indexes, security identifiers, and file attributes like "read only", "compressed", "encrypted", etc."
IIRC, the size is calculated by the operating system from the MFT info.

chrizoo
Senior Member
Posts: 349
Joined: 2008-03-12, 02:42 UTC

Post by *chrizoo »

So to put it simply (and not get lost in the details):

If you take any given file (regardless of its content), there is no such thing as a BOF and EOF byte sequence INDEPENDENT from its file format?

I would have thought there is FOR EXAMPLE a three-byte BOF marker 89 50 4E (hex) and an EOF marker 42 60 82 that would unambiguously identify the start/end of ANY file. (values here completely arbitrary)
fenix_productions
Power Member
Posts: 1979
Joined: 2005-08-07, 13:23 UTC
Location: Poland

Post by *fenix_productions »

chrizoo wrote: If you take any given file (regardless of its content), there is no such thing as a BOF and EOF byte sequence INDEPENDENT from its file format?
Indeed.

Hacker
Moderator
Posts: 13081
Joined: 2003-02-06, 14:56 UTC
Location: Bratislava, Slovakia

Post by *Hacker »

chrizoo,
chrizoo wrote: If you take any given file (regardless of its content), there is no such thing as a BOF and EOF byte sequence INDEPENDENT from its file format?
It's just like with a normal paper page. Where it starts and where it ends is not defined by a piece of text "start reading here" and "stop reading there". ;)

Roman
Suppose you press Ctrl+F, select the FTP connection (with a saved password), but instead of clicking Connect you drop dead.
derBeobachter
Junior Member
Posts: 4
Joined: 2008-12-19, 16:22 UTC

Post by *derBeobachter »

Well, I think the idea of a 'split here' marker is a good one. Why not let the user specify which marker character(s) to use? (FF is another good candidate.)
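
A user-specified marker would be simple to sketch in Python. The function name `split_at_marker` and the `keep_marker` option are hypothetical, and this says nothing about how TC would implement it. It splits wherever the chosen byte sequence occurs, whether or not that occurrence was meant as a marker.

```python
def split_at_marker(data: bytes, marker: bytes, keep_marker: bool = False) -> list:
    """Split data at every occurrence of a user-chosen marker byte sequence."""
    if not marker:
        raise ValueError("marker must not be empty")
    pieces = data.split(marker)
    if keep_marker:
        # Re-attach the marker to the start of every piece that followed one.
        pieces = [pieces[0]] + [marker + piece for piece in pieces[1:]]
    # Drop empty pieces, e.g. when the data starts or ends with the marker.
    return [piece for piece in pieces if piece]
```

For example, `split_at_marker(b"one\x0ctwo", b"\x0c")` splits at a form feed byte and gives two pieces, each of which could be saved as its own file.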