[BUG] Help. What are limits of RegEx?
- MarkFilipak
- Member
- Posts: 164
- Joined: 2008-09-28, 01:00 UTC
- Location: Mansfield, Ohio
[BUG] Help. What are limits of RegEx?
UPDATE:
This: \x00\x00 RegEX (2) succeeds.
This: \x00\x00\x01 RegEX (2) fails.
What is going on with RegEx?
MORE UPDATE:
It looks like RegEx is totally broken.
====
How can I get this regex to work?
\x00\x00\x01\xB5(\x11.|\x12.|\x13.|\x14[^\x82]|\x15.|\x16.|\x17.|\x18.|\x19.|\x1A.|\x1B.|\x1C.|\x1D.|\x1E.|\x1F.)
This: \x00\x00\x01\xB5 RegEX (2) fails.
This: \x00\x00\x01\xB5 Hex succeeds.
This: \x00\x00\x01\xB5. Hex succeeds but selects only 4 bytes (the '.' is ignored).
This: \x00\x00\x01\xB5(\x11.|\x12.|\x13.|\x14[^\x82]|\x15.|\x16.|\x17.|\x18.|\x19.|\x1A.|\x1B.|\x1C.|\x1D.|\x1E.|\x1F.) RegEx (2) fails.
This: \x00\x00\x01\xB5(\x11.|\x12.|\x13.|\x14[^\x82]|\x15.|\x16.|\x17.|\x18.|\x19.|\x1A.|\x1B.|\x1C.|\x1D.|\x1E.|\x1F.) Hex fails.
What am I doing wrong?
Thanks so much,
Mark.
Of course you want to know what this is about, eh? I'm searching the binary contents of DVDs (VOB files) looking for particular 'sequence_extension' metadata.
I have yet to see a DVD that has this: 0x00 00 01 B5 1? ??, where '? ??' is other than '4 82' (i.e. MP@ML plus !progressive_sequence plus 4:2:0). (Note '<<== all DVDs?' in the table, below.) The above regex performs such a search and reports 0x00 00 01 B5 1 plus something ('? ??'), but not '4 82'.
The pattern is the 'sequence_extension' header ID metadata followed by 'profile_and_level_indication' -- the combinations are shown in the table, below -- which is followed by 'progressive_sequence' followed by 'chroma_format'.
0x00 00 01 B5 11 2 High@HighP
0x00 00 01 B5 11 4 High@High
0x00 00 01 B5 11 6 High@High1440
0x00 00 01 B5 11 8 High@Main
0x00 00 01 B5 11 A High@Low
0x00 00 01 B5 12 2 SpatiallyScalable@HighP
0x00 00 01 B5 12 4 SpatiallyScalable@High
0x00 00 01 B5 12 6 SpatiallyScalable@High1440
0x00 00 01 B5 12 8 SpatiallyScalable@Main
0x00 00 01 B5 12 A SpatiallyScalable@Low
0x00 00 01 B5 13 2 SNRScalable@HighP
0x00 00 01 B5 13 4 SNRScalable@High
0x00 00 01 B5 13 6 SNRScalable@High1440
0x00 00 01 B5 13 8 SNRScalable@Main
0x00 00 01 B5 13 A SNRScalable@Low
0x00 00 01 B5 14 2 Main@HighP
0x00 00 01 B5 14 4 Main@High
0x00 00 01 B5 14 6 Main@High1440
0x00 00 01 B5 14 8 Main@Main <<== all DVDs?
0x00 00 01 B5 14 A Main@Low
0x00 00 01 B5 15 2 Simple@HighP
0x00 00 01 B5 15 4 Simple@High
0x00 00 01 B5 15 6 Simple@High1440
0x00 00 01 B5 15 8 Simple@Main
0x00 00 01 B5 15 A Simple@Low
0x00 00 01 B5 18 E Multi-view@Low
0x00 00 01 B5 18 D Multi-view@Main
0x00 00 01 B5 18 B Multi-view@High1440
0x00 00 01 B5 18 A Multi-view@High
0x00 00 01 B5 18 5 4:2:2@Main
0x00 00 01 B5 18 2 4:2:2@High
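For reference, the byte layout behind the table can be sketched in Python. This is only an illustration: the function name `decode_sequence_extension` is made up here, and the bit positions assume the MPEG-2 video layout in which the start code 0x000001B5 is followed by a 4-bit extension ID (0001), the 8-bit 'profile_and_level_indication', the 1-bit 'progressive_sequence' and the 2-bit 'chroma_format'. That assumption agrees with the '4 82' reading given above.

```python
# Hypothetical helper: decodes the first two bytes after 00 00 01 B5.
# Assumed bit layout: 0001 pppp | pppp s cc x
#   p = profile_and_level_indication, s = progressive_sequence,
#   c = chroma_format, x = first bit of the next field (ignored here).
CHROMA_NAMES = {1: "4:2:0", 2: "4:2:2", 3: "4:4:4"}


def decode_sequence_extension(buf: bytes, pos: int = 0) -> dict:
    if len(buf) < pos + 6:
        raise ValueError("need at least 6 bytes at pos")
    if buf[pos:pos + 4] != b"\x00\x00\x01\xB5":
        raise ValueError("no extension start code at pos")
    b4, b5 = buf[pos + 4], buf[pos + 5]
    if b4 >> 4 != 0x1:
        raise ValueError("extension is not a sequence_extension")
    return {
        "profile_and_level": ((b4 & 0x0F) << 4) | (b5 >> 4),
        "progressive_sequence": (b5 >> 3) & 1,
        "chroma_format": (b5 >> 1) & 0b11,
    }


fields = decode_sequence_extension(bytes.fromhex("000001B51482"))
print(hex(fields["profile_and_level"]), CHROMA_NAMES[fields["chroma_format"]])
```

With the '14 82' bytes marked '<<== all DVDs?' in the table, this yields profile_and_level 0x48 (Main@Main), progressive_sequence 0 and chroma_format 1 (4:2:0).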
Last edited by MarkFilipak on 2020-10-12, 16:40 UTC, edited 1 time in total.
Hi Christian! Delighted customer since 1999. License #37627
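As a point of comparison outside Total Commander, the same search can be sketched with Python's `re` module on a bytes pattern, where \x00 is an ordinary byte. This is a sketch and not TC's engine: the names `SEQ_EXT` and `find_sequence_extensions` are invented for illustration, the character class is just a compact form of the \x11 ... \x1F alternation, and `re.DOTALL` is needed so that '.' also matches the byte 0x0A.

```python
import re

# Same intent as the pattern in the post: 00 00 01 B5, then 0x11-0x1F
# followed by any byte, except that 0x14 must not be followed by 0x82.
SEQ_EXT = re.compile(
    rb"\x00\x00\x01\xB5(?:[\x11-\x13\x15-\x1F].|\x14[^\x82])",
    re.DOTALL,
)


def find_sequence_extensions(data: bytes) -> list:
    """Return the byte offsets of every match in data."""
    return [m.start() for m in SEQ_EXT.finditer(data)]


sample = bytes.fromhex("000001B51482" "FF" "000001B5110A" "000001B51480")
print(find_sequence_extensions(sample))  # [7, 13]
```

The first header in the sample ('14 82') is skipped, while the '11 0A' and '14 80' headers are reported.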
- elgonzo
- Power Member
- Posts: 872
- Joined: 2013-09-04, 14:07 UTC
- MarkFilipak
Re: Help. What are limits of RegEx?
Thanks for the bad news. Well, that makes RegEx a sham, doesn't it?
- elgonzo
Re: Help. What are limits of RegEx?
MarkFilipak wrote: 2020-10-12, 16:23 UTC
"Thanks for the bad news. Well, that makes RegEx a sham, doesn't it?"
No, it's not a sham. Why would it be a sham? It is a limitation, which makes TC's regex limited, but not a sham.
Less angry hyperbole, please...
- MarkFilipak
Re: Help. What are limits of RegEx?
Actually, what you wrote is incorrect. It's not that RegEx can't find '\x00'. It does.
The problem is that when it finds '\x00', it selects the '\x00' plus the next byte (as though it had searched for '\x00.'). The search index is now off by 1 byte, so the remainder of the search string fails (even though the target does exist).
This is just a bug.
- elgonzo
Re: Help. What are limits of RegEx?
MarkFilipak wrote: 2020-10-12, 16:39 UTC
"Actually, what you wrote is incorrect. It's not that RegEx can't find '\x00'. It does."
TC does not work reliably with \x00, as you found out. As i mentioned in the post i linked to, Ghisler has mentioned already in the past that \x00 and \x0000 don't really work reliably. At this point it is moot to argue whether regex patterns with \x00 or \x0000 match something incorrectly, or not match at all, because it boils down to the same thing: Patterns with \x00 or \x0000 don't really work, unfortunately.

- MarkFilipak
Re: Help. What are limits of RegEx?
elgonzo wrote: 2020-10-12, 16:44 UTC
MarkFilipak wrote: 2020-10-12, 16:39 UTC
"Actually, what you wrote is incorrect. It's not that RegEx can't find '\x00'. It does."
"TC does not work reliably with \x00, as you found out. As i mentioned in the post i linked to, Ghisler has mentioned already in the past that \x00 and \x0000 don't really work reliably. At this point it is moot to argue whether regex patterns with \x00 or \x0000 match something incorrectly, or not match at all, because it boils down to the same thing: Patterns with \x00 or \x0000 don't really work, unfortunately."
You should not dismiss a bug as an undocumented design 'feature'. It's a bug. It should be fixed.
- elgonzo
Re: [BUG] Help. What are limits of RegEx?
My comment about it being a limitation was a response to you exclaiming "Sham", and the only thing it is intended to be dismissive of is this exclamation of "Sham". It's not something i pulled out my nose either; look what Ghisler wrote here some time ago: https://www.ghisler.ch/board/viewtopic.php?p=224760#p224760
By the way, if you look around in the forum, you'll notice several other users in the past having stumbled over the \x00 / \x0000 issue. It would indeed be nice and much better if this were to be fixed (in the sense that patterns with \x00 and \x0000 are properly functioning), i am not disagreeing with you in this regard (but short of this becoming reality, TC or its help file should spell out this limitation and not let users run into and troubleshoot the issue again and again and again...)
- MarkFilipak
Re: [BUG] Help. What are limits of RegEx?
elgonzo wrote: 2020-10-12, 16:58 UTC
"My comment about it being a limitation was a response to you exclaiming 'Sham', and the only thing it is intended to be dismissive of is this exclamation of 'Sham'. It's not something i pulled out my nose either; look what Ghisler wrote here some time ago: https://www.ghisler.ch/board/viewtopic.php?p=224760#p224760"
Thank you for that... kind of you.
I posted there.
Re: [BUG] Help. What are limits of RegEx?
I cannot really give any real support on this issue(!), but I notice that probably nobody has tried yet,
in Total Commander to use the regex library of 'Everything' in a search query.
Please see: Search queries in TC using 'Everything' - point 3 RegEx - Regular Expressions
'Everything' uses 'Perl Compatible Regular Expressions (PCRE)'.
Here it is stated that there was a successful query, e.g. with Notepad++.
Notepad++ regular expressions use the Boost regular expression library v1.70,
which is based on PCRE (Perl Compatible Regular Expression) syntax, only departing from it in very minor ways.
Windows 10 Pro (x64) Version 2004 (OS build 19041.546)
TC 9.51 x64/x86 | 'Everything'-Version 1.4.1.993 (x64)
☑ 'Everything' | Search queries: TC <=> 'Everything'
- MarkFilipak
Re: [BUG] Help. What are limits of RegEx?
tuska wrote: 2020-10-12, 21:16 UTC
"I cannot really give any real support on this issue(!), but I notice that probably nobody has tried yet,
in Total Commander to use the regex library of 'Everything' in a search query.
Please see: Search queries in TC using 'Everything' - point 3 RegEx - Regular Expressions
'Everything' uses 'Perl Compatible Regular Expressions (PCRE)'."
I believe that's the regexp with which I'm familiar. I don't know what Everything is.
"Here it is stated that there was a successful query, e.g. with Notepad++."
I didn't understand what was being discussed there. I didn't know what this: "Problem is that he need to search files that contains 00 bytes only (entire file filled with 00 bytes), but not files that contain at least one 00 byte", meant.
"Notepad++ regular expressions use the Boost regular expression library v1.70,
which is based on PCRE (Perl Compatible Regular Expression) syntax, only departing from it in very minor ways."
I think that's what I've used. I don't use POSIX, ever.
I think that the problem with regular expression processing is that it's character/line oriented, so it is crippled by an architecture that can't properly address the entire [\x00-\xFF] range. I've researched a Linux method you might want to comment on:
Code:
# Implements this regexp: \x00\x00\x01\xB5(\x11.|\x12.|\x13.|\x14[^\x82]|\x15.|\x16.|\x17.|\x18.|\x19.|\x1A.|\x1B.|\x1C.|\x1D.|\x1E.|\x1F.)
# xxd -p -u   converts the bytes of FILENAME into textual hex nibbles: 0 1 2 3 4 5 6 7 8 9 A B C D E F.
# tr -d '\n'  deletes the '\n's that xxd inserts, turning the lines of nibbles into one huge string.
# grep -E -c  finds '000001B51???' where '???' is not '482' and prints '0' (not found) or '1' (found),
#             because -c counts matching lines and the input is a single line.
xxd -p -u FILENAME | tr -d '\n' | grep -E -c '000001B51([0-35-9A-F]|4([0-79A-F]|8[0134-9A-F]))'
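One thing to watch with a nibble string is alignment: a hit can start in the middle of a byte, which the original byte-wise regexp would never report. The following Python sketch applies the same nibble pattern but keeps only hits that start on a byte boundary. The names `NIBBLES` and `aligned_hits` are invented for this illustration, and `bytes.hex()` stands in for the `xxd | tr` stage.

```python
import re

# Same nibble pattern as the grep command, wrapped in a lookahead so that
# every candidate position is examined, including overlapping ones.
NIBBLES = re.compile(
    r"(?=000001B51(?:[0-35-9A-F]|4(?:[0-79A-F]|8[0134-9A-F])))"
)


def aligned_hits(data: bytes) -> list:
    """Return byte offsets of hits that begin on an even nibble index."""
    hex_text = data.hex().upper()
    return [
        m.start() // 2
        for m in NIBBLES.finditer(hex_text)
        if m.start() % 2 == 0
    ]


# '10 00 00 1B 51 00' contains '000001B510' starting at nibble 1,
# which a plain grep on the nibble string would count as a hit.
print(aligned_hits(bytes.fromhex("1000001B5100")))  # []
print(aligned_hits(bytes.fromhex("000001B51480")))  # [0]
```

A plain `grep -c` on the nibble string would report the first sample as found, so a count of '1' from the pipeline may be worth confirming in a hex viewer.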
Re: [BUG] Help. What are limits of RegEx?
Search queries in TC using 'Everything' wrote:
"As of Total Commander 9.0, the tool 'Everything' can be integrated into a search query with its own search parameters."
In my signature above there are links to this tool and a documentation to help.
Based on your RegEx examples above (works/does not work) I just tried to give the hint,
that a RegEx query would also be possible by using TC [TRegExpr] with integration of the tool 'Everything', which uses PCRE.
If my assumption concerning the solution with Notepad++ was wrong, I am very sorry.
As already mentioned above, I cannot give you professional support due to insufficient knowledge (e.g. regarding RegEx queries).
Regards,
Karl
Re: [BUG] Help. What are limits of RegEx?
Regexp is not meant for binary search or for signature matching. It is a character-matching library and in most cases works on a single line. Most implementations match against a string representation with a dedicated charset/string encoding.
And in most cases, depending on the charset, Unicode, ... \x00 does not match \x00!
This does not solve your issue, but it at least explains the misuse of the hex expression for a string-based search.
Converting a binary file to a text file with a hex representation could solve your issue, since you would be looking for plain numbers in the regex. With large VOB files, it is probably better to first extract the metadata only and then convert!
Also, you could check whether media info can help you get some of the info you need from the VOB file.
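The charset point can be illustrated with a small Python sketch. It only shows how decoding changes what a text pattern sees and does not claim to reproduce Total Commander's behaviour. The two codecs are chosen purely for illustration.

```python
import re

data = bytes.fromhex("000001B5")

# latin-1 maps every byte to one character, so two NUL bytes stay two NULs.
as_latin1 = data.decode("latin-1")

# UTF-16-LE pairs the bytes up: 00 00 -> U+0000 and 01 B5 -> U+B501,
# so the decoded text holds only a single NUL character.
as_utf16 = data.decode("utf-16-le")

text_pattern = re.compile(r"\x00\x00")
print(text_pattern.search(as_latin1) is not None)  # True
print(text_pattern.search(as_utf16) is not None)   # False

# A bytes pattern on the raw data is independent of any charset.
print(re.search(rb"\x00\x00", data) is not None)   # True
```

The same four bytes match or fail depending on how they were turned into characters, which is one way a string-based engine can appear to miss \x00.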
- MarkFilipak
Re: [BUG] Help. What are limits of RegEx?
nsp wrote: 2020-10-13, 12:55 UTC
"Regexp is not meant for binary search or for signature matching. It is a character-matching library and in most cases works on a single line. Most implementations match against a string representation with a dedicated charset/string encoding."
I think that grep is for character search and that regexp is a tool for matching patterns that include any combination of bits. I think that grep is hobbled by an original design that was unnecessarily narrow and case specific. Respectfully, I think you may be conflating grep and the underlying regexp and attributing the shortcomings of grep to regexp, for example, limiting patterns to lines.
"And in most cases, depending on the charset, Unicode, ... \x00 does not match \x00!"
I don't understand '\x00 does not match \x00' -- a negative tautology? \x00 is \x00. Are you saying that regexp cannot handle such a pattern? Since that is clearly untrue, I'm unsure what you mean, but no matter.
"This does not solve your issue, but it at least explains the misuse of the hex expression for a string-based search."
I fail to understand your assertion that a hex-based pattern is a misuse of regexp. Clearly regexp includes hex.
"Converting a binary file to a text file with a hex representation could solve your issue, since you would be looking for plain numbers in the regex. With large VOB files, it is probably better to first extract the metadata only and then convert!"
Can you suggest how I can extract metadata without regexp to do the search? I see no alternative.
"Also, you could check whether media info can help you get some of the info you need from the VOB file."
Do you mean the Mediainfo application? I need to parse everything in an MPEG stream. Mediainfo's resolution is far too narrow. What I'm looking for is the answer to whether, in practice, DVDs always use the MP@ML profile & level or always have 4:2:0 samples. It seems to me that I need to search a large number of VOBs to answer such questions. Even if it answered such questions, Mediainfo doesn't support either query or bulk search.
Thanks for your comments.
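For the bulk search over many VOBs, one possible approach is a short Python script that reads each file in chunks and applies a bytes regexp. This is a sketch under assumptions: `first_hit` and `scan_tree` are hypothetical names, files are assumed to end in '.VOB', and the pattern is a compact form of the regexp from the first post.

```python
import re
from pathlib import Path

PATTERN = re.compile(
    rb"\x00\x00\x01\xB5(?:[\x11-\x13\x15-\x1F].|\x14[^\x82])",
    re.DOTALL,
)
OVERLAP = 5  # pattern length (6 bytes) minus 1, carried between chunks


def first_hit(path, chunk_size=1 << 20):
    """Return the file offset of the first match, or None."""
    tail = b""
    offset = 0  # file offset of the first byte of the current window
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            window = tail + chunk
            match = PATTERN.search(window)
            if match:
                return offset + match.start()
            # Keep the last bytes so a header split across chunks is found.
            tail = window[-OVERLAP:]
            offset += len(window) - len(tail)
    return None


def scan_tree(root):
    """Map each .VOB file under root that has a hit to its first offset."""
    hits = {}
    for path in sorted(Path(root).rglob("*.VOB")):
        pos = first_hit(path)
        if pos is not None:
            hits[str(path)] = pos
    return hits
```

Calling `scan_tree` on a folder of ripped discs would list only the files whose 'sequence_extension' differs from '14 82', which is the question being asked. Reading in chunks keeps memory use flat even for 1 GB VOBs.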
Re: [BUG] Help. What are limits of RegEx?
MarkFilipak wrote: 2020-10-12, 15:34 UTC
"Of course you want to know what this is about, eh? I'm searching the binary contents of DVDs (VOB files) looking for particular 'sequence_extension' metadata.
I have yet to see a DVD that has this: 0x00 00 01 B5 1? ??, where '? ??' is other than '4 82' (i.e. MP@ML plus !progressive_sequence plus 4:2:0). (Note '<<== all DVDs?' in the table, below.) The above regex performs such a search and reports 0x00 00 01 B5 1 plus something ('? ??'), but not '4 82'.
The pattern is the 'sequence_extension' header ID metadata followed by 'profile_and_level_indication' -- the combinations are shown in the table, below -- which is followed by 'progressive_sequence' followed by 'chroma_format'."
Probably a bit OffTopic, but why don't you use more or less "dedicated" tools for this? E.g. VobEdit can show/interpret the sequence extension quite clearly. Just open the first vob file of the (main DVD) video stream and navigate to the first video pack / I frame pack.
AFAIR dgindex can show the basic stream information (level@profile) as well, like probably a lot of other tools.
And back to topic: TC's limitations when it comes to splitting file content on newlines were the main reason why I started PCREsearch plugin. Your RegEx will probably work with it.
TC plugins: PCREsearch and RegXtract