[WCX] RegXtract - String Extractor with RegEx
Moderators: Hacker, petermad, Stefan2, white
RegXtract packer plug-in
This is a Packer plug-in that searches file content with Regular Expressions and extracts the search results to an output file.
So what's the deal?
While it might be useful to know that a particular search string exists in a file, and of course where exactly in the file it can be found,
in some situations it is even more useful to save all resulting strings so you can work with them.
For example, if you want to extract all URLs from an HTML file (<a href="URL"...),
you can do a Regular Expression search with a lot of editors and tools quite easily, but saving the results is often a pain,
especially if you have a lot of results and/or a lot of files to search.
This plug-in helps you do this quite easily.
Just type the Regular Expression and, optionally, a detailed Replacement String which can reference subgroups from the Expression.
Additionally, you have a full Search and Replace mode, and you can use the plug-in for visualizing Regular Expression results.
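A rough sketch of the idea (a hypothetical pattern of mine, not taken from the plug-in's readme): to extract the link targets from an HTML file, you could use the Expression
[face=courier]<a href="([^"]+)"[/face]
with the Replacement String
[face=courier]$1[/face]
so that each output line contains only the URL captured by the first subgroup.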
Unicode-only plug-in, 32+64 bit versions, source provided.
Based on Perl Compatible Regular Expressions (PCRE) library 8.36.
You can use the full complexity of Expressions valid for the PCRE library, including the singleline mode "(?s)",
also called dotall mode, which Total Commander can't provide.
Check the included HTML file for PCRE Syntax.
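For instance (my own illustration of the dotall mode): in
[face=courier](?s)<!--.*?-->[/face]
the leading (?s) makes the dot match line breaks too, so the Expression matches an HTML comment even when it spans several lines; without (?s) the dot stops at line breaks, and a multi-line comment would not match at all.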
Features
- Searching in most Unicode files, not just plain ANSI and binary
- Using any Regular Expression that is valid for the built-in PCRE library, which is largely compatible with Perl syntax and features
- Using up to 99 subgroups (for the replacement string): $1 - $99 and $0 (the whole expression),
plus any desired text, single bytes, the current result number, and the filename (see the example after this list)
- Unicode file names and Unicode Regular Expressions and replacement strings
- Long file names (path > 259 characters) usable
- Search and Replace strings - output file becomes input file with replaced strings/bytes - now also works for Unicode files!
- Suppress duplicate results - only the first occurrence of each individual string is issued
- Output the file name and line number from which each search result originates (line mode)
- Save and load presets
- "Misuse" the program to merge/combine files (in input sequence) with or without additional replacement
- "Misuse" the program to remove a BOM from or add one to a UTF-8 file, or to convert UTF-16 or ANSI files to UTF-8
- Visualizing Regular Expression results by typing a test string
- On-the-fly RegEx error check
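A hypothetical replacement example (pattern and input made up for illustration): with the Expression
[face=courier](\w+)@([\w.]+)[/face]
and the Replacement String
[face=courier]name: $1 - host: $2[/face]
an input containing "john@example.org" would yield the output line "name: john - host: example.org".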
Just select the files whose content you want to search,
choose "Pack files...", select the "RegXtract" extension, and use the "Configure..." dialog to make all settings.
New version 1.6!
totalcmd.net (32+64 bit + source)
SHA1: a7db5ce81a735ba5fc69eb6baf27b3603836696a
Old version 1.5:
Download (32+64 bit + source)
SHA1: a726544165a54448892af989e8b0c4d40a4c357c
Old version 1.1:
totalcmd.net
alternate hoster
SHA1: d6658111f2aeca06c1792842254761c38679f63d
Old version 1.0:
RegXtract_1_0.rar
SHA1: 5526000f8b8fa18b713775262dafbaee3b82ad0b
Please report bugs and give me some feedback.
Translators welcome!
As of version 1.5 there is a (single) .lng file provided, with each language code being in a new section of the same name.
If you think you have a good translation, send me an email via the board and I'll implement it in the next version.
Last edited by milo1012 on 2014-11-06, 03:49 UTC, edited 8 times in total.
here's a mirror:
http://xxeccon.xe.ohost.de/RegXtract_1_0.rar
i haven't tried the plugin yet, many thanks!
licenced and happy TC user since 1994 (#11xx)
TW wrote: here's a mirror:
http://xxeccon.xe.ohost.de/RegXtract_1_0.rar
Alextp wrote: http://wincmd.ru/files/wcx_RegXtract_1_0.rar
Many thx for the mirrors!
Alextp wrote: ...create an account. Then u can upload.
Well, it seems that I have to send my complete desired login data by mail.
May I change at least my password after that?
New Version 1.1!
Added: Font selection for Regex and Replace/Teststring box
Added: Backward search ("Prev") for Expression test
Added: The last configuration settings are now remembered and saved to the RegXtract.ini file -> reuse last settings when TC restarts
Added: "Delete file if copied" option (after Clipboard copy) -> 5 MiB limit
Added: Horizontal scroll for Replace/Teststring box
...
Changed: Increased height for Replace/Teststring box
...
Fixed: Possible crash in Replace String parser, especially in 64 bit version
Fixed: Copy to Clipboard could randomly fail for > 3-4 MiB
...
See the readme file for more details.
Check the first post for the new file.
http://www.totalcmd.net/files/wcx_RegXtract_1_1.rar
Thanks for the new version!
P.S.: Just a little side note: even though it's 1.1 now, the version number is still left at "1.0.0.0" - if this is not intentional, please increment it next time. Thanks!
Bluestar wrote: http://www.totalcmd.net/files/wcx_RegXtract_1_1.rar...
Thx for the upload!
Bluestar wrote: P.S.: Just a little side note: even though it's 1.1 now, the version number is still left at "1.0.0.0"
Which number do you mean, the embedded Versioninfo resource?
I just checked it again, and it's correctly updated to 1.1.
Alextp wrote: http://www.totalcmd.net/plugring/reg_ext.html
Many thanks for the correct description and for linking the URLs, I really appreciate it!
Last edited by milo1012 on 2013-11-26, 00:36 UTC, edited 1 time in total.
To extract all URLs from a *.txt file
Good afternoon, is it possible to add a function to this useful plugin for extracting all URLs from a *.txt file (plain text)?
Re: To extract all URLs from a *.txt file
trevor12 wrote: Good afternoon, is it possible to add a function to this useful plugin for extracting all URLs from a *.txt file (plain text)?
You just need a fitting Regular Expression for that,
which you can save to a preset for reuse.
There is no need to alter the program for that.
It depends on what format is used in the text file.
Give me an example on how the URLs are distributed in your file(s) and maybe I can provide an Expression.
re
Unfortunately, these plain text files contain almost every type of URL:
for example
http://www.server.com/somescript=someparametres,***text continues here that is not part of the URL's address***, http://somedomain.com(co.uk,..), somedomain.com (biz...), www.somedomain.com (gov..), https://www.google.cz/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&cad=rja&ved=0CFoQFjAC&url=http%3A%2F%2Fwww.something.com%2F&ei=qCOzUpvBNI-ShQfpioDYDg&usg=AF ... Gd0aYjqK7Q,
http://portal..somedomain.cz/phorum/read.php?f=2&i=1283255&t=1283255,
http://forum.something.com/forum.cgi?action=filter&forum=5&filterby=topictitle&word=windows
http://www.youtube.com/watch?v=QH2-TGUlwu4
www.somedomain.com/advice&search=?black+white
http://somedmain.com/Search.aspx?k=url&c=-1&m=-1&ps=100
http://www.somedomain.org/something/notallowed&/page1.h
http://www.mysomedomain.com/?find1=not
http://anything.com/article.aspx?id=123
http://anything/where.somedomain.net/something/index.asp
********
and of course some URLs are ftp://, some URLs have a specific port, for example http://anything.com:8080, and last but not least some "links" are mail addresses: in the format mailto://somebody@somedomain.com or only somebody@somedomain.org etc...
the "links" are separated from the other content of the plain text by many symbols - nothing, dash, colon, semi-colon, space ..
*******
probably it is not a trivial challenge, if it is resolvable at all ...
Re: re
trevor12 wrote: unfortunately, these plain text files contain almost every type of URL
Well, if the URLs aren't separated decently, it really is impossible to extract them all at once with a single Expression.
We must allow certain characters in the URL, see e.g. here.
That also includes the dash, colon, and semicolon.
So a possible Expression could be:
[face=courier](?:http://|https://|ftp://|www\.|mailto:)[!\*'\(\);:@&=\+\$,/\?#\[\]A-Za-z0-9\-_\.~%]+[/face]
(Replacement is just $0)
There will probably be a lot of URLs that won't work with this.
Just try it for yourself and check which URLs don't match.
If you allow the colon and semicolon as separators, you must remove them from the expression,
but this will cut off URL parts beginning at these characters:
[face=courier](?:http://|https://|ftp://|www\.|mailto:)[!\*'\(\)@&=\+\$,/\?#\[\]A-Za-z0-9\-_\.~%]+[/face]
Also, removing the dash is probably not a good idea, since many domains use it.
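To illustrate the first Expression (my own check, not from the original post): applied to the sample line
[face=courier]http://www.youtube.com/watch?v=QH2-TGUlwu4[/face]
it matches the entire URL - the alternative http:// matches the scheme, and every following character is inside the character class - so with the Replacement $0 the whole match is written to the output file.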
Generally, isn't parsing HTML just (that is, definitely and provably) beyond what one can do with regular expressions?
I'm afraid that applies even if they're of the "extended" kind (i.e. more powerful than "regular" in the strict CS sense).
Nevertheless, a plugin that comes close to what is possible at all might still be very useful.