[WCX] RegXtract - String Extractor with RegEx

milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

[WCX] RegXtract - String Extractor with RegEx

Post by *milo1012 »

RegXtract packer plug-in

This is a Packer plug-in that searches file content with Regular Expressions and extracts the search results to an output file.

So what's the deal?
Knowing that a particular search string exists in a file, and where exactly it can be found, is useful,
but in some situations it is even more useful to save all matching strings so you can work with them.
For example, if you want to extract all URLs from an HTML file (<a href="URL"...),
you can run a Regular Expression search quite easily in many editors and tools, but saving the results is often a pain,
especially if you have a lot of results and/or a lot of files to search.

This plug-in makes that easy.
Just type the Regular Expression and, optionally, a detailed Replacement String, which can reference subgroups from the Expression.
Additionally, there is a full Search and Replace mode, and you can use the plug-in to visualize Regular Expression results.
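
To illustrate the idea outside of Total Commander, here is a minimal Python sketch of "search with a subgroup, write every result to an output". It uses Python's re module rather than PCRE, and the sample HTML and variable names are made up for this example; the plug-in itself is configured through its dialog, not through code.

```python
import re

# Hypothetical sample input; in the plug-in this would be the content of the selected files.
html = (
    '<p><a href="http://example.com/a">A</a> and '
    '<a class="x" href="https://example.org/b?x=1">B</a></p>'
)

# Group 1 captures the URL inside href="...".
pattern = re.compile(r'<a\s[^>]*?href="([^"]*)"', re.IGNORECASE)

# Roughly what a replacement string of "$1" plus a line break would produce.
urls = [m.group(1) for m in pattern.finditer(html)]
output = "\n".join(urls)
print(output)
```

Each result ends up on its own line, which is the kind of output file the description above is talking about.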




Unicode-only plug-in, 32+64 bit versions, source provided.
Based on Perl Compatible Regular Expressions (PCRE) library 8.36.
You can use the full complexity of Expressions valid for the PCRE library, which includes the singleline mode "(?s)",
also called the dotall mode, which Total Commander can't provide.
Check the included HTML file for PCRE Syntax.
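
As a rough illustration of what the singleline/dotall mode changes, here is a small Python sketch. Python's re module is not PCRE, but it supports the same inline "(?s)" flag; the sample text is an assumption made up for this example.

```python
import re

# Hypothetical multi-line input.
text = "<div>\nline one\nline two\n</div>"

# Without (?s) the dot stops at line breaks, so the multi-line block is not matched.
without_dotall = re.search(r"<div>(.*)</div>", text)

# With the inline (?s) flag the dot also matches line breaks.
with_dotall = re.search(r"(?s)<div>(.*)</div>", text)

print(without_dotall)              # None
print(repr(with_dotall.group(1)))  # '\nline one\nline two\n'
```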

Features
- Searching in most Unicode files, not just plain ANSI and binary
- Using any Regular Expression that is valid for the built-in PCRE library, which is in most respects compatible with Perl syntax and features
- Using up to 99 subgroups (for the replacement string): $1 - $99 and $0 (the whole expression),
plus any desired text, single bytes, current result number, filename
- Unicode file names and Unicode Regular Expressions and replacement strings
- Long file names (path > 259 characters) usable
- Search and Replace strings - output file becomes input file with replaced strings/bytes - now also works for Unicode files!
- Suppress duplicate results - only the first occurrence of each individual string is output
- Output the file names and line numbers from which each search result originates (line mode)
- Save and load presets
- "Misuse" the program to merge/combine files (in input sequence) with or without additional replacement
- "Misuse" the program to remove or add the BOM of a UTF-8 file, or to convert UTF-16 or ANSI files to UTF-8
- Visualizing Regular Expression results by typing a test string
- On-the-fly RegEx error check
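
Two of the features above, the $0 - $99 subgroup references and the duplicate suppression, can be sketched in Python as follows. This is only an illustration under assumptions: expand() and extract() are hypothetical helper names, the handling of missing groups (empty string) is a guess, and the plug-in's real replacement parser supports more than this (plain bytes, result number, filename).

```python
import re

def expand(template: str, match: re.Match) -> str:
    """Replace $0..$99 in template with the corresponding group of match."""
    def group_text(ref: re.Match) -> str:
        index = int(ref.group(1))
        # Assumption: groups that do not exist or did not take part expand to "".
        if index > match.re.groups:
            return ""
        return match.group(index) or ""
    return re.sub(r"\$(\d{1,2})", group_text, template)

def extract(pattern: str, template: str, text: str, unique: bool = False) -> list[str]:
    """Collect one expanded result per match, optionally keeping only first occurrences."""
    results = []
    seen = set()
    for m in re.finditer(pattern, text):
        line = expand(template, m)
        if unique and line in seen:
            continue  # duplicate result: skip everything after the first occurrence
        seen.add(line)
        results.append(line)
    return results

text = "alice@example.com, bob@example.org, alice@example.com"
print(extract(r"(\w+)@([\w.]+)", "$2 -> $1", text, unique=True))
```

With unique=False the third address would show up again as a third result.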

Just select the files whose content you want to search,
choose "Pack files...", pick the "RegXtract" extension, and use the "Configure..." dialog for all settings.



New version 1.6!
totalcmd.net (32+64 bit + source)
SHA1: a7db5ce81a735ba5fc69eb6baf27b3603836696a


Old version 1.5:
Download (32+64 bit + source)
SHA1: a726544165a54448892af989e8b0c4d40a4c357c



Old version 1.1:
totalcmd.net
alternate hoster
SHA1: d6658111f2aeca06c1792842254761c38679f63d



Old version 1.0:
RegXtract_1_0.rar
SHA1: 5526000f8b8fa18b713775262dafbaee3b82ad0b
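
The SHA1 values listed above can be compared against a downloaded archive. A minimal Python sketch, assuming the file is available locally; the function names are made up for this example, and any tool that prints a SHA1 hash works just as well.

```python
import hashlib
import io

def sha1_of_stream(stream, chunk_size: int = 65536) -> str:
    """Return the lowercase hex SHA1 of a binary stream, read in chunks."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def sha1_of_file(path: str) -> str:
    """Return the lowercase hex SHA1 of the file at path."""
    with open(path, "rb") as handle:
        return sha1_of_stream(handle)

def matches_published(actual: str, published: str) -> bool:
    """Compare a computed hash with a published one, ignoring case and whitespace."""
    return actual.strip().lower() == published.strip().lower()

# Self-check with the well-known SHA1 of b"abc".
print(sha1_of_stream(io.BytesIO(b"abc")))
```

For a real download you would call sha1_of_file() with the local path of the archive and pass the result, together with the hash from this post, to matches_published().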



Please report bugs and give me some feedback.

Translators welcome!
As of version 1.5 a single .lng file is provided, with each language in its own section named after its language code.
If you think you have a good translation, send me an email via the board and I'll include it in the next version.
Last edited by milo1012 on 2014-11-06, 03:49 UTC, edited 8 times in total.
TW
Senior Member
Posts: 383
Joined: 2005-01-19, 13:35 UTC

Post by *TW »

here's a mirror:

http://xxeccon.xe.ohost.de/RegXtract_1_0.rar

i haven't tried the plugin yet, many thanks!
licenced and happy TC user since 1994 (#11xx)
Alextp
Power Member
Posts: 2321
Joined: 2004-08-16, 22:35 UTC
Location: Russian Federation

Post by *Alextp »

You can upload it at totalcmd.net. It's easy: click the "Submit plugin" link at totalcmd.net and create an account. Then you can upload.
Alextp
Power Member
Posts: 2321
Joined: 2004-08-16, 22:35 UTC
Location: Russian Federation

Post by *Alextp »

milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

Many thx for the mirrors!
Alextp wrote:...create an account. Then u can upload.
Well, it seems that I have to send my complete desired login data by mail.
May I at least change my password after that?
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

New Version 1.1!

Added: Font selection for Regex and Replace/Teststring box
Added: Backward search ("Prev") for Expression test
Added: The last configuration settings are now remembered and saved to the RegXtract.ini file -> reuse last settings when TC restarts
Added: "Delete file if copied" option (after Clipboard copy) -> 5 MiB limit
Added: Horizontal scroll for Replace/Teststring box
...
Changed: Increased height for Replace/Teststring box
...
Fixed: Possible crash in Replace String parser, especially in 64 bit version
Fixed: Copy to Clipboard could randomly fail for > 3-4 MiB
...

See the readme file for more details.

Check the first post for the new file.
Bluestar
Senior Member
Posts: 377
Joined: 2007-06-10, 15:26 UTC
Location: Hungary

Post by *Bluestar »

http://www.totalcmd.net/files/wcx_RegXtract_1_1.rar

Thanks for the new version!

P.S.: Just a little side note: even though it's 1.1 now, the version number is still left at "1.0.0.0". If this is not intentional, please increment it next time :) Thanks!
» Developer of Total Updater & extDir utility.
Alextp
Power Member
Posts: 2321
Joined: 2004-08-16, 22:35 UTC
Location: Russian Federation

Post by *Alextp »

milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

Bluestar wrote:http://www.totalcmd.net/files/wcx_RegXtract_1_1.rar...
P.S.: Just a little side note: even though it's 1.1 now, the version number is still left at "1.0.0.0"
Thx for the upload!
Which number do you mean, the embedded Versioninfo resource?
I just checked it again and it's correctly updated to 1.1.

Many thanks for the correct description and linking the URLs, I really appreciate it!
Last edited by milo1012 on 2013-11-26, 00:36 UTC, edited 1 time in total.
trevor12
Junior Member
Posts: 65
Joined: 2012-12-06, 15:16 UTC
Location: Czech republic

to extract all URL's from a *.txt file

Post by *trevor12 »

Good afternoon, is it possible to add a function to this useful plugin for extracting all URLs from a *.txt file (plain text)?
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Re: to extract all URL's from a *.txt file

Post by *milo1012 »

trevor12 wrote:Good afternoon, is it possible to add a function to this useful plugin for extracting all URLs from a *.txt file (plain text)?
You just need a fitting Regular Expression for that,
which you can save to a preset for reuse.
There is no need to alter the program for that.

It depends on what format is used in the text file.
Give me an example of how the URLs appear in your file(s) and maybe I can provide an Expression.
adoeller
Junior Member
Posts: 82
Joined: 2011-05-23, 09:47 UTC

Post by *adoeller »

Hi,

what a really great plugin. it is a cool idea and saves a lot of time.
Thank you very much.

one glitch maybe. the stats were written when adding a new setting. i have to enable and disable the button, then it works as it should.

Thanks again,
Alex
trevor12
Junior Member
Posts: 65
Joined: 2012-12-06, 15:16 UTC
Location: Czech republic

re

Post by *trevor12 »

Unfortunately, these plain text files contain almost every type of URL:
for example
http://www.server.com/somescript=someparametres,***text that is not part of the URL's address continues here***, http://somedomain.com(co.uk,..), somedomain.com (biz...), www.somedomain.com (gov..), https://www.google.cz/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&cad=rja&ved=0CFoQFjAC&url=http%3A%2F%2Fwww.something.com%2F&ei=qCOzUpvBNI-ShQfpioDYDg&usg=AF ... Gd0aYjqK7Q,
http://portal..somedomain.cz/phorum/read.php?f=2&i=1283255&t=1283255,
http://forum.something.com/forum.cgi?action=filter&forum=5&filterby=topictitle&word=windows
http://www.youtube.com/watch?v=QH2-TGUlwu4
www.somedomain.com/advice&search=?black+white
http://somedmain.com/Search.aspx?k=url&c=-1&m=-1&ps=100
http://www.somedomain.org/something/notallowed&/page1.h
http://www.mysomedomain.com/?find1=not
http://anything.com/article.aspx?id=123
http://anything/where.somedomain.net/something/index.asp
********
And of course some URLs are ftp://, some URLs have a specific port, for example http://anything.com:8080, and last but not least some "links" are mail addresses, in the format mailto://somebody@somedomain.com or only somebody@somedoamin.org etc...

The "links" are separated from the other content of the plain text by many symbols - nothing, dash, colon, semi-colon, space ..

*******
This is probably not a trivial challenge, if it is resolvable at all ...
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Re: re

Post by *milo1012 »

trevor12 wrote:unfortunately, these plain text files contains almost every type of url's
Well, if the URLs aren't separated decently, it really is impossible to extract them all at once with a single Expression.
We must allow certain characters in the URL, see e.g. here.
That also includes the dash, colon, and semicolon.

So a possible Expression could be:

(?:http://|https://|ftp://|www\.|mailto:)[!\*'\(\);:@&=\+\$,/\?#\[\]A-Za-z0-9\-_\.~%]+
(Replacement is just $0)

There will probably be a lot of URLs that this doesn't match correctly.

Just try it for yourself and check which URLs don't match.

If you allow the colon and semi-colon as separators, you must remove them from the expression,
but this will cut off every URL at the first colon or semi-colon, dropping the rest of it:

(?:http://|https://|ftp://|www\.|mailto:)[!\*'\(\)@&=\+\$,/\?#\[\]A-Za-z0-9\-_\.~%]+

Also removing the dash is probably not a good idea, since many domains use them.
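
To see the difference between the two Expressions, here is a small Python sketch. It uses Python's re module instead of PCRE (the two Expressions use nothing that behaves differently there), and the sample line is made up from the example URLs in this thread.

```python
import re

# Expression 1: colon and semi-colon are allowed inside the URL.
with_colon = re.compile(
    r"(?:http://|https://|ftp://|www\.|mailto:)[!\*'\(\);:@&=\+\$,/\?#\[\]A-Za-z0-9\-_\.~%]+"
)

# Expression 2: colon and semi-colon removed, so they act as separators.
without_colon = re.compile(
    r"(?:http://|https://|ftp://|www\.|mailto:)[!\*'\(\)@&=\+\$,/\?#\[\]A-Za-z0-9\-_\.~%]+"
)

# Hypothetical test line.
text = ("see http://anything.com:8080/index.asp and www.somedomain.com/advice "
        "or mailto:somebody@somedomain.com now")

# The group is non-capturing, so findall() returns the whole match ($0).
print(with_colon.findall(text))
print(without_colon.findall(text))
```

The first Expression returns the port and path of the first URL, the second one stops at "http://anything.com" because the colon before the port ends the match.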
meisl
Member
Posts: 171
Joined: 2013-12-17, 15:30 UTC

Post by *meisl »

Generally, isn't parsing HTML just (=definitely, provably) beyond what one can do with regular expressions?
I'm afraid that applies even if they're of the "extended" kind (ie more powerful than "regular" in the strict CS sense).

Nevertheless, a plugin that comes even close to what's possible at all might still be very useful :)