[WCX] RegXtract - String Extractor with RegEx


milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

meisl wrote: Generally, isn't parsing HTML just (=definitely, provably) beyond what one can do with regular expressions?

Sure, but we're not parsing here;
we're just "describing" a grammar for our strings, which is sufficient for most purposes.
Of course, parsing in the literal sense is not the same thing as matching regular expressions.
In a real parser, RegEx is just the lexing step, whose output is used afterwards.

The idea is to just use the RegEx results and save them all at once.
If you have a decent distinction between the different strings, it works.
If not (like in the example from trevor12), you're limited or have to work around it.
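
For anyone who wants to try the "collect all RegEx results at once" idea outside the plugin, here is a minimal Python sketch. It is not how RegXtract is implemented; the function name `extract_all` and the sample text are made up for illustration, and Python's `re` module is not PCRE, so some syntax details may differ.

```python
import re


def extract_all(pattern: str, text: str, group: int = 0) -> list[str]:
    """Return every match of `pattern` in `text` (group 0 is the whole match)."""
    return [m.group(group) for m in re.finditer(pattern, text)]


# Hypothetical sample input; any text with a clear distinction between strings works.
sample = 'id=17 name="alpha"\nid=42 name="beta"\n'

# Collect only the quoted names (capture group 1) and print them in one go.
names = extract_all(r'name="([^"]*)"', sample, group=1)
print("\n".join(names))
```

This only works well when the pattern distinguishes the wanted strings clearly, which is the same limitation as described above.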

Anyway, I don't think the idea as such is new.
You're probably able to do the same with the Unix shell utilities
grep, AWK and/or sed, but probably not very comfortably (it requires scripting).

adoeller wrote: the stats were written when adding a new setting. i have to enable and disable the button, then it works as it should

Well, I'm trying to reproduce it, but it works as expected for me every time.
Can you describe the steps you took before this happens?
meisl
Member
Posts: 171
Joined: 2013-12-17, 15:30 UTC

Post by *meisl »

Yep & Thanks milo1012, just wanted to make sure I'm getting it alright.
milo1012 wrote: You're probably able to do the same with the Unix shell utilities
grep, AWK and/or sed, [...]
- and you can do a lot with these, so still very cool, as said.

Just to mention: I think AWK and sed are both Turing-complete, so in theory you could, among other things, very well do HTML parsing with them.
Anyway, please don't bother answering this :)
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

New Version 1.5!

Added: Two "special" replacement possibilities, which can be referenced anywhere in the replacement string
- The (increasing) number of the current output result, with optionally definable leading zeros
- The filename the results originate from, either with full path, or just the pure filename
Added: Search and replace mode can now work for all file types, binary mode not forced anymore
Added: Character properties can now work in ANSI/OEM files for values >128
Added: Memory buffer for encoding check can now be defined for the session
Added: Dialog: Expression test can now switch and display subgroups besides group 0
Added: Translation support - added "regxtract.lng" file, contains only German for now
...
Improved encoding detection
Improved hash function
...
and more...

See the readme file for more details.
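
The two "special" replacement values from the list above (a running result number with leading zeros, and the source filename) can be sketched in Python roughly like this. The placeholder names `{n}` and `{file}` and the function `format_results` are invented for this illustration; the plugin's real replacement syntax is in the readme, not here.

```python
import os
import re


def format_results(pattern, text, source_path, template, width=3, full_path=False):
    """Expand `template` for every match, filling in a counter and the file name.

    `{n}` and `{file}` are hypothetical placeholders used only in this sketch.
    """
    name = source_path if full_path else os.path.basename(source_path)
    lines = []
    for n, match in enumerate(re.finditer(pattern, text), start=1):
        # Expand group references such as \1 first, then fill the placeholders.
        # (A match that itself contains "{n}" or "{file}" would also be replaced.)
        line = match.expand(template)
        line = line.replace("{n}", str(n).zfill(width)).replace("{file}", name)
        lines.append(line)
    return lines


for line in format_results(r"(\w+)@", "bob@x amy@y", "/data/mail.txt", r"{n}: \1 ({file})"):
    print(line)
```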

Check the first post for the new file.
makinero
Senior Member
Posts: 268
Joined: 2013-10-26, 10:05 UTC

Post by *makinero »

1. Where to unpack the files that are in the zip archive?
2. How to set this up?
Horst.Epp
Power Member
Posts: 6449
Joined: 2003-02-06, 17:36 UTC
Location: Germany

Post by *Horst.Epp »

makinero wrote: 1. Where to unpack the files that are in the zip archive?
2. How to set this up?

If you open this archive in TC, it is automatically installed as a packer plugin!
Please try this simple step and read the readme that comes with it.
Then you can ask for help if you still don't understand.
trevor12
Junior Member
Posts: 65
Joined: 2012-12-06, 15:16 UTC
Location: Czech republic

how to clean (remove) *.txt file from non alphanumeric signs

Post by *trevor12 »

Hello, is it possible to use your plugin (by regex or search->replace) to clean a 0.5 MB *.txt file of mine (plain text, windows-1250 codepage) of non-alphanumeric characters? For my needs (and my Czech language), "alphanumeric" means only a-z, A-Z, č, ď, é, ě, ň, ř, š, ť, ú, ů, ž, Č, Ď, É, Ě, Ň, Ř, Š, Ť, Ú, Ž, space, dot, !, ?, 0-9.

Everything else should be replaced by a space (or by nothing).
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Re: how to clean (remove) *.txt file from non alphanumeric s

Post by *milo1012 »

trevor12 wrote: Hello, is it possible to use your plugin (by regex or search->replace) to clean a 0.5 MB *.txt file of mine (plain text, windows-1250 codepage) of non-alphanumeric characters?
...
Sure, there are several possibilities.
The easiest is probably:

Regular Expression:

[^a-zA-ZčďéěňřšťúůžČĎÉĚŇŘŠŤÚŽ \.!\?0-9]

(Take care when copying the above string: some browsers seem to insert a space character at the end, so remove it if you find one.)

Replace String:

(empty the box) or type a space

Then check
[x] Search and Replace
and maybe set Outfile Extension to "First File".

Now, if your system is set to Czech and Codepage 1250, you can leave the Sys ANSI page (1250) setting as it is.
Otherwise, select the 1250 page manually.

Just remember that the plugin doesn't work "in-place": you'll get a new file with the undesired characters removed.
So you should place the output file in a new dir/folder to avoid overwriting the input file.
TC plugins: PCREsearch and RegXtract
trevor12
Junior Member
Posts: 65
Joined: 2012-12-06, 15:16 UTC
Location: Czech republic

re

Post by *trevor12 »

Thank you, I will try it tomorrow.
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Re: how to clean (remove) *.txt file from non alphanumeric s

Post by *milo1012 »

Sorry, I forgot that the original Expression would also remove all line endings,
which is probably not what you intended, because it would generate a huge single line.

Use this Expression instead:

[^a-zA-ZčďéěňřšťúůžČĎÉĚŇŘŠŤÚŽ \.!\?0-9\r\n]
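
For comparison, the same clean-up can be sketched outside the plugin in Python. This is only an illustration under a few assumptions: the file is Windows-1250 (`cp1250`), the function names `clean_text` and `clean_file` are made up, and, like the plugin, it writes a new file instead of changing the input in place.

```python
import re

# Same character class as the expression above: everything NOT listed is unwanted.
# \r and \n are kept so that line endings survive.
ALLOWED = "a-zA-ZčďéěňřšťúůžČĎÉĚŇŘŠŤÚŽ .!?0-9\r\n"
UNWANTED = re.compile(f"[^{ALLOWED}]")


def clean_text(text: str, replacement: str = " ") -> str:
    """Replace every unwanted character by `replacement` (use "" to drop it)."""
    return UNWANTED.sub(replacement, text)


def clean_file(src: str, dst: str, encoding: str = "cp1250") -> None:
    """Read `src`, clean it, and write the result to a separate file `dst`."""
    # newline="" disables newline translation, so \r\n is read and written unchanged.
    with open(src, encoding=encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding=encoding, newline="") as f:
        f.write(clean_text(text))


print(repr(clean_text("Ahoj, světe!\r\n#1")))
```

Passing `replacement=""` corresponds to emptying the Replace String box; the default corresponds to typing a space.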
trevor12
Junior Member
Posts: 65
Joined: 2012-12-06, 15:16 UTC
Location: Czech republic

re

Post by *trevor12 »

Thank you, it's working perfectly.
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

New Version 1.6!

Added: Search and replace for UTF-16 BE files doesn't convert to UTF-16 LE any more
- now all 4 file types keep their original detected encoding for s&r
Added: Dialog: "Group button" for quickly assigning selected text to a group
Added: Dialog: "Surround button" for surrounding selected text with "placeholders" (.*)
Added: it is now possible to keep the RegXtract.ini file in the same dir as TC's ini file
(switched by setting in pkplugin.ini)
Added: custom font for text boxes is now saved to ini on exit and therefore remembered between TC instances
Improved encoding detection for UTF-16
Changed: code pages for ANSI<->Unicode recoding are now checked for system availability every time the plug-in is loaded
Several internal improvements for a small general speedup on result processing
Updated to PCRE 8.36
...
and more...

See the readme file for more details.

Check the first post for the new file.
trevor12
Junior Member
Posts: 65
Joined: 2012-12-06, 15:16 UTC
Location: Czech republic

clickable urls-linkify

Post by *trevor12 »

I have a txt file that contains many URLs as plain text. Is it possible to use your plugin to turn these URLs into clickable links, that is, to linkify them?
Dalai
Power Member
Posts: 9364
Joined: 2005-01-28, 22:17 UTC
Location: Meiningen (Südthüringen)

Re: clickable urls-linkify

Post by *Dalai »

trevor12 wrote: I have a txt file that contains many URLs as plain text. Is it possible to use your plugin to turn these URLs into clickable links, that is, to linkify them?

Why bother with RegEx? Use a viewer or editor that shows them as clickable links. SynWrite can do this, although it's kind of slow when there are many URLs in a file. Even if you don't want to use such a tool, it's possible to do it without RegEx, since you only have to add

<a href="

before the URL and

">linktext</a>

after it. So a simple loop in your favorite scripting/programming language would do the trick.
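
Such a loop could look roughly like the following Python sketch. It assumes one URL per line (the actual file layout is not known here), uses the URL itself as the link text instead of a fixed "linktext", and the function name `linkify_lines` is made up for this example.

```python
import html


def linkify_lines(lines):
    """Wrap every line that is a URL in an <a> tag; no regular expressions involved."""
    out = []
    for line in lines:
        url = line.strip()
        if url.startswith(("http://", "https://")):
            # Escape &, <, > and quotes so the URL is safe inside HTML.
            safe = html.escape(url, quote=True)
            out.append(f'<a href="{safe}">{safe}</a>')
        else:
            # Anything that is not a URL is passed through as escaped text.
            out.append(html.escape(line.rstrip("\r\n")))
    return out


for item in linkify_lines(["http://a.example/x?y=1&z=2\n", "just a note\n"]):
    print(item)
```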

Regards
Dalai

PS: RegEx is inappropriate to parse HTML anyway. Yes, I know, you have the URLs in plaintext.
#101164 Personal licence
Ryzen 5 2600, 16 GiB RAM, ASUS Prime X370-A, Win7 x64

Plugins: Services2, Startups, CertificateInfo, SignatureInfo, LineBreakInfo - Download-Mirror
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Re: clickable urls-linkify

Post by *milo1012 »

Well, it would be possible to create the basic structure, but you'd need to add the HTML header
and footer once you have the output file.
It depends on how your URLs are separated and on how they begin.
Assuming they all begin with http:// or https:// and are separated by spaces or newlines, it is quite easy:

RegEx:

(https?://.*?)[\s\r\n]

Replace:

<a href="$1">$1</a>

(default options).

Just prepend <html> to the file's beginning
and </html> to the file's end.
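
To try the same idea outside the plugin, here is a small Python sketch. Two things differ from the expression above and are my own choices, not the plugin's behavior: Python writes the group reference as `\1` instead of `$1`, and the pattern uses `\S+` so that the separator after each URL is kept and a URL at the very end of the file (with no whitespace after it) is matched as well. The function name `linkify` is made up.

```python
import re

# Variation of the expression above: a URL is "http(s)://" plus all following
# non-whitespace characters. Trailing punctuation would be captured too.
URL = re.compile(r"(https?://\S+)")


def linkify(text: str) -> str:
    """Turn plain-text URLs into links and wrap the result in <html> ... </html>."""
    body = URL.sub(r'<a href="\1">\1</a>', text)
    return "<html>\n" + body + "\n</html>"


print(linkify("see http://a.example and https://b.example/x"))
```

The sketch does no HTML escaping, so a URL containing `&` or quotes would need extra handling.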

If you tell me how your files are structured (what the URLs look like), I could modify the expression, or you can try it yourself.


@Dalai: thanks a lot for answering the question for me.
I appreciate your quick answers in the forum, but maybe sometimes waiting can be appropriate, right?
Yes, it is possible without RegEx; we all know different tools, but not everybody wants to use them immediately.

Dalai wrote: PS: RegEx is inappropriate to parse HTML anyway. Yes, I know, you have the URLs in plaintext.

What has parsing to do with it? We would do the opposite: assemble a file.
What's wrong with using a string collector with it? We don't need to care about any syntax here.
Dalai
Power Member
Posts: 9364
Joined: 2005-01-28, 22:17 UTC
Location: Meiningen (Südthüringen)

Re: clickable urls-linkify

Post by *Dalai »

milo1012 wrote: What has parsing to do with it? We would do the opposite: assemble a file.
What's wrong with using a string collector with it? We don't need to care about any syntax here.

I just wanted to prevent someone from using an inappropriate tool for a task. Yes, in this special case (putting a file together) it's possible to use RegEx, but in the past couple of months quite a number of people have tried to use RegEx to parse HTML files (in other forums). And maybe trevor12 is about to do something like this, although he/she didn't say so. So it was just a hint in that direction (which is why I put it in a postscript).

BTT: Just a thought: The Search&Replace feature of any editor, e.g. Notepad++, can also be used for this task, with RegEx or without. So, no need for scripting/programming.

Regards
Dalai