Search word, but "within x words of" range only?

johnstonf
Junior Member
Posts: 76
Joined: 2004-04-10, 20:30 UTC

Post by *johnstonf »

Searching "within x words of"

Is there a way to specify, in a search of a text file, something like "I'm looking for the word ORANGE within 100 words of APPLE" (just an example)?

I remember that Novell had such an option when searching, which was GREAT!

If not, do you know of any products that WOULD allow this?

(I need to search our huge MDaemon logs for a specific email address and a specific subject, but there are thousands and thousands to search through, and it's driving me nuts.)
ZoSTeR
Power Member
Posts: 1049
Joined: 2004-07-29, 11:00 UTC

Post by *ZoSTeR »

You could use TextCrawler with this regular expression:

Code:

\b(?:word1\W+(?:\w+\W+){0,5}?word2|word2\W+(?:\w+\W+){0,5}?word1)\b
It searches for "word1" near "word2" within a range of 0 to 5 words.
The regex library in TC is rather limited for cases like this. It has no non-capturing groups and a scope of only one line.
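
To see what each part of that expression does, it can help to write it out with comments. The following is only a sketch using Python's `re` module (an assumption for illustration; neither TextCrawler nor TC is involved), with `word1`/`word2` kept as placeholders and the helper name `is_near` made up for the example:

```python
import re

# The same expression as above, spread out with re.VERBOSE so that each
# part can carry a comment. Whitespace inside the pattern is ignored.
PATTERN = re.compile(
    r"""
    \b                        # start on a word boundary
    (?:                       # non-capturing group holding both orders
        word1 \W+             # word1, then at least one non-word character
        (?:\w+\W+){0,5}?      # 0 to 5 words in between, as few as possible
        word2
      |
        word2 \W+             # the same thing in the opposite order
        (?:\w+\W+){0,5}?
        word1
    )
    \b                        # end on a word boundary
    """,
    re.VERBOSE,
)


def is_near(text):
    """Return True if word1 and word2 occur with at most 5 words between them."""
    return PATTERN.search(text) is not None
```

Either order is accepted, so `is_near("word2, then word1")` is true, while a text with six words between the two placeholders is rejected.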
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

ZoSTeR wrote:The regex library in TC is rather limited for cases like this. It has no non-capturing groups and a scope of only one line.
That's the main reason why I wrote PCREsearch.
That expression also works here, so there's no need to use things like TextCrawler.

Just modify the INI file, e.g.:

Code:

regex1=\b(?:word1\W+(?:\w+\W+){0,5}?word2|word2\W+(?:\w+\W+){0,5}?word1)\b
It might be a bit tedious to modify the INI every time, but for frequently (re-)used expressions it's fine, and you can have multiple fields.
TC plugins: PCREsearch and RegXtract
ZoSTeR
Power Member
Posts: 1049
Joined: 2004-07-29, 11:00 UTC

Post by *ZoSTeR »

Yes, I nearly forgot about PCREsearch. It's great if you want to find and handle the files that contain a specific pattern.

TextCrawler has the advantage of displaying all the matching text plus its context on the fly. Since the OP has to look at log files, my guess is that this is important.

Dunno if there's any way to combine "feed to listbox" with quick-view/Lister to display the matching text. I guess one could build a nice summary with RegXtract; it depends on what the final result or workflow is supposed to be.
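
For a rough idea of what "matching text plus its context" could look like outside of those tools, here is a minimal Python sketch. It is not how TextCrawler or RegXtract work internally; the function name `find_with_context` and the `context` parameter are invented for the example, and the pattern is simply the placeholder expression from earlier in the thread:

```python
import re

# Placeholder expression from earlier in the thread (word1 near word2).
NEAR = re.compile(
    r"\b(?:word1\W+(?:\w+\W+){0,5}?word2|word2\W+(?:\w+\W+){0,5}?word1)\b"
)


def find_with_context(text, pattern=NEAR, context=40):
    """Return every match together with `context` characters on each side."""
    snippets = []
    for match in pattern.finditer(text):
        # Clamp the window so it never runs past either end of the text.
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        snippets.append(text[start:end])
    return snippets
```

Reading a whole log into one string and passing it to `find_with_context` would give a list of snippets to skim, which may or may not scale to very large logs.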
johnstonf
Junior Member
Posts: 76
Joined: 2004-04-10, 20:30 UTC

Post by *johnstonf »

I'm not great with RegEx... Will this find the words even if they are on separate lines (e.g. within 20 words, but that may be 3 lines down)? THANKS

PS what's a good way to get up to speed on RegEx?

milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

johnstonf wrote:will this find even if they are on separate lines (eg... within 20 words, but that may be 3 lines down).
Yes, and to honor your example:

Code:

\b(?:APPLE\W+(?:\w+\W+){0,5}?ORANGE|ORANGE\W+(?:\w+\W+){0,5}?APPLE)\b
will find

Code:

BANANA BANANA BANANA BANANA ORANGE BANANA

BANANA
BANANA BANANA

BANANA APPLE BANANA BANANA
because there are only five "words" between them (which is allowed), but not this:

Code:

BANANA BANANA BANANA BANANA ORANGE BANANA

BANANA
BANANA BANANA
BANANA
BANANA APPLE BANANA BANANA
(six is too many, so it won't match)
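
The two samples can be checked with a few lines of Python. This is only a sketch: it uses Python's `re` module rather than the PCRE library the plugin is built on, which should behave the same for this particular expression, and the names `FIVE_BETWEEN`, `SIX_BETWEEN` and `words_between` are made up here. No multi-line flag is needed, because a line break is just another non-word character for `\W+`.

```python
import re

PATTERN = re.compile(
    r"\b(?:APPLE\W+(?:\w+\W+){0,5}?ORANGE|ORANGE\W+(?:\w+\W+){0,5}?APPLE)\b"
)

# First sample from the post: five words between ORANGE and APPLE.
FIVE_BETWEEN = (
    "BANANA BANANA BANANA BANANA ORANGE BANANA\n"
    "\n"
    "BANANA\n"
    "BANANA BANANA\n"
    "\n"
    "BANANA APPLE BANANA BANANA"
)

# Second sample from the post: six words between ORANGE and APPLE.
SIX_BETWEEN = (
    "BANANA BANANA BANANA BANANA ORANGE BANANA\n"
    "\n"
    "BANANA\n"
    "BANANA BANANA\n"
    "BANANA\n"
    "BANANA APPLE BANANA BANANA"
)


def words_between(text):
    """Return the words between the two hits, or None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # The first and last word of the match are ORANGE/APPLE themselves.
    return re.findall(r"\w+", match.group(0))[1:-1]
```

`words_between(FIVE_BETWEEN)` yields the five BANANAs across the line breaks, and `words_between(SIX_BETWEEN)` returns `None`.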

So all you need to do is replace the upper limit in both curly-bracket quantifiers with the distance you want,
and ORANGE/APPLE with the actual words you're looking for.
Take care if these words/strings contain RegEx syntax characters; you'd need to escape them.
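
Those three steps (set the distance, swap in the search terms, escape them) can be sketched as a small Python helper. `proximity_pattern` is a hypothetical name, `re.escape` is Python's own escaping function rather than anything provided by the plugin, and the address and subject word in the usage note are invented:

```python
import re


def proximity_pattern(first, second, max_gap):
    """Build the 'first near second' expression for a given word distance."""
    # Escape both terms so characters such as "." or "+" are taken literally.
    a = re.escape(first)
    b = re.escape(second)
    # Separator after the first hit, then up to max_gap intervening words.
    gap = r"\W+(?:\w+\W+){0," + str(max_gap) + r"}?"
    return r"\b(?:" + a + gap + b + "|" + b + gap + a + r")\b"
```

For example, `proximity_pattern("jane.doe@example.com", "invoice", 20)` produces an expression in which the dots of the address only match literal dots. Two caveats, both following from how `\w`, `\W` and `\b` work: an email address sitting in the gap counts as several "words", since `.` and `@` are non-word characters, and a search term that begins or ends with a non-word character may not sit well with the surrounding `\b` anchors.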

Now, in PCREsearch.Sample.ini (or in a new PCREsearch.ini file), use entries like these:

Code:

[PCREsearch]
regex1=\b(?:APPLE\W+(?:\w+\W+){0,5}?ORANGE|ORANGE\W+(?:\w+\W+){0,5}?APPLE)\b
regex1name=ORANGE and APPLE near each other (5)
regex1type=0
This will only work for files containing pure text; you can't search in Office files, PDFs and similar (yet).

To output the resulting string (for custom columns), use this:

Code:

regex1type=3
(limited to 1022 characters)
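
If the INI has to be changed often, the entries could be generated by a script instead of typed by hand. The following Python sketch uses the standard `configparser` module; the key names and the type values 0 and 3 are taken from the entries above, while the function name `build_ini_section` is made up, and whether the plugin accepts a file written this way is an assumption that would need checking:

```python
import configparser
import io


def build_ini_section(pattern, name, result_type=0):
    """Return the text of a [PCREsearch] section with one regex field."""
    # interpolation=None so a "%" in an expression is written unchanged.
    config = configparser.ConfigParser(interpolation=None)
    config["PCREsearch"] = {
        "regex1": pattern,
        "regex1name": name,
        # 0 as in the example above; 3 to output the resulting string.
        "regex1type": str(result_type),
    }
    buffer = io.StringIO()
    # Write "key=value" without spaces, matching the entries shown above.
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()
```

The returned text could then be saved as PCREsearch.ini, for example with `pathlib.Path("PCREsearch.ini").write_text(...)`.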

johnstonf wrote:PS what's a good way to get up to speed on RegEx?
The TC help has a section on RegEx, which describes the basics quite well, including how to escape characters,
but for advanced expressions (like the one above) you probably want to take your time and read some literature
(ZoSTeR's 2nd link points to such a book)
or try some sites or programs, like regular-expressions.info or RegexBuddy (but I wouldn't recommend it for beginners).
I can also recommend my RegXtract plugin for testing the expression; it also has a syntax summary.
johnstonf
Junior Member
Posts: 76
Joined: 2004-04-10, 20:30 UTC

Post by *johnstonf »

Thanks so much...
johnstonf
Junior Member
Posts: 76
Joined: 2004-04-10, 20:30 UTC

Post by *johnstonf »

I made a YouTube video showing how to quickly get this installed into TC. See it at http://youtu.be/ohfcQAOy3ZU. I hope it helps others get this nice plugin into their lives quickly and easily.
