[WDX] PCREsearch

milo1012 · Post by *milo1012 » 2013-09-25, 19:10 UTC

New Version 1.0!

- now up to 99 different expressions and wdx-fields usable
- new .ini-config file which now holds all expressions and the new options
- all expressions can now have custom names which are optionally returned as field names
- optional prefix (!) in field names for expression errors and missing expressions
- new memory/size options (limit the file size globally and for custom columns, and set the memory used for file splitting)
- option to load the first expression from "regex1.txt" instead of from the ini (for complicated expressions with line breaks)
- option to either match all or match never in case of a faulty expression
- a log file ("PCREsearch_error.log") with PCRE error messages is now issued in case of incorrect expressions (entries appended) (optional)
- added ContentStopGetValueW (working)
- added long filename support (path >= 259 characters)
- small bug fixes
- slight performance increase for ft_delayed

Check the first post for the new file.

milo1012 · Post by *milo1012 » 2014-01-01, 21:18 UTC

New Version 1.1!

- new function: count strings/RegEx matches for each file + count only individual
- new function: file encoding check
- new field options (flags): use OEM(DOS) code page, disable Unicode properties
- new ini entries for the field type (function)
- new ini entries for field flags
- new option: AnalyzeBuffer
- the default search is now case-insensitive to match TC's behavior
- speedup file analyze and search in small files (using file analyze buffer)
- improved encoding detection - some binary files not detected as Unicode anymore, pure ASCII files treated as ANSI for avoiding potential PCRE crashes
- OnDemandLimit minimum now 1 MiB (for speedup on counting fields)
- MatchAllForErrors (.*) now only true for non-empty files - empty result strings don't count any more
- character properties can now work in ANSI/OEM files for values >128 (e.g. \w for accented characters like German umlauts etc.)
- INI can now be up to 1 MiB in size
- plug-in now also works on Win 98, ME and NT 4.0
- updated to PCRE 8.34
- small optimizations and fixes
- released source
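
To illustrate the counting function from the list above, here is a minimal Python sketch (not the plug-in's actual code). It assumes that "count only individual" means counting distinct matched strings, and that distinct strings are compared exactly even when the search itself is case-insensitive; both are assumptions, as is the helper name `count_matches`.

```python
import re

def count_matches(text, pattern, case_sensitive=False):
    """Return (total matches, number of distinct matched strings).

    Hypothetical helper: the search is case-insensitive by default,
    and empty results are not counted.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    found = [m.group(0) for m in re.finditer(pattern, text, flags)]
    found = [s for s in found if s]  # drop empty result strings
    return len(found), len(set(found))

total, distinct = count_matches("Error error WARN error", r"error|warn")
print(total, distinct)
```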

Check the first post for the new file.

nsp · Post by *nsp » 2014-01-20, 00:29 UTC

Plug-in suggestion!
Could you also return the matched expressions instead of yes or a count?
If I have an expression like:
^First :(.*)$
you should return the lines starting with "First :". You could even add some options like
- first match
- all matches
- and also with replacement, e.g. "found $1" should return "found " and the first group...

This could also be useful for a user column, to extract info from a file on the fly!
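
The three requested modes (first match, all matches, replacement with group references) could look roughly like this in Python. This is only a sketch of the idea, not how the plug-in does it: the helper names, the sample text and the pattern `^First :(.*)$` are assumptions made for illustration, and `$N` is expanded by hand because Python's own templates use `\N`.

```python
import re

def expand_template(match, template):
    """Replace $N in the template with capture group N of the match."""
    return re.sub(r"\$(\d+)",
                  lambda m: match.group(int(m.group(1))) or "",
                  template)

def first_match(text, pattern, template="$0"):
    """Return the expanded template for the first match, or an empty string."""
    m = re.search(pattern, text, re.MULTILINE)
    return expand_template(m, template) if m else ""

def all_matches(text, pattern, template="$0"):
    """Return the expanded template for every match."""
    return [expand_template(m, template)
            for m in re.finditer(pattern, text, re.MULTILINE)]

sample = "First :alpha\nSecond :beta\nFirst :gamma\n"
print(first_match(sample, r"^First :(.*)$", "found $1"))  # found alpha
print(all_matches(sample, r"^First :(.*)$", "found $1"))
```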

milo1012 · Post by *milo1012 » 2014-01-20, 01:24 UTC

nsp wrote: Could you return also matched expressions instead of yes or any count.

Yes, could be done, although you'd need enough space for a column in that case.
I'll see what I can do for the next version.
I just don't want to overdo the configuration file complexity, it's already quite high.

milo1012 · Post by *milo1012 » 2014-06-10, 02:28 UTC

New Version 1.2!

- new function: return the matched strings to Total Commander directly
- new function: return the first matched string only
- new function: assemble the returned string with a custom replacement scheme, which can reference subgroups and single bytes
- new ini entries for the replacement scheme
- fixed potential infinite loop when memory allocation failed for large files
- updated to PCRE 8.35
- small optimizations
- now shipping PCREsearch.Sample.ini so that an update does not overwrite your config; copying it to PCREsearch.ini before using the plug-in is recommended

Check the first post for the new file.

milo1012 · Post by *milo1012 » 2014-08-19, 00:27 UTC

New Version 1.3!

- new major improvement: in-memory result caching
-> greatly improves usability by storing already obtained fields in memory (something that not even TC does!)
-> stores file information to identify fields requested again: file attributes, last file write time, file size
-> old cached values are displaced successively

- new ini entry for cache size: from 8k to 256k entries, or disable caching
- option for result caching: clear complete cache when user presses F2 or Ctrl+R (or corresponding menu command) to refresh any current view
- added option to allow binary zeros (two joined zero bytes) for detecting UTF-16 files
- slightly reduced I/O load by using the encoding check buffer in any case and not reading the file from beginning again (for files > check buffer size)
- \R now matches only CRLF combinations for ANSI/binary files, as it should (prefix the expression with (*BSR_UNICODE) to enable Unicode newlines such as 0x85 again, as for Unicode files)
- fixed rare crash when attempting to analyse files with size between 2 and 4 bytes (ambiguous documentation in UTF-8 CPP)
- minor optimizations
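
The caching idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the plug-in's implementation: the class name `ResultCache`, the key layout and the oldest-first displacement are guesses based on the description above (file attributes, last write time and file size identify an entry; old values are displaced; the whole cache can be cleared on refresh).

```python
from collections import OrderedDict

class ResultCache:
    """Bounded in-memory cache for already computed field values."""

    def __init__(self, max_entries=8192):
        self.max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, path, field, attributes, mtime, size):
        # A changed attribute, write time or size yields a different key,
        # so a modified file never returns a stale value.
        return self._entries.get((path, field, attributes, mtime, size))

    def put(self, path, field, attributes, mtime, size, value):
        key = (path, field, attributes, mtime, size)
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # displace the oldest entry

    def clear(self):
        """Drop everything, e.g. when the user refreshes the view."""
        self._entries.clear()

cache = ResultCache(max_entries=2)
cache.put("a.txt", 1, 32, 1000, 10, "yes")
print(cache.get("a.txt", 1, 32, 1000, 10))  # yes
print(cache.get("a.txt", 1, 32, 2000, 10))  # None (write time changed)
```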

Check the first post for the new file.

milo1012 · Post by *milo1012 » 2015-01-16, 03:21 UTC

New Version 1.5!

- new function: return random strings by providing a regex, which is matched by creating random Unicode characters until the string's length is satisfied
-> is an easy way to define what characters a custom string should consist of
-> uses a WELL (Well equidistributed long-period linear) 512-bit random generator
-> uses a pre-filter to rule out certain non-printable Unicode characters and characters forbidden by Windows/File system
-> currently restricted to Unicode BMP (Basic Multilingual Plane)
-> might be useful for randomizing file names in MRT, or quick random filling fields from different plug-ins in TC (e.g. Comments)

- added possibility to allow empty (void) matches, by using a new field flag (might be useful for certain expressions, like counting lines, where empty lines must be allowed to create a match)

- new field flag for using only the encoding check buffer for search, and doing no further read operation on that file
-> might speed up certain use cases, like quickly reading the header/magic number at the beginning of files, where you don't want to read at least 5 MiB otherwise

- fixed inconsistency with result caching due to multiple TC threads (should now work flawlessly)
- fixed potential endless loops with certain expressions (all possible and valid expressions should now work flawlessly)
- updated to pcre 8.36
- and more...
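
A rough Python sketch of the random string idea: draw random BMP characters, throw away the ones that fail a pre-filter or don't match the regex, and stop when the requested length is reached. Several things here are assumptions for illustration only: the function name, the single-character pattern, the exact pre-filter, and the random generator (Python's `random` is a Mersenne Twister, not WELL512).

```python
import random
import re
import unicodedata

# Characters Windows does not allow in file names.
FORBIDDEN = set('\\/:*?"<>|')

def random_string(char_pattern, length, rng=None, max_tries=1_000_000):
    """Build a string of `length` random BMP characters matching char_pattern."""
    rng = rng or random.Random()
    matcher = re.compile(char_pattern)
    out = []
    for _ in range(max_tries):
        if len(out) == length:
            break
        ch = chr(rng.randrange(0x20, 0x10000))  # BMP only
        # Pre-filter: skip control, format, surrogate, private-use and
        # unassigned code points (Unicode categories starting with "C").
        if ch in FORBIDDEN or unicodedata.category(ch).startswith("C"):
            continue
        if matcher.fullmatch(ch):
            out.append(ch)
    if len(out) < length:
        raise ValueError("pattern matched too few random characters")
    return "".join(out)

print(random_string(r"[a-z0-9]", 12))
```

The `max_tries` guard is there so that a pattern which can never match a single character fails instead of looping forever.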

Check the first post for the new file.

milo1012 · Post by *milo1012 » 2015-02-04, 04:04 UTC

I made some experiments to implement a text/document filter in PCREsearch.
The main idea is to use converters that generate raw text, like the one used in the existing TextSearch plug-in.

Basically only xdoc2txt is still useful, because:

- it was updated constantly during the last years
- it supports Unicode
- it writes to stdout
- according to docs it now officially supports:
  .sxw .sxc .sxi .sxd .odt .ods .odp .odg .docx / .docm .xlsx / .xlsm .pptx / .pptm .doc .xls .ppt .rtf .wri .pdf .mht .html
  plus some other rare formats
- it can now work with IFilter natively, also for non-built-in formats (e.g. the built-in XML and Office IFilters in Win7 work great)

Therefore GetTextIFilter becomes obsolete (I didn't like the .Net runtime dependency anyway),
and OdfToTxt is already replaced by TC's own search ability since 8.50.
If you know any other (free) converter tools that preferably write to stdout, tell me please!

Thus xdoc2txt is the workhorse for now, and it proves to work well and quite quickly.
It seems a bit faster than TextSearch, mainly due to using pipes and not temporary files.
It's probably not a full replacement for TextSearch, since it doesn't support full text search in TC
(I won't implement this until TC gets a Unicode type for full text search),
but you can still do things that neither xPDFSearch nor TextSearch can, like counting (individual) search results in files
and of course full PCRE RegEx search with custom output strings.
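
For readers wondering what "pipes and not temporary files" means in practice, here is a small Python sketch: the converter's stdout is read directly and searched, with no intermediate file. It is only an illustration; the function name is made up, the stand-in command merely prints a line of text (the real xdoc2txt command line is not shown here), and decoding the output as UTF-8 is an assumption.

```python
import re
import subprocess
import sys

def count_in_converter_output(command, pattern):
    """Run a converter that writes text to stdout and count regex hits in it."""
    result = subprocess.run(command, stdout=subprocess.PIPE, check=True)
    text = result.stdout.decode("utf-8", errors="replace")
    return len(re.findall(pattern, text, re.IGNORECASE))

# Stand-in for a real converter: a child process that prints some text.
stand_in = [sys.executable, "-c", "print('Invoice 1, invoice 2, receipt 3')"]
print(count_in_converter_output(stand_in, r"invoice"))  # 2
```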

What I couldn't test yet:

- different IFilters, especially Adobe's official PDF filter and some exotic file formats
- whether IFilters work for both the x64 and the x86 version
- stability of xdoc2txt on slow machines and different OSes

Note: xdoc2txt unfortunately requires the Visual C++ 2008 Redistributable,
but you can bypass this by patching the exe; the necessary DLLs and instructions are included in the package.
(I won't redistribute a hacked file.)
Memory requirements can be quite high for these newer xdoc2txt versions,
for some large documents I'd recommend at least 512 MB.

Note: no updated readme yet, just look at the new ini file for how to configure the filter (xdoc2txt that is).
Just remember to add '16' to your regexXflags entries.
You probably should restrict e.g. search in TC to the file types you defined in the INI if you want to test the filter,
because otherwise all file types not defined to use the converter are searched with the standard (raw) method.

Download

Please give me some feedback in case you tested it.

milo1012 · Post by *milo1012 » 2015-04-02, 07:19 UTC

New Version 2.0!

- new major improvement: a plug-in config tool, where you can set all fields and options comfortably, which may remove the need to modify the ini file by yourself,
- new major improvement: document filter for expanding the otherwise raw file search to most office/text documents,
which should at least support: .pdf .sxw .sxc .sxi .sxd .odt .ods .odp .odg .docx/.docm .xlsx/.xlsm .pptx/.pptm .doc .xls .ppt .rtf .wri .mht .html
- new string replacement: return file offset for the current result
- new string replacement: return the line numbers the current results originate from
- new field type: return average string/result length
- field names can now be sorted alphabetically
- increased maximum field number to 999
- large speedup for random string generation
- many fixes and internal improvements...
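
A short Python sketch of what the offset, line number and average length values could look like for a given text. It is not taken from the plug-in: the function names are invented, and it reports character offsets in a decoded string, whereas the plug-in may well report byte offsets in the file.

```python
import re
from bisect import bisect_right

def describe_matches(text, pattern):
    """Return (offset, line number, matched string) for every match."""
    # Offsets at which each line starts; line numbers are 1-based.
    line_starts = [0] + [m.end() for m in re.finditer(r"\n", text)]
    rows = []
    for m in re.finditer(pattern, text):
        line_no = bisect_right(line_starts, m.start())
        rows.append((m.start(), line_no, m.group(0)))
    return rows

def average_length(rows):
    """Average length of the matched strings, 0.0 if there are none."""
    return sum(len(s) for _, _, s in rows) / len(rows) if rows else 0.0

rows = describe_matches("alpha\nbeta gamma\ndelta", r"[a-z]+ta")
print(rows)                  # [(6, 2, 'beta'), (17, 3, 'delta')]
print(average_length(rows))  # 4.5
```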

Check the first post for the new file.

milo1012 · Post by *milo1012 » 2015-07-30, 03:27 UTC

New Version 2.1!

- new major feature: compare files in TC's 'Synchronize dirs' function, which works with different file encodings and can compare case sensitive or insensitive
- config tool update
- several fixes and improvements
- updated to pcre 8.37
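
Comparing files with different encodings boils down to decoding both sides to text first and comparing afterwards. The Python sketch below shows the principle only; the BOM check, the UTF-8 attempt, the cp1252 fallback and the function names are assumptions and not the plug-in's actual detection logic.

```python
import codecs

BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def decode_text(data, fallback="cp1252"):
    """Decode bytes using a BOM if present, else UTF-8, else an ANSI fallback."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)

def same_text(a, b, case_sensitive=True):
    """Compare two byte strings by their decoded text content."""
    text_a, text_b = decode_text(a), decode_text(b)
    if not case_sensitive:
        text_a, text_b = text_a.casefold(), text_b.casefold()
    return text_a == text_b

utf16_file = codecs.BOM_UTF16_LE + "Hello Wörld".encode("utf-16-le")
ansi_file = "hello wörld".encode("cp1252")
print(same_text(utf16_file, ansi_file, case_sensitive=False))  # True
print(same_text(utf16_file, ansi_file, case_sensitive=True))   # False
```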

Check the first post for the new file.

Skif_off · Post by *Skif_off » 2016-04-29, 00:55 UTC

2milo1012
I read the description and readme, but sorry, I don't understand, so I have a question: is it possible to add a field for filename manipulation? It can be useful for sorting, for example:
"^(\d+\s)(.*)" replace to "$2"
"123338 name1.ext" > > > "name1.ext"
"847585 name2.ext" > > > "name2.ext"

P.S. regexp_wdx x86 only.

milo1012 · Post by *milo1012 » 2016-04-29, 01:34 UTC

Skif_off wrote: is it possible to add field for filenames manipulations?

Currently the plug-in is for file content only. But sure, I can add the possibility to search in the filename for the next version and mimic what regexp_wdx does.
Implementing it will probably be fairly easy, although it would make the plug-in configuration even more complicated (obviously).

BTW, I'm currently working on supporting Oracle OiT Content/Text Access for the plug-in (basically the same "engine" that is used for uLister, so you could share most of the DLL files), and enabling full wdx text search for it, since Christian confirmed Unicode wdx text search for TC 9.
That way we'd have some reliable, fast and stable text content access for all kinds of office/text formats (compared to using external filter programs, like xdoc2txt, which seems to crash for some files).
I'm not sure when I have the time to finish it, but in any case I can't test it until the first beta of TC 9 of course.

nsp · Post by *nsp » 2016-04-29, 08:33 UTC

Skif_off wrote: 2milo1012
I read the description and readme, but sorry, I don't understand, so I have a question: is it possible to add a field for filename manipulation? It can be useful for sorting, for example:
"^(\d+\s)(.*)" replace to "$2"
"123338 name1.ext" > > > "name1.ext"
"847585 name2.ext" > > > "name2.ext"

P.S. regexp_wdx x86 only.

You can already do such filename manipulation with the Multi-Rename Tool, even if the syntax is a bit restrictive in comparison to PCRE. You have to tick RegEx with or

find:^[0-9]+\s(.*)
replace:$1
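
For anyone who wants to try that find/replace pair outside of TC, the same substitution in Python looks like this (Python writes the group reference as `\1` instead of `$1`; the function name is just for illustration):

```python
import re

def strip_numeric_prefix(name):
    """Remove a leading run of digits plus one whitespace character."""
    return re.sub(r"^[0-9]+\s(.*)", r"\1", name)

print(strip_numeric_prefix("123338 name1.ext"))  # name1.ext
print(strip_numeric_prefix("847585 name2.ext"))  # name2.ext
```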

Skif_off · Post by *Skif_off » 2016-04-29, 10:39 UTC

2milo1012
Thanks, I'll wait

About Oracle: the Search Export product only? It's 80-90 MB of libs for x86+x64.

Can we hope that the plug-in's compatibility with WinXP will remain in the new version(s) too?

2nsp
Yes, but I don't want to rename, I want the column in custom columns view and sorting files by part of the filename or by the modified filename without renaming (see description and/or try regexp_wdx).
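
What is being asked for here, sorting by a modified filename without touching the files, is essentially a sort key. A minimal Python sketch of the idea, assuming the regex from the earlier post and a made-up list of names:

```python
import re

def sort_key(name):
    """Use the part after the numeric prefix as the sort key, if there is one."""
    m = re.match(r"^(\d+\s)(.*)", name)
    return m.group(2) if m else name

names = ["847585 name1.ext", "123338 name2.ext", "name0.ext"]
# The files keep their names; only the order is derived from the key.
print(sorted(names, key=sort_key))
```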

Horst.Epp · Post by *Horst.Epp » 2016-04-29, 11:26 UTC

Skif_off wrote: 2milo1012
Thanks, I'll wait
About Oracle: the Search Export product only? It's 80-90 MB of libs for x86+x64.
Can we hope that the plug-in's compatibility with WinXP will remain in the new version(s) too?

What is the problem with such a size on today's machines?
Also, Win XP compatibility should no longer be a major goal, as it restricts development and delays the death of this OS.

New Beta with text filter support