Text search with RegEx: Diff. results in TC and in plugins

Bug reports will be moved here when the described bug has been fixed


Peter
Power Member
Posts: 2064
Joined: 2003-11-13, 13:40 UTC
Location: Schweiz


Post by *Peter »

In the German thread "Datenstruktur unter einer bestimmten Ebene löschen" ("Delete data structure beneath a defined level")
http://ghisler.ch/board/viewtopic.php?t=21273
I noticed that searching for text with a regex gives different results in TC and in content plugins:
http://ghisler.ch/board/viewtopic.php?t=21273&start=16

I'm searching for text files that contain a line with the string ".....started (or: gestarted) ....2004..." (or 2005, 2006, or 2007).
I use the expression

Code:

.*g*?e*?started.*\.200[4567].*$
which works fine in the "TC original search".

But using this regex in plugins like "filecontent" or "filedescription" gives wrong results. It seems that the "$" sign (for CR/LF; end of line) is ignored, because the search also finds text blocks with "started" in one line and "200*" in one of the following lines ...

Peter
TC 10.xx / #266191
Win 10 x64
ghisler(Author)
Site Admin
Posts: 48021
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

Can someone test this with TC 7.5 beta, please?
Author of Total Commander
https://www.ghisler.com

Post by *Peter »

I'm sorry, but I still use 7.04 at the company and therefore cannot test it with the beta.

Peter
Flint
Power Member
Posts: 3487
Joined: 2003-10-27, 09:25 UTC
Location: Antalya, Turkey

Post by *Flint »

I do not completely follow the description of the bug in the first post. The "$" sign is located at the very end of the regular expression, so how could it affect TC's behaviour when "started" and "2004" are on different lines? The "$" sign is not between them!

Anyway, I performed some tests and found that there is indeed a bug, though a bit different from what is described. When TC searches file contents on its own, it does not allow the "." meta-character to match newlines, but when a plugin is used, newlines are matched.

So, here is the complete description of how to reproduce.
1. Create two files with the following contents:
1.txt:
.....started in
....2004...
2.txt:
.....started in ....2004...
2. Open the TC search dialog, check Find text, check RegEx (2), and enter the following regexp:

Code:

started.*\.200[4567]
and start the search -> only the file 2.txt is found, i.e. the "." did not match newlines.
3. Uncheck Find text, go to the Plugins tab, specify

Code:

filedesc - Description - regex - started.*\.200[4567]
and start the search -> both files 1.txt and 2.txt are found, i.e. "." matches newlines.
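
The difference between the two searches can be imitated outside TC. The following is only a sketch in Python, whose `re` module is a stand-in and not the regex engine TC or its plugins use; the flag `re.DOTALL` plays the role of "dot matches newlines", and the names `FILE_1`, `FILE_2` and `found` are made up for the illustration.

```python
import re

# Contents of the two test files from the steps above.
FILE_1 = ".....started in\n....2004..."   # "started" and "2004" on different lines
FILE_2 = ".....started in ....2004..."    # both on one line

PATTERN = r"started.*\.200[4567]"

def found(text, dot_matches_newline):
    """Return True if PATTERN matches anywhere in text."""
    flags = re.DOTALL if dot_matches_newline else 0
    return re.search(PATTERN, text, flags) is not None

# Dot does not match newlines: only the content of 2.txt matches.
line_mode = [found(FILE_1, False), found(FILE_2, False)]
# Dot matches newlines: both contents match.
block_mode = [found(FILE_1, True), found(FILE_2, True)]
print(line_mode, block_mode)
```

With the dot stopping at newlines the result mirrors step 2, and with the dot crossing newlines it mirrors step 3.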

This happens on both 7.04a and 7.50pb2. Sorry if I misunderstood the topic starter's bug and checked something different. In that case, please split this post into a new thread as a new bug report.
Flint's Homepage: Full TC Russification Package, VirtualDisk, NTFS Links, NoClose Replacer, and other stuff!
 
Using TC 10.52 / Win10 x64

Post by *ghisler(Author) »

2Flint
The $ means end of LINE, not end of text. TC should find the end of a line in a text file. In TC 7, the plugins search didn't break the returned data into lines, TC 7.5 should do that now.

Try e.g. the filecontent plugin.

Post by *Flint »

ghisler(Author) wrote: The $ means end of LINE, not end of text.
Yes, I know that. But I still don't understand how to test what the topic starter reported. Here is his regular expression:

Code:

.*g*?e*?started.*\.200[4567].*$
Note that there is only one "$" sign and it is located at the very end of the expression. Then, he wrote:
Peter wrote: it seems that the "$" sign <…> also finds text blocks with "started" in one line and "200*" in one of the following lines ...
But "started" and "200*" are not separated by the "$" in the quoted regexp, so what does the "$" have to do with the fact that "started" is in one line and "200*" in the next line?

Post by *Peter »

Hello Flint

Thanks for testing this issue. If you know German, you can read the original thread here:
http://ghisler.ch/board/viewtopic.php?t=21273

There I wrote:
Ich habe Textdateien, in denen steht
Gestarted: bla bla 23.3.2004 bla bla
oder
Gestarted: 15.12.2006
oder
Started: bla bla 01.01.2007

Aufgabe:
Suche alle Textdateien, die einen Satz enthalten, der zuerst "started" und dann 2004 oder 2005 oder 2006 oder 2007 enthält.
Translated:
I have textfiles with content

Gestarted: bla bla 23.3.2004 bla bla

or

Gestarted: 15.12.2006

or

Started: bla bla 01.01.2007

Task:
Search for all files with a line that contains the string "started" and then the string "2004" or "2005" ...
The problem I had was that it sometimes found strings spread over more than a single line, like
started 2003
erased 2005

Started
Files: 2006


And that is the reason why I want to search only in "one line".
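
One way to make the "one line only" requirement explicit is to split the text into lines first and test each line separately. The sketch below does this in Python rather than in TC; the case-insensitive flag is an assumption (so that "Started" and "Gestarted" both count), and `has_matching_line` is a made-up helper name.

```python
import re

# "started" followed later on the SAME line by one of the years 2004-2007.
# re.IGNORECASE is an assumption so that "Started:" is accepted as well.
LINE_PATTERN = re.compile(r"started.*200[4567]", re.IGNORECASE)

def has_matching_line(text):
    """True if any single line contains "started" and then a wanted year."""
    return any(LINE_PATTERN.search(line) for line in text.splitlines())

samples = [
    "Gestarted: bla bla 23.3.2004 bla bla",  # wanted
    "Started: bla bla 01.01.2007",           # wanted
    "started 2003\nerased 2005",             # year is on another line
    "Started\nFiles: 2006",                  # year is on another line
]
results = [has_matching_line(s) for s in samples]
print(results)
```

Because every line is tested on its own, a "started" in one line can never be combined with a year from a following line, whatever the dot is allowed to match.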

HTH :?:

Peter

Post by *ghisler(Author) »

So can anyone test this with TC 7.5 beta, please?
Postkutscher
Power Member
Posts: 556
Joined: 2006-04-01, 00:11 UTC

Post by *Postkutscher »

Flint wrote: 1. Create two files with the following contents:
1.txt:
.....started in
....2004...
2.txt:
.....started in ....2004...
2. Open the TC search dialog, check Find text, check RegEx (2), and enter the following regexp:

Code:

started.*\.200[4567]
and start the search -> only the file 2.txt is found, i.e. the "." did not match newlines.
3. Uncheck Find text, go to the Plugins tab, specify

Code:

filedesc - Description - regex - started.*\.200[4567]
and start the search -> both files 1.txt and 2.txt are found.
I get the same result with the regexp started.*\.200[4567].*$ and the same test data as Flint's on TC 7.5 PB2.

Post by *Flint »

Peter wrote:The problem I had was that sometimes it found strings in more than a single line like
started 2003
erased 2005

Started
Files: 2006


And that is the reason why I want to search only in "one line".
So, do you agree that your problem does not have anything to do with the "$" sign? In your regexp it is located at the very end and cannot affect parsing of newlines between "started" and "200x" substrings.

Post by *Peter »

Flint wrote:So, do you agree that your problem does not have anything to do with the "$" sign? In your regexp it is located at the very end and cannot affect parsing of newlines between "started" and "200x" substrings.
I don't know - I'm not a regexp specialist. In my opinion (better: the opinion of the user who made the expression and maybe the opinion of Christian Ghisler) the "$" should define the "end-of-line".

If it is not the "end-of-line" - what else is it?

Peter

Post by *Flint »

Peter wrote: If it is not the "end-of-line" - what else is it?
It is, but what would make it the only (or the first) end of line, as you seemed to expect? The end-of-line marker says that an end of line must be located where the "$" sign is, but it does not prohibit ends of lines in other places.

Post by *Peter »

Flint wrote:...It is, but what would it make the only (or the first) end of line, as you seemed to expect? ...
Maybe Christian Ghisler can explain whether it is the first end-of-line after the found text - or some end-of-line somewhere?

Peter

Post by *Flint »

Peter wrote: Maybe Christian Ghisler can explain if it is the first end-of-line after the found text - or some end-of-line somewhere?
Not somewhere! Exactly at the place where the "$" is located, but not only at this place. Actually, it depends on how the program treats the "." meta-character. For example, EmEditor even has a separate option for whether the dot is allowed to match newlines or not.

Let's suppose it is. Then the expression

Code:

started.*2005.*$
would match each of the following text samples:
Sample 1:
started in 2005
Sample 2:
started
in 2005
Sample 3:
started in 2005 and
finished elsewhen
Sample 4:
started in
2005
and
finished
elsewhen
because the sub-expression ".*" would eat as much as possible, including newlines, if necessary, and the "$" sign would only match the very end of the text, no matter how many lines it took before the end was reached: all these lines have already been matched (eaten) by the ".*" sub-expression.

A completely different case is when dot is not allowed to match newlines. In this case your original expression (and my example above, as well), indeed, would mean that the "$" marks the position of the first newline. BUT! It's not because it is the nature of the "$" meta-character, but simply because you did not include any newline in the previous sub-expression.

So, your phrase from the very first post
it seem that the sign" $" (for CR/LF; end of line) is ignored
is not correct: the "$" sign is not ignored, it is matched and matched absolutely correctly, at the end of line! What was wrong was dot matching newlines in one case and not matching in another. That was what I found in my experiment, and that, I'm sure, should be the correct description of the bug: not the "$" sign but the "." sign.
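
To see where the "$" actually ends up in both cases, here is a small sketch using Python's `re` module as a stand-in (it is not the engine TC uses). `re.MULTILINE` makes "$" mean end of line, `re.DOTALL` lets the dot match newlines, and `match_end` is a made-up helper that reports the offset at which the match stops.

```python
import re

PATTERN = r"started.*2005.*$"

SAMPLES = [
    "started in 2005",
    "started\nin 2005",
    "started in 2005 and\nfinished elsewhen",
    "started in\n2005\nand\nfinished\nelsewhen",
]

def match_end(text, dot_matches_newline):
    """Offset where the match ends, or None if the pattern does not match."""
    flags = re.MULTILINE | (re.DOTALL if dot_matches_newline else 0)
    m = re.search(PATTERN, text, flags)
    return m.end() if m else None

# Dot matches newlines: every sample matches and "$" lands at the end of the text.
greedy = [match_end(s, True) == len(s) for s in SAMPLES]
# Dot stops at newlines: only samples with both words on one line match,
# and "$" lands at the first line end after "2005".
per_line = [match_end(s, False) for s in SAMPLES]
print(greedy, per_line)
```

In both runs the "$" is honoured. What changes is only how far the preceding ".*" is allowed to reach before the "$" is tested.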

Post by *ghisler(Author) »

Indeed the $ sign should match the end of line, but the end of the file is also considered an end of line.

In TC 7.04a, each data block delivered by plugins was considered as one "line", which was of course wrong - so a word which happened to be at the end of the data block was considered as being at the end of the line, although it wasn't. Also line breaks within the file would not be found.

An example:
1. Select the file size!.txt
2. Open the search dialog, plugins page,
3. Enter this search option:
Plugin "FileContent" -> "Full Text" -> "regex" -> "they$"

This should find the string "they" at the end of line 6 in SIZE!.TXT. TC 7.04a will not find it, but TC 7.50 will.
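
The effect of treating a whole data block as one "line" can be sketched as follows. This is Python, not TC code, and the sample text is invented for the illustration; it is not the real content of size!.txt. The two helper functions are hypothetical names for the old and the new behaviour as described above.

```python
import re

# Invented sample block; NOT the real content of size!.txt.
BLOCK = "first line ends with they\nsecond line ends elsewhere"

def found_as_one_line(block, pattern):
    """Treat the whole block as a single line: "$" only matches at the block end."""
    return re.search(pattern, block) is not None

def found_per_line(block, pattern):
    """Break the block into lines first, then test each line separately."""
    return any(re.search(pattern, line) for line in block.splitlines())

one_line = found_as_one_line(BLOCK, r"they$")
per_line = found_per_line(BLOCK, r"they$")
print(one_line, per_line)
```

In the single-block case the "they" in front of the inner line break is missed, while the per-line variant finds it.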