pdf content plugins unreliable

Discuss and announce Total Commander plugins, addons and other useful tools here, both their usage and their development.

Moderators: Hacker, petermad, Stefan2, white

baronPlug
Junior Member
Posts: 10
Joined: 2014-12-28, 13:10 UTC

Post by *baronPlug »

I was looking at the source code of xPDFSearch plugin to familiarize myself with advanced content extraction. I don't know where we are supposed to report bugs with plugins so I'm putting a note here

first of all, the xPDFSearch plugin completely disregards the maxlen parameter passed to ContentGetValue and uses its own MAX_PATH limit instead. It really only works by accident, because TC always uses a 2K read buffer
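
To illustrate what honoring maxlen would mean, here is a minimal sketch. It models the idea in Python rather than in the plugin's actual C code. The helper name copy_field_value and the cp1252 encoding are illustrative assumptions, not part of the plugin interface or of xPDFSearch.

```python
MAX_PATH = 260  # Windows path limit; per the post, the plugin caps output at this


def copy_field_value(text: str, maxlen: int) -> bytes:
    """Return text as a NUL-terminated ANSI string that fits in maxlen bytes.

    Hypothetical helper: it models a bounded copy into the caller's buffer,
    where maxlen is the buffer size supplied by the caller.
    """
    if maxlen <= 0:
        return b""
    # cp1252 is an assumption; it is single-byte, so slicing never splits a character.
    raw = text.encode("cp1252", errors="replace")
    # Reserve one byte for the terminating NUL, then cut to the caller's limit.
    return raw[: maxlen - 1] + b"\0"


extracted = "word " * 1000  # 5000 characters of made-up extracted text
print(len(copy_field_value(extracted, 2048)))      # bounded by the caller's buffer size
print(len(copy_field_value(extracted, MAX_PATH)))  # bounded by a fixed MAX_PATH cap
```

The point is only that the limit comes from the caller: a fixed cap is safe just as long as it happens to be smaller than every buffer the caller passes in.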

moreover trying it on a PDF I see that it doesn't find all the words that are in the file. I have a small PDF to demonstrate the fault but I don't know where to upload it

then I tried the TextSearch plugin which also finds text in PDFs and then I discovered that it also fails to find words -- different words! So to find text in PDFs reliably one has to use both these plugins simultaneously

certainly not a very good picture for the plugins available. What's the most reliable plugin for text extraction?

PS. Note that I'm talking about plain English text search. More complex languages like Greek are out of the question altogether, as there is no Unicode support (FT_FULLTEXTW)
Lefteous
Power Member
Posts: 9536
Joined: 2003-02-09, 01:18 UTC
Location: Germany

Post by *Lefteous »

2baronPlug
baronPlug wrote:I don't know where we are supposed to report bugs with plugins so I'm putting a note here
The plugin includes documentation, which points to a thread in this forum:
http://www.ghisler.ch/board/viewtopic.php?t=7423
baronPlug wrote:this xPDFSearch plugin first of all disregards completely the maxlen parameter used by ContentGetValue and uses its own MAX_PATH constraint.
Thanks for the hint. I will investigate it.
baronPlug wrote:moreover trying it on a PDF I see that it doesn't find all the words that are in the file. I have a small PDF to demonstrate the fault but I don't know where to upload it
Just send me an email or upload it somewhere and link to it here.
baronPlug
Junior Member
Posts: 10
Joined: 2014-12-28, 13:10 UTC

Post by *baronPlug »

you can download it from here (1MB) http://www.4shared.com/archive/P25msUPvce/bugpdf.html

xpdf plugin cannot find the word "equilibree"
the textsearch plugin cannot find the word "congratulations"
if you open the PDF in a viewer you'll see both these words exist
Dalai
Power Member
Posts: 9960
Joined: 2005-01-28, 22:17 UTC
Location: Meiningen (Südthüringen)

Post by *Dalai »

baronPlug wrote:you can download it from here (1MB) http://www.4shared.com/archive/P25msUPvce/bugpdf.html
You should upload to a hoster that allows anonymous download. I used BugMeNot to download the file.
baronPlug wrote:xpdf plugin cannot find the word "equilibree"
That's because the word is not in there, but "équilibrée" is, which xPDFSearch has no problem finding on my system. Did you select the field "Text"?
baronPlug wrote:the textsearch plugin cannot find the word "congratulations"
I can confirm that.

Regards
Dalai
#101164 Personal licence
Ryzen 5 2600, 16 GiB RAM, ASUS Prime X370-A, Win7 x64

Plugins: Services2, Startups, CertificateInfo, SignatureInfo, LineBreakInfo - Download-Mirror
baronPlug
Junior Member
Posts: 10
Joined: 2014-12-28, 13:10 UTC

Post by *baronPlug »

ok, so "équilibrée" not being found must be a problem with my code page. Strange that the textsearch plugin can find it though, but only when entered without the accented é. It seems to discard accents
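
The effect described here, where a search for "equilibree" matches "équilibrée", is what one would see if accents were folded away before comparing. A minimal sketch of that idea follows. The names fold_accents and contains_folded are hypothetical, and this only illustrates the observed behaviour; it is not a description of how TextSearch or xdoc2txt actually work.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks so that an accented letter compares equal to its base letter."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def contains_folded(haystack: str, needle: str) -> bool:
    """Accent-insensitive and case-insensitive substring test (hypothetical helper)."""
    return fold_accents(needle).casefold() in fold_accents(haystack).casefold()


page_text = "Une alimentation équilibrée"  # made-up sample text, not taken from the PDF
print(contains_folded(page_text, "equilibree"))
print(contains_folded(page_text, "équilibrée"))
```

With folding applied to both sides, either spelling matches. A tool that folds only the extracted text would match just the unaccented form, which fits what is reported above.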

anyway all this is quite a headache. There are other file managers that work with desktop search (e.g. xplorer2) that are much simpler to use for search and they deliver the goods. We can only hope one day the Author will support Unicode searches, amen :)
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

baronPlug wrote:the textsearch plugin cannot find the word "congratulations"
This is a known bug and you can fix it quite easily:
Textsearch uses external converter tools to do all its work, which is quite a good thing IMO, since you can update these w/o a recompile of the main plugin file.
For PDF it's the xdoc2txt.exe tool, which does the necessary translation.
But it's quite buggy and an old version ships with textsearch, so you should update it:
http://ebstudio.info/home/xdoc2txt.html
Use the newest MBCS 1.50 (don't use the Unicode version).
Textsearch now finds "congratulations" in your sample PDF.

BTW, you can view the raw output by using

xdoc2txt.exe "yourfile.pdf" > output.txt
baronPlug wrote:ok so the "équilibrée" must be a problem with my code page
What code page does your system have? Probably not 1252, otherwise it would indeed be very strange, because it works fine for me too.
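
One way to see why the code page matters: an ANSI search string can only carry characters that exist in the active code page. The sketch below uses cp1253 (Greek) purely as an example of a non-1252 code page, since the poster's actual code page is not stated. can_encode is a hypothetical helper, and Python's conversion rules may differ from what Windows does for a given program.

```python
def can_encode(word: str, codepage: str) -> bool:
    """Return True if every character of word exists in the given code page."""
    try:
        word.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


word = "équilibrée"
for codepage in ("cp1252", "cp1253"):  # cp1253 is an assumed example, not the poster's setting
    print(codepage, can_encode(word, codepage))

# In a lossy conversion, characters missing from the code page get substituted,
# so the bytes being searched for no longer spell the original word.
print(word.encode("cp1253", errors="replace"))
```

If the search term cannot be represented in the code page, an ANSI-only search cannot match it, whatever the PDF contains.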


2Dalai
Dalai wrote:I used BugMeNot to download the file
Seriously? Bugmenot NEVER had any working logins when I looked for them in the past, and the same happens now: none of them works.
I had to use an alternative to finally get some free login.
I hate 4shared. People should really stop using that site, there are tons of alternative one click hosters w/o a login needed.
TC plugins: PCREsearch and RegXtract
Dalai
Power Member
Posts: 9960
Joined: 2005-01-28, 22:17 UTC
Location: Meiningen (Südthüringen)

Post by *Dalai »

[OT]
milo1012 wrote:2Dalai
Dalai wrote:I used BugMeNot to download the file
Seriously?
Yup, might have been a coincidence, though.
milo1012 wrote:Bugmenot NEVER had any working logins when I looked for them in the past, and the same happens now: none of them works.
The logins sometimes work. However, the ones for 4shared are quite troublesome. I found out that sometimes a login works and sometimes the same one doesn't. They probably have some mechanism to detect logins from different IPs within a short timespan and block them. Nevertheless, BugMeNot is my first choice when looking for a login to a site that I most probably won't visit again in the future.
milo1012 wrote:I hate 4shared. People should really stop using that site, there are tons of alternative one click hosters w/o a login needed.
Agreed.

[/OT]

Regards
Dalai