pdf content plugins unreliable
I was looking at the source code of xPDFSearch plugin to familiarize myself with advanced content extraction. I don't know where we are supposed to report bugs with plugins so I'm putting a note here
this xPDFSearch plugin first of all disregards completely the maxlen parameter used by ContentGetValue and uses its own MAX_PATH constraint. It works by accident really because TC always uses a 2K read buffer
moreover trying it on a PDF I see that it doesn't find all the words that are in the file. I have a small PDF to demonstrate the fault but I don't know where to upload it
then I tried the TextSearch plugin which also finds text in PDFs and then I discovered that it also fails to find words -- different words! So to find text in PDFs reliably one has to use both these plugins simultaneously
certainly not a very good picture for the plugins available. What's the most reliable plugin for text extraction?
PS. note I'm talking about plain English text search. More complex languages like Greek are altogether out of the question as there is no Unicode support (FT_FULLTEXTW)
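To illustrate the maxlen point above, here is a minimal sketch of a bounded copy that respects the buffer size passed by the caller instead of a fixed MAX_PATH. The helper name copy_text_chunk and its offset parameter are hypothetical; this is not code from xPDFSearch or from the content plugin SDK, only the kind of guard a ContentGetValue implementation could apply before writing into the caller's buffer.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration, not plugin code):
   copies the next chunk of extracted text into the caller's buffer and
   never writes more than maxlen bytes, including the terminating NUL.
   Returns the number of text bytes copied; 0 means nothing was copied. */
static int copy_text_chunk(const char *text, size_t offset,
                           char *out, int maxlen)
{
    size_t total, remaining, room;

    if (out == NULL || maxlen <= 0)
        return 0;
    out[0] = '\0';
    if (text == NULL)
        return 0;

    total = strlen(text);
    if (offset >= total)
        return 0;                      /* past the end of the text */

    remaining = total - offset;
    room = (size_t)maxlen - 1;         /* reserve one byte for the NUL */
    if (remaining > room)
        remaining = room;              /* truncate to what the caller allows */

    memcpy(out, text + offset, remaining);
    out[remaining] = '\0';
    return (int)remaining;
}
```

A caller would invoke it repeatedly, advancing offset by the returned count until it returns 0, so the result stays correct whether the host passes a 2K buffer or a smaller one.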
2baronPlug
baronPlug wrote: I don't know where we are supposed to report bugs with plugins so I'm putting a note here
There is documentation included, which points to a certain thread in this forum:
http://www.ghisler.ch/board/viewtopic.php?t=7423
baronPlug wrote: this xPDFSearch plugin first of all disregards completely the maxlen parameter used by ContentGetValue and uses its own MAX_PATH constraint.
Thanks for the hint. I will investigate it.
baronPlug wrote: moreover trying it on a PDF I see that it doesn't find all the words that are in the file. I have a small PDF to demonstrate the fault but I don't know where to upload it
Just send me an email or upload it somewhere and link to it here.
you can download it from here (1MB) http://www.4shared.com/archive/P25msUPvce/bugpdf.html
xpdf plugin cannot find the word "equilibree"
the textsearch plugin cannot find the word "congratulations"
if you open the PDF in a viewer you'll see both these words exist
baronPlug wrote: you can download it from here (1MB) http://www.4shared.com/archive/P25msUPvce/bugpdf.html
You should upload to a hoster that allows anonymous download. I used BugMeNot to download the file.
baronPlug wrote: xpdf plugin cannot find the word "equilibree"
That's because the word is not in there but "équilibrée" is, which xPDFsearch has no problem to find on my system. Did you select the field "Text"?
baronPlug wrote: the textsearch plugin cannot find the word "congratulations"
I can confirm that.
Regards
Dalai
#101164 Personal licence
Ryzen 5 2600, 16 GiB RAM, ASUS Prime X370-A, Win7 x64
Plugins: Services2, Startups, CertificateInfo, SignatureInfo, LineBreakInfo - Download-Mirror
ok so the "équilibrée" must be a problem with my code page. It cannot be found. Strange that the textsearch plugin can do it though - but only when entered without the accented é. It chooses to waste accents it seems
anyway all this is quite a headache. There are other file managers that work with desktop search (e.g. xplorer2) that are much simpler to use for search and they deliver the goods. We can only hope one day the Author will support unicode searches, amen
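A search that matches "equilibree" but not "équilibrée" would be consistent with accents being folded to their base letters before comparison. The following is only a sketch of that idea under the assumption that the text is Windows-1252; the function names are hypothetical and nothing here is taken from TextSearch or xdoc2txt.

```c
#include <stddef.h>

/* Maps a few lowercase accented Windows-1252 letters to their base letter.
   Deliberately incomplete: an illustration, not a full folding table. */
static unsigned char fold_cp1252(unsigned char c)
{
    if (c >= 0xE0 && c <= 0xE5) return 'a';   /* 0xE0..0xE5: a with accents */
    if (c == 0xE7)              return 'c';   /* 0xE7: c with cedilla */
    if (c >= 0xE8 && c <= 0xEB) return 'e';   /* 0xE8..0xEB: e with accents */
    if (c >= 0xEC && c <= 0xEF) return 'i';   /* 0xEC..0xEF: i with accents */
    if (c >= 0xF2 && c <= 0xF6) return 'o';   /* 0xF2..0xF6: o with accents */
    if (c >= 0xF9 && c <= 0xFC) return 'u';   /* 0xF9..0xFC: u with accents */
    return c;                                 /* everything else unchanged */
}

/* Folds a NUL-terminated Windows-1252 string in place. */
static void fold_string_cp1252(char *s)
{
    for (; *s != '\0'; ++s)
        *s = (char)fold_cp1252((unsigned char)*s);
}
```

If the extracted text is folded this way but the search term is not, only the unaccented spelling can match, which is the behaviour described above.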

baronPlug wrote: the textsearch plugin cannot find the word "congratulations"
This is a known bug and you can fix this quite easily:
Textsearch uses external converter tools to do all its work, which is quite a good thing IMO, since you can update these w/o a recompile of the main plugin file.
For PDF it's the xdoc2txt.exe tool, which does the necessary translation.
But it's quite buggy and an old version ships with textsearch, so you should update it:
http://ebstudio.info/home/xdoc2txt.html
use the newest MBCS 1.50 (don't use the Unicode version)
Textsearch now finds "congratulations" in your sample PDF.
BTW, you can view the raw output by using
Code:
xdoc2txt.exe "yourfile.pdf" > output.txt
baronPlug wrote: ok so the "équilibrée" must be a problem with my code page
What code page does your system have? Probably not 1252, otherwise it would be indeed very strange, because it works fine for me too.
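As a small illustration of why the code page matters here: "é" is the single byte 0xE9 in Windows-1252 but the two bytes 0xC3 0xA9 in UTF-8, so a byte-wise search only succeeds when the search term and the extracted text use the same encoding. The sketch below assumes nothing about which encoding either plugin actually emits; the names are made up for the example.

```c
#include <string.h>

/* "équilibrée" as raw bytes. Adjacent string literals are used so that a
   hex escape is not extended by a following hex digit such as 'e'. */
static const char WORD_CP1252[] = "\xE9" "quilibr" "\xE9" "e";
static const char WORD_UTF8[]   = "\xC3\xA9" "quilibr" "\xC3\xA9" "e";

/* Byte-wise search, the way a plain non-Unicode text search compares data. */
static int contains_bytes(const char *haystack, const char *needle)
{
    return strstr(haystack, needle) != NULL;
}
```

Searching Windows-1252 text for WORD_CP1252 succeeds, while searching UTF-8 text for the same term fails even though both show the same word on screen.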
2Dalai
Dalai wrote: I used BugMeNot to download the file
Seriously? Bugmenot NEVER had any working logins when I looked for them in the past, and the same happens now: none of them works.
I had to use an alternative to finally get some free login.
I hate 4shared. People should really stop using that site, there are tons of alternative one click hosters w/o a login needed.
TC plugins: PCREsearch and RegXtract
[OT]
milo1012 wrote: Seriously?
Yup, might have been a coincidence, though.
milo1012 wrote: Bugmenot NEVER had any working logins when I looked for them in the past, and the same happens now: none of them works.
The logins sometimes work. However, the ones to 4shared are quite troublesome. I found out that sometimes the login works and sometimes the same one doesn't. Probably they have some mechanisms to detect logins from different IPs in a small timespan and block them. Nevertheless, BugMeNot is my first choice to look for a login to a site that I most probably won't visit again in the future.
milo1012 wrote: I hate 4shared. People should really stop using that site, there are tons of alternative one click hosters w/o a login needed.
Agreed.
[/OT]
Regards
Dalai