WDX plugin pdfOCR - Show details of PDF files

slavne · Post by *slavne » 2014-12-10, 20:38 UTC

The WDX plugin pdfOCR shows how many pages in a PDF file need OCR processing. With it, you can immediately spot which PDF files cannot be searched for text, either by you or by an indexing system. That is the purpose of the needOCR column.

Next there is the Password column, which shows "YES" if a PDF file is protected with a password. A PDF can also have some rights restricted. In both cases the Password column will state "yes". This tells you which PDF files need their password removed before they can be put to normal use. It is also good to know whether a file is protected before you try to open it for OCR processing.

Finally, the Pages column shows the total number of pages, so you can compare the needOCR count with the total and decide whether OCR processing is worthwhile.
Download: http://www.totalcmd.net/plugring/pdfOCR.html

meepzorp · Post by *meepzorp » 2014-12-20, 04:13 UTC

love it. Thanks!

slavne · Post by *slavne » 2014-12-20, 10:21 UTC

meepzorp wrote: love it. Thanks!

You are welcome, especially as the first user to say thanks.

milo1012 · Post by *milo1012 » 2014-12-20, 15:27 UTC

Well, also thx from me!

But to get things clear:
It just counts pages that don't have text in them?
So even the shortest snippet, like a single letter or word on a page, makes it count as non-OCR?

I just think it would have been better to call such fields something like "non-Picture" or "text-contained pages",
because not every page that doesn't contain text is supposed to need OCR (This page [is] intentionally left blank ...)

2nd question: what PDF engine is used? It seems like CPDF Command Line Tools to me (krk.exe), am I right?
Any chance of linking it statically (since the source is available)?

I also suggest running your analysis procedure in the background (return ft_delayed), because it gets really slow when I use custom columns.
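
For readers who have not met ft_delayed: a rough C++ sketch of the idea follows. It is not the plugin's actual code. The constants are copied from the WDX SDK header contentplug.h as I remember them, so check them against the real header. GetValueSketch, AnalysePdfSlow and the cache are hypothetical names, and the real ContentGetValue export is __stdcall and takes more parameters (field index, unit index, buffer length).

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Values as recalled from contentplug.h (assumption: verify against the SDK).
constexpr int ft_numeric_32 = 1;        // field value is a 32-bit integer
constexpr int ft_delayed = 0;           // "ask me again from a background thread"
constexpr int ft_fileerror = -2;        // file could not be read
constexpr int CONTENT_DELAYIFSLOW = 1;  // TC allows the plugin to postpone slow work

// Hypothetical cache: file name -> number of pages without fonts.
std::unordered_map<std::string, int> g_cache;
std::mutex g_cacheMutex;  // the plugin must be thread-safe to use ft_delayed

// Hypothetical stand-in for the slow PDF analysis; returns -1 on failure.
int AnalysePdfSlow(const std::string& fileName) {
    return fileName.empty() ? -1 : 3;  // dummy result for the sketch
}

// Simplified version of the value callback.
int GetValueSketch(const char* fileName, void* fieldValue, int flags) {
    const std::string key = fileName ? fileName : "";
    {
        // Fast path: the file was analysed before, answer immediately.
        std::lock_guard<std::mutex> lock(g_cacheMutex);
        auto it = g_cache.find(key);
        if (it != g_cache.end()) {
            *static_cast<int*>(fieldValue) = it->second;
            return ft_numeric_32;
        }
    }
    // Slow path: if TC allows it, postpone so the file list stays responsive.
    if (flags & CONTENT_DELAYIFSLOW) return ft_delayed;

    const int count = AnalysePdfSlow(key);
    if (count < 0) return ft_fileerror;
    {
        std::lock_guard<std::mutex> lock(g_cacheMutex);
        g_cache[key] = count;
    }
    *static_cast<int*>(fieldValue) = count;
    return ft_numeric_32;
}
```

The first call with CONTENT_DELAYIFSLOW returns at once; TC is then expected to call again without the flag from a background thread, where the slow analysis can run without blocking the file panel.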

slavne · Post by *slavne » 2014-12-20, 19:02 UTC

milo1012 wrote: Well, also thx from me! But to get things clear:
It just counts pages that don't have text in them?
So even the shortest snippet, like a single letter or word on a page, makes it count as non-OCR?

Thank you, thank you, thank you!

The program counts the number of pages on which no font is detected, so it might be better called "no-font pages". I chose the column names with users in mind, who usually care less about what happens behind the scenes than about the functionality they get, that is, whether they need to do OCR processing. Anyway, users can easily rename any column as they prefer.

I have no secrets from you: yes, I use cpdf temporarily for this beta version. I needed it to prepare a large PDF library that I have collected over years (or rather decades...), and I was a bit surprised how well it served in my case. That work is done now and I am glad.
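
The counting rule described here can be sketched in a few lines of C++. This is only an illustration and not the plugin's source: PageInfo, OcrSummary and Summarise are hypothetical names, and a real plugin would fill the per-page font counts from a PDF parser such as cpdf.

```cpp
#include <vector>

// Hypothetical per-page summary, filled by whatever PDF parser is used.
struct PageInfo {
    int fontCount;  // number of fonts referenced by the page
};

// Values for the Pages and needOCR columns.
struct OcrSummary {
    int pages;    // total number of pages
    int needOcr;  // pages on which no font was detected
};

// A page counts as "needs OCR" exactly when it references no font at all.
OcrSummary Summarise(const std::vector<PageInfo>& pages) {
    OcrSummary s{static_cast<int>(pages.size()), 0};
    for (const PageInfo& p : pages) {
        if (p.fontCount == 0) ++s.needOcr;
    }
    return s;
}
```

Note that under this rule an intentionally blank page is counted too, and a scanned page with a single typed word is not, which is the naming concern raised in the question.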

milo1012 wrote: Any chance of linking it statically (since the source is available)?

Good idea.

milo1012 wrote: I also suggest running your analysis procedure in the background (return ft_delayed), because it gets really slow when I use custom columns.

To be honest, this is my very first plugin ever, and my first C++ program in maybe 15 years; I still have to figure out not only C++ but also how to work with TC plugins. By the way, the plugin can be sped up considerably even without either of the changes above, but I have to find some time to make the next version. People have been downloading this plugin about 30 times per day, and yet the reactions above are the first to come. We shall see how many there will be in the future.

Thanks for the suggestions, they are awesome!

xkxtnt · Post by *xkxtnt » 2017-06-05, 19:31 UTC

Unfortunately it is extremely slow, and it caused TC to become unresponsive.

slavne · Post by *slavne » 2017-06-26, 12:47 UTC

xkxtnt wrote: Unfortunately it is extremely slow, and it caused TC to become unresponsive.

Really sorry about that; I know the problem. The plugin needs to be improved, but I have no time for that. I recommend that somebody make the effort to write a good, improved plugin for a similar purpose.

mgroen · Post by *mgroen » 2021-03-13, 16:49 UTC

slavne wrote: ↑2014-12-10, 20:38 UTC The WDX plugin pdfOCR shows how many pages in a PDF file need OCR processing. With it, you can immediately spot which PDF files cannot be searched for text, either by you or by an indexing system. That is the purpose of the needOCR column.

Next there is the Password column, which shows "YES" if a PDF file is protected with a password. A PDF can also have some rights restricted. In both cases the Password column will state "yes". This tells you which PDF files need their password removed before they can be put to normal use. It is also good to know whether a file is protected before you try to open it for OCR processing.

Finally, the Pages column shows the total number of pages, so you can compare the needOCR count with the total and decide whether OCR processing is worthwhile.
Download: http://www.totalcmd.net/plugring/pdfOCR.html

Download link is dead. Any alternative download link?

Usher · Post by *Usher » 2021-03-13, 20:54 UTC

2mgroen
Try again with HTTPS link: https://www.totalcmd.net/plugring/pdfOCR.html
If you still have problems, wait a day, restart your system to refresh DNS cache and try once again.

Dalai · Post by *Dalai » 2021-03-13, 21:49 UTC

2Usher
totalcmd.net currently points to the wrong IP addresses, on several important (if not all) DNS servers, Quad9, 1.1.1.1 and Google among them. No system reboot, access via HTTPS or DNS cache flush is going to help with this. The only options are to wait and/or to add the correct IP address to the hosts file as pointed out by Flint.

Regards
Dalai

DrShark · Post by *DrShark » 2021-03-14, 18:01 UTC

mgroen wrote: ↑2021-03-13, 16:49 UTC Download link is dead. Any alternative download link?

If you can edit the hosts file, you can regain access to totalcmd.net and wincmd.ru (where the plugins are actually hosted) in your web browser by following the advice in the post https://ghisler.ch/board/viewtopic.php?p=397562#p397562 (add an entry with the same IP address for wincmd.ru to the hosts file too).

Or, temporarily (while the totalcmd.net and wincmd.ru domains are not accessible), you can use their "preview" domains: on totalcmd.net's preview domain, open the plugin page (for this plugin it will be http://xhmhk.hosts.cx/plugring/pdfOCR.html), then copy the download link and change the "xhmhk" part of the domain name to wincmd.ru's "preview" one, "ob9gr".
This way, for the pdfOCR plugin the download link will be http://ob9gr.hosts.cx/download.php?id=pdfOCR. Alternatively, if you can copy the direct link to the file, which is shown in a tooltip over the "Download" link on the plugin page, you can change that link the same way, so for this plugin it will be: http://ob9gr.hosts.cx/files/9924358/wdx_pdfOCR_0.9.rar.
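
If you need to convert several links, the substitution described above is a plain string replacement on the host part. A small C++ sketch, where SwapPreviewHost is a hypothetical helper name and only the first host label after "//" is touched:

```cpp
#include <string>

// Replace the host label `from` with `to`, but only when it appears directly
// after "//" and before the next dot; other URLs are returned unchanged.
std::string SwapPreviewHost(std::string url,
                            const std::string& from,
                            const std::string& to) {
    const std::string needle = "//" + from + ".";
    const std::size_t pos = url.find(needle);
    if (pos != std::string::npos) {
        url.replace(pos + 2, from.size(), to);  // skip the leading "//"
    }
    return url;
}
```

Called with the direct file link on the "xhmhk" preview domain plus "xhmhk" and "ob9gr", it returns the same link on the "ob9gr" domain; a URL on any other host passes through untouched.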

mgroen · Post by *mgroen » 2021-03-21, 13:05 UTC

I waited a couple of days, then downloaded the file wdx_pdfOCR_0.9.rar.

But what now?
I double-clicked the rar file, TC asked me to install the plugin, and I did.

Then I restarted TC
and moved to a folder which contains PDF files,

but no columns like "pages", "Need OCR" etc. are displayed.

I use TC 9.51 64-bit.

Any tips/info on how to proceed?

Dalai · Post by *Dalai » 2021-03-21, 13:11 UTC

2mgroen
Add the custom columns you need: https://www.ghisler.ch/wiki/index.php?title=Custom_columns
They don't magically appear.

Regards
Dalai

mgroen · Post by *mgroen » 2021-03-22, 10:05 UTC

Dalai wrote: ↑2021-03-13, 21:49 UTC 2Usher
totalcmd.net currently points to the wrong IP addresses, on several important (if not all) DNS servers, Quad9, 1.1.1.1 and Google among them. No system reboot, access via HTTPS or DNS cache flush is going to help with this. The only options are to wait and/or to add the correct IP address to the hosts file as pointed out by Flint.

Regards
Dalai

How is it possible that totalcmd.net points to the wrong IP address?

mgroen · Post by *mgroen » 2021-03-22, 10:08 UTC

Dalai wrote: ↑2021-03-21, 13:11 UTC 2mgroen
Add the custom columns you need: https://www.ghisler.ch/wiki/index.php?title=Custom_columns
They don't magically appear.

Regards
Dalai

Wtf? This page is displayed in two languages? All of a sudden English switches to German?
