distinction between searchable/non searchable PDF files

mgroen · Post by *mgroen » 2021-03-02, 14:57 UTC

Is TotalCommander able to set a marker so that the user can distinguish searchable from non-searchable PDF files?
For example, by creating a column "searchable" and putting an S (or any marker) on each line for searchable PDF files?
(and not setting a marker for non-searchable ones)

Explanation:
I have lots of PDF files; some of them are searchable, some are not.
I want to get an overview of searchable/non-searchable files without opening each file manually to check whether it is searchable.

The main goal behind this question is that I want to make all of my PDF files searchable, but to do that I first need an overview of which PDF files are already searchable and which are not.

Is TotalCommander able to help me?

Thanks,
Mathijs

gdpr deleted 6 · Post by *gdpr deleted 6 » 2021-03-02, 15:21 UTC

I don't know exactly what you mean by "searchable/non searchable". Are you referring to PDFs that contain text as text, versus PDFs that only contain images of scanned pages but no text itself?

I don't know, and I don't want to speculate any further. However, whispers in the dark (aka the forum search) tell me of two content plug-ins that provide different kinds of information about PDF files. Some of the information they provide might be helpful to you.

xPDFSearch:
http://lefteous.totalcmd.net/tc/xpdfsearch_eng.htm
viewtopic.php?t=7423

pdfOCR:
viewtopic.php?t=41504

Horst.Epp · Post by *Horst.Epp » 2021-03-04, 08:51 UTC

User mgroen also asked for such a function in the XYplorer forum.
User highend provided a script which does it perfectly for me in XYplorer.

Code:

$tool = "C:\Tools\xpdf-tools\pdftotext.exe";
    $output = trim(runret("""$tool"" -simple -nopgbrk ""<cc_item>"" -", %TEMP%, 65001), <crlf>, "R");
    if ($output) { return "S"; }

This makes it possible to create a custom column that displays the searchable attribute of PDFs, or to search by it.
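For readers outside XYplorer, the same check can be sketched in Python. This is only a rough sketch under assumptions: the tool path is copied from the script above, the names `extract_text` and `searchable_marker` are made up for illustration, and the `-simple` flag is left out because not every pdftotext build may accept it.

```python
import subprocess
from typing import Callable

# Assumed install location, copied from the script above; adjust to your system.
PDFTOTEXT = r"C:\Tools\xpdf-tools\pdftotext.exe"


def extract_text(pdf_path: str, tool: str = PDFTOTEXT) -> str:
    """Run pdftotext on one file and return whatever text it writes to stdout."""
    result = subprocess.run(
        [tool, "-nopgbrk", pdf_path, "-"],  # "-" sends the extracted text to stdout
        capture_output=True,
        check=False,
    )
    return result.stdout.decode("utf-8", errors="replace")


def searchable_marker(pdf_path: str,
                      extract: Callable[[str], str] = extract_text) -> str:
    """Return "S" if any non-whitespace text was extracted, otherwise ""."""
    return "S" if extract(pdf_path).strip() else ""
```

The `extract` parameter is only there so the decision logic can be tried without pdftotext installed. Note that a PDF with even a single text page counts as searchable here, just as in the script above.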

mgroen · Post by *mgroen » 2021-03-04, 10:08 UTC

elgonzo wrote: 2021-03-02, 15:21 UTC I don't know exactly what you mean by "searchable/non searchable". Are you referring to PDFs that contain text as text, versus PDFs that only contain images of scanned pages but no text itself?

I don't know, and I don't want to speculate any further. However, whispers in the dark (aka the forum search) tell me of two content plug-ins that provide different kinds of information about PDF files. Some of the information they provide might be helpful to you.

xPDFSearch:
http://lefteous.totalcmd.net/tc/xpdfsearch_eng.htm
viewtopic.php?t=7423

pdfOCR:
viewtopic.php?t=41504

Exactly. PDFs can be either searchable or non-searchable. You can check it yourself by opening a PDF file, pressing Ctrl+F, typing any search term, and seeing whether it is found. From the outside (without opening the file and pressing Ctrl+F), it is hard to tell whether a PDF is searchable. But there could be programmatic/scripting intelligence that checks for searchability. I would prefer a file manager that has this feature built in; I am a bit hesitant to use scripts. But I will check out your links. I was hoping TotalCommander has this feature built in, or has plans to implement it.

gdpr deleted 6 · Post by *gdpr deleted 6 » 2021-03-04, 10:57 UTC

mgroen wrote: 2021-03-04, 10:08 UTC I was hoping TotalCommander has this feature built in, or has plans to implement it.

There is little reason to believe that a general-purpose file manager would have such a very specific feature built-in, nor is there reason to believe that such a general-purpose file manager will ever exist. It's too "niche"...

But the thing is, it doesn't have to be built-in. That's precisely why applications like TC have plug-in mechanisms: to allow the user community (or others) to extend the functionality of TC with "niche" features that are too specific, too narrow, useful to only a few users, outside of what could be considered "general-purpose", or whatever. This allows the developer of TC to (hopefully) concentrate on implementing and (more importantly) maintaining a useful core feature set, while providing an avenue for volunteer efforts to provide functionality that would otherwise be an unfavourable effort/benefit proposition for the developer of TC.

The pdfOCR plug-in sounds like a good fit for your needs (I hope). But be aware (as mentioned in the topic I linked to) that it seems to be rather slow and can potentially render TC unresponsive while it processes the PDFs in the directory you list in one or both file panels in TC...

mgroen · Post by *mgroen » 2021-03-07, 13:45 UTC

To make clearer what I need, I made a screenshot.

In short again: I need an overview of files with filenames and a mark showing whether each PDF file is searchable or not.

Here is what I need (see the attached screenshot below):

It's made in another file manager, but the idea is the same.

I'm not going to discuss whether or not it's niche. If the functionality is there, great, I would love to use it. If it's not there, of course I hope to see it, and let other people vote for this feature request; then we'll see whether this is niche or not. If it's too niche to build into the core, that's fine too, but then let someone build it into a plug-in that I (and others) would be able to use.

[img]https://i.ibb.co/hKwHBWB/2021-03-05-13-28-48.jpg[/img]

gdpr deleted 6 · Post by *gdpr deleted 6 » 2021-03-07, 14:04 UTC

mgroen wrote: 2021-03-07, 13:45 UTC [...] let someone build it into a plug-in that I (and others) would be able to use.

[img]https://i.ibb.co/hKwHBWB/2021-03-05-13-28-48.jpg[/img]

Uh, it's almost as if nobody had suggested existing plug-ins to you and you still had to wait for someone to write such a plug-in.
I am a bit confused, I have to admit.

In my previous comment, I suggested you give pdfOCR a try. In my first comment, I provided the link to the forum topic about this plug-in. That forum topic literally starts with:

slavne wrote: 2014-12-10, 20:38 UTC WDX plugin pdfOCR is intended to show the number of pages in a pdf file that need OCR processing. With the help of pdfOCR plugin, you can immediately spot which pdf files are unavailable for text search, either by you or by some indexing system. That is the purpose of needOCR column.

(Emphasis mine...)

I feel like you would benefit from a dose of caffeine or two...

Horst.Epp · Post by *Horst.Epp » 2021-03-07, 14:59 UTC

elgonzo wrote: 2021-03-07, 14:04 UTC
mgroen wrote: 2021-03-07, 13:45 UTC [...] let someone build it into a plug-in that I (and others) would be able to use.

[img]https://i.ibb.co/hKwHBWB/2021-03-05-13-28-48.jpg[/img]
Uh, it's almost as if nobody had suggested existing plug-ins to you and you still had to wait for someone to write such a plug-in.
I am a bit confused, I have to admit.

In my previous comment, I suggested you give pdfOCR a try. In my first comment, I provided the link to the forum topic about this plug-in. That forum topic literally starts with:

slavne wrote: 2014-12-10, 20:38 UTC WDX plugin pdfOCR is intended to show the number of pages in a pdf file that need OCR processing. With the help of pdfOCR plugin, you can immediately spot which pdf files are unavailable for text search, either by you or by some indexing system. That is the purpose of needOCR column.
(Emphasis mine...)

I feel like you would benefit from a dose of caffeine or two...

In the XYplorer forum his topic was closed because he didn't try any of the suggestions,
even the ones which do exactly what he requested using only native XY methods.
So I guess any suggestion in this forum is useless as well.

tuska · Post by *tuska » 2021-03-07, 18:52 UTC

2mgroen
Out of interest, I looked into the topic and had immediate success with the "WDX plugin pdfOCR" mentioned above.
Link: pdfOCR 0.9

I assume that you know how to install a plugin -> Download, then double-click on file "wdx_pdfOCR_0.9.rar" in Total Commander(!).
For me, the installed plugin is then shown under menu "Configuration" - "Options..." - "Content plugins (.WDX)": Button "Configure" e.g. as follows: %COMMANDER_PATH%\Plugins\wdx\pdfOCR\pdfTrebaOcr.wdx

When creating a "Custom columns - View", "pdftrebaocr" must therefore be selected as the plug-in!
You probably already know how to "Configure custom columns...", if not, just let me know - or look here:

Custom column:

Code:

Caption: needOCR ... Field contents: [=pdftrebaocr.needOCR]

This means that your column with heading "searchable" shown in the picture is then called "needOCR".
However, the column heading is freely selectable, i.e. you can also call it "searchable":

Code:

Caption: searchable ... Field contents: [=pdftrebaocr.needOCR]

If the column "needOCR" contains a value <>0, then the file has pages which are not searchable [and need character recognition (OCR)].

http://totalcmd.net/plugring/pdfOCR.html wrote: pdfOCR 0.9 - Purpose:
pdfOCR is wdx plugin that discovers how many pages of PDF file in current directory needs character recognition (OCR),
i.e. how many pages in PDF file have no searchable text in their layout. ...

First experiences:

totalPages -> if the value -3 is displayed after applying the custom column,
simply press CTRL+left arrow key or CTRL+right arrow key (depending on which TC window is active).
The correct number of pages is then shown in the other TC window.
Option: Additionally set up the field [=xpdfsearch.Number of Pages] of the xPDFSearch plugin.
In case of a protected pdf-document for which a password has to be entered when trying to open it,
you may get the following view in the respective columns with these two plugins:
Code:
```
---------------------------------------------------
[=xpdfsearch.Encrypted]  |  [=pdftrebaocr.password]
             blank       |             No
---------------------------------------------------
```
(Adobe Acrobat XI Version 11.0.23)

Overall, however, the "needOCR" column, which shows the number of non-searchable pages in each PDF file,
worked very well for me. I have carried out several tests...

I wish you success!

Regards,
Karl

PS: The setup of the plugin with required columns should be done in 5 minutes.
PPS: Let us know if you succeeded in setting it up.

tuska · Post by *tuska » 2021-03-08, 09:44 UTC

Taking the idea of this topic a little further, I am convinced that a user (not me) with scripting skills
(e.g. AHK, etc.) could create a script for the plugin WinScript Advanced Content Plugin
that sets up another column.

Target:

Code:

IF content of field "needOCR" [=pdftrebaocr.needOCR] = content of field "Number of pages" [=xpdfsearch.Number of Pages] *),
THEN display the value "=", for example.

(With programming you could control which value, e.g. >50%, etc. you want to display in the new column).

*) [=xpdfsearch.Number of Pages] determines the total number of pages more reliably than "totalPages" [=pdftrebaocr.totalPages],
which often shows the value: -3 after the first application of the "Custom columns - View".
(A solution for this problem has already been mentioned above).

Purpose:
This is used to determine if ALL PAGES of pdf documents are NOT SEARCHABLE.

(Note: If, for example, in a document with 76 pages, only the value: 2 (=2 pages) is displayed in the column "needOCR",
then one can actually assume that it is a searchable PDF document, in which, for example, only 2 non-searchable images are present).
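The comparison described under "Target" could be sketched roughly as follows. This is only an illustration of the logic in Python, not a ready-made script for the WinScript plugin: the function name `ocr_marker` and the `threshold` parameter are made up, and the two numbers are assumed to have been read from the fields [=pdftrebaocr.needOCR] and [=xpdfsearch.Number of Pages] beforehand.

```python
def ocr_marker(need_ocr: int, total_pages: int, threshold: float = 1.0) -> str:
    """Return "=" when the share of pages needing OCR reaches the threshold.

    threshold=1.0 means "all pages are non-searchable"; a lower value such
    as 0.5 marks files in which at least half of the pages need OCR.
    """
    if total_pages <= 0 or need_ocr < 0:
        # Page count not available (for example the value -3 mentioned above).
        return "?"
    if need_ocr / total_pages >= threshold:
        return "="
    return ""
```

With the default threshold, a 76-page document in which only 2 pages need OCR gets no marker, which matches the note above that such a file can be treated as searchable.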

It's just a pity that the plugin "pdfOCR" generally takes a long time for the evaluation... (as already said above).

Horst.Epp · Post by *Horst.Epp » 2021-03-08, 10:36 UTC

tuska wrote: 2021-03-08, 09:44 UTC ...
It's just a pity that the plugin "pdfOCR" generally takes a long time for the evaluation... (as already said above).

A script which uses pdftotext from xpdf-Tools
takes only about 18 seconds to scan 220 PDF files to find if they are searchable or not.
Not a bad time and no one makes such searches more than once for the same files.
The script is currently running in my other file manager XYplorer.
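Since such a scan rarely has to be repeated for unchanged files, the results could also be remembered between runs. A minimal Python sketch, assuming a `check` function that reports whether one PDF is searchable (for instance a wrapper around pdftotext); `scan_folder`, the JSON cache file and its key format are hypothetical choices and not part of the script mentioned here.

```python
import json
import os
from typing import Callable, Dict


def scan_folder(folder: str, check: Callable[[str], bool],
                cache_file: str) -> Dict[str, bool]:
    """Map each PDF name in `folder` to True/False (searchable or not).

    Results are cached by name, size and modification time, so files that
    have not changed are not checked again on later runs.
    """
    try:
        with open(cache_file, encoding="utf-8") as fh:
            cache = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        cache = {}  # no usable cache yet, start from scratch

    results = {}
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".pdf"):
            continue
        path = os.path.join(folder, name)
        stat = os.stat(path)
        key = f"{name}|{stat.st_size}|{stat.st_mtime_ns}"
        if key not in cache:
            cache[key] = bool(check(path))  # the slow part, done once per file version
        results[name] = cache[key]

    with open(cache_file, "w", encoding="utf-8") as fh:
        json.dump(cache, fh)
    return results
```

The cache key includes size and modification time, so a file that has been run through OCR in the meantime is checked again automatically.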

tuska · Post by *tuska » 2021-03-08, 14:19 UTC

Horst.Epp wrote: 2021-03-08, 10:36 UTC
tuska wrote: 2021-03-08, 09:44 UTC ...
It's just a pity that the plugin "pdfOCR" generally takes a long time for the evaluation... (as already said above).
A script which uses pdftotext from xpdf-Tools
takes only about 18 seconds to scan 220 PDF files to find if they are searchable or not.
Not a bad time and no one makes such searches more than once for the same files.
The script is currently running in my other file manager XYplorer.

This solution should then be interesting for mgroen: "I have lots of PDF files".

I usually have only a few pdf files and manage well with the plugin "pdfOCR".

A test showed me that for a folder with 41 pdf files (average file size per file: approx. 23 MB)
the plugin "pdfOCR" takes about 1 minute until you can work "fluently" in this folder again.

gdpr deleted 6 · Post by *gdpr deleted 6 » 2021-03-08, 14:25 UTC

tuska wrote: 2021-03-08, 14:19 UTC A test showed me that for a folder with 41 pdf files (average file size per file: approx. 23 MB)
the plugin "pdfOCR" takes about 1 minute until you can work "fluently" in this folder again.

Ouch!

tuska · Post by *tuska » 2021-03-08, 14:42 UTC

elgonzo wrote: 2021-03-08, 14:25 UTC
tuska wrote: 2021-03-08, 14:19 UTC A test showed me that for a folder with 41 pdf files (average file size per file: approx. 23 MB)
the plugin "pdfOCR" takes about 1 minute until you can work "fluently" in this folder again.
Ouch!

The pain is not so great if, for example, you only have 37 pdf files (average file size per file: approx. 0.3 MB),
because the plugin "pdfOCR" then only needs about 10 seconds until you can work "smoothly" in this folder again.

Horst.Epp · Post by *Horst.Epp » 2021-03-08, 14:53 UTC

tuska wrote: 2021-03-08, 14:19 UTC ...
This solution should then be interesting for mgroen: "I have lots of PDF files".

We gave him this solution in the XY forum, and the current XYplorer version can even do it without external tools now.
But he wasn't really doing or testing anything, and his thread was closed
because he kept complaining without reacting at all to the suggestions.
