distinction between searchable/non searchable PDF files
Moderators: Hacker, petermad, Stefan2, white
distinction between searchable/non searchable PDF files
Is Total Commander able to set a marker so that the user can distinguish searchable from non-searchable PDF files?
For example, by creating a column "searchable" and putting an S (or any marker) on each line for searchable PDF files?
(and not setting a marker for non-searchable ones)
Explanation:
I have lots of PDF files; some of them are searchable, some are not.
I want to get an overview of searchable/non-searchable files without opening each file manually to check whether it's searchable.
The main goal behind this question is that I have lots of PDF files and I want to make them all searchable, but to do that I first need an overview of which PDF files are already searchable and which are not.
Is TotalCommander able to help me?
Thanks,
Mathijs
-
- Power Member
- Posts: 872
- Joined: 2013-09-04, 14:07 UTC
Re: distinction between searchable/non searchable PDF files
I don't know what you mean exactly by saying "searchable/non searchable". Are you trying to refer to PDFs that contain text as text, and PDFs that only contain images of scanned pages, but not the text itself?
I don't know, and I don't want to speculate any further. However, whispers in the dark (aka the forum search) tell me of two content plug-ins providing different kinds of information about PDF files. Amongst the types of information provided, some might be helpful to you.
xPDFSearch:
http://lefteous.totalcmd.net/tc/xpdfsearch_eng.htm
viewtopic.php?t=7423
pdfOCR:
viewtopic.php?t=41504
Re: distinction between searchable/non searchable PDF files
User mgroen also asked for such a function in the XYplorer forum.
User highend provided a script which does it perfectly for me in XYplorer.
This allows you to create a custom column to display or search for the searchable attribute of PDFs.
Code: Select all
$tool = "C:\Tools\xpdf-tools\pdftotext.exe";
$output = trim(runret("""$tool"" -simple -nopgbrk ""<cc_item>"" -", %TEMP%, 65001), <crlf>, "R");
if ($output) { return "S"; }
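For readers who want to try the same idea outside XYplorer, here is a minimal Python sketch of that check. It assumes a pdftotext binary is reachable under the name given in `tool`; the function names `marker_for` and `searchable_marker` are made up for this illustration. It leaves out the `-simple` flag from the script above, since I am not sure every pdftotext build accepts it.

```python
import subprocess


def marker_for(extracted_text: str) -> str:
    """Return "S" if the extracted text has any non-whitespace content, else ""."""
    return "S" if extracted_text.strip() else ""


def searchable_marker(pdf_path: str, tool: str = "pdftotext") -> str:
    """Run pdftotext on one file and map its output to a marker.

    Assumption: `tool` resolves to a pdftotext executable, for example the
    one from xpdf-tools used in the script above.
    """
    result = subprocess.run(
        [tool, "-nopgbrk", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",  # only the presence of text matters, not exact decoding
        check=False,
    )
    if result.returncode != 0:
        return ""  # unreadable or protected file: no marker
    return marker_for(result.stdout)
```

Calling `searchable_marker(r"C:\some\file.pdf")` (a placeholder path) would then give "S" for a PDF with extractable text and an empty string otherwise, which mirrors what the custom column script returns.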
Windows 11 Home, Version 24H2 (OS Build 26100.3915)
TC 11.51 x64 / x86
Everything 1.5.0.1391a (x64), Everything Toolbar 1.5.2.0, Listary Pro 6.3.2.88
QAP 11.6.4.2.1 x64
Re: distinction between searchable/non searchable PDF files
elgonzo wrote: 2021-03-02, 15:21 UTC I don't know what you mean exactly by saying "searchable/non searchable". Are you trying to refer to PDFs that contain text as text, and PDFs that only contain images of scanned pages, but not the text itself?
[...]
Exactly. PDFs can be either searchable or non-searchable. You can check it yourself by opening a PDF file, pressing Ctrl+F, typing any search term and seeing whether it is found. From the outside (without opening the file and pressing Ctrl+F yourself), it is hard to tell whether a PDF is searchable. But there could be programmatic/scripting intelligence that checks for searchability. I would prefer a file manager which has this feature built in; I am a bit hesitant about using scripts. But I will check out your links. I was hoping Total Commander has this feature built in, or has plans to implement it.
Re: distinction between searchable/non searchable PDF files
mgroen wrote: 2021-03-04, 10:08 UTC I was hoping TotalCommander has this feature built in, or has plans to implement it.
There is little reason to believe that a general-purpose file manager would have such a very specific feature built in, nor is there reason to believe that such a general-purpose file manager will ever exist. It's too "niche"...
But the thing is, it doesn't have to be built in. That's precisely why applications like TC have plug-in mechanisms: to allow the user community (or others) to extend the functionality of TC with "niche" features that are too specific, too narrow, useful to only a few users, outside of what could be considered "general-purpose", or whatever. This allows the developer of TC to (hopefully) concentrate on implementing and (more importantly) maintaining a useful core feature set, while providing an avenue for volunteer efforts to provide functionality that otherwise would be an unfavourable effort/benefit proposition for the developer of TC.
The pdfOCR plug-in sounds like a good fit for your needs (I hope). But be aware (as mentioned in the topic I linked to), it seems to be rather slow, and potentially renders TC unresponsive while it is processing the PDFs in the directory you want to list in one or both file panels in TC...
Re: distinction between searchable/non searchable PDF files
To make it clearer what I need, I made a screenshot.
In short again: I need an overview of files with filenames and a mark/display showing whether each PDF file is searchable or not.
Here is what I need (see attached screenshot below):
It's made with another file manager, but the idea is the same.
I'm not going to discuss whether or not it's niche. If the functionality is there, great, I would love to use it. If it's not there, of course I hope to see it, and let other people vote for this feature request; then we'll see if this is niche or not. If it's too niche to build into the core, that's fine too, but then let someone build it as a plug-in that I (and others) would be able to use.
[img]https://i.ibb.co/hKwHBWB/2021-03-05-13-28-48.jpg[/img]
Re: distinction between searchable/non searchable PDF files
mgroen wrote: 2021-03-07, 13:45 UTC [...] let someone built it in a plug-in that I (and other) would be able to use.
[img]https://i.ibb.co/hKwHBWB/2021-03-05-13-28-48.jpg[/img]
Uh, it's almost like nobody has suggested existing plug-ins to you and you still need to wait for someone writing such a plug-in.
I am a bit confused, I have to admit.
In my previous comment, I suggested you should give pdfOCR a try. In my first comment, I provided the link to the forum topic about this plug-in. That forum topic literally starts with:
slavne wrote: 2014-12-10, 20:38 UTC WDX plugin pdfOCR is intended to show the number of pages in a pdf file that need OCR processing. With the help of pdfOCR plugin, you can immediately spot which pdf files are unavailable for text search, either by you or by some indexing system. That is the purpose of needOCR column.
(Emphasis mine...)
I feel like you would benefit from a dose of caffeine or two...

Re: distinction between searchable/non searchable PDF files
elgonzo wrote: 2021-03-07, 14:04 UTC Uh, it's almost like nobody has suggested existing plug-ins to you and you still need to wait for someone writing such a plug-in.
[...]
In the XYplorer forum his topic was closed because he didn't try any suggestion, even the ones which do exactly what he requested using only native XY methods.
So any suggestion in this forum is also useless, I guess.

Re: distinction between searchable/non searchable PDF files
2mgroen
Out of interest, I looked into the topic and had immediate success with the "WDX plugin pdfOCR" mentioned above.
Link: pdfOCR 0.9
I assume that you know how to install a plugin -> Download, then double-click on file "wdx_pdfOCR_0.9.rar" in Total Commander(!).
For me, the installed plugin is then shown under menu "Configuration" - "Options..." - "Content plugins (.WDX)": Button "Configure" e.g. as follows: %COMMANDER_PATH%\Plugins\wdx\pdfOCR\pdfTrebaOcr.wdx
When creating a "Custom columns - View", "pdftrebaocr" must therefore be selected as the plug-in!
You probably already know how to "Configure custom columns...", if not, just let me know - or look here:
- Creation of a "Custom columns - View" - Example (Simply follow the steps given for the plug-in "pdfOCR")
- FAQs: Create a user defined 'Custom Column'
This means that your column with heading "searchable" shown in the picture is then called "needOCR".
Code: Select all
Caption: needOCR ... Field contents: [=pdftrebaocr.needOCR]
However, the column heading is freely selectable, i.e. you can also call it "searchable":
Code: Select all
Caption: searchable ... Field contents: [=pdftrebaocr.needOCR]
If the column "needOCR" contains a value <>0, then the file has pages which are non-searchable [and need character recognition (OCR)].
http://totalcmd.net/plugring/pdfOCR.html wrote: pdfOCR 0.9 - Purpose:
pdfOCR is wdx plugin that discovers how many pages of PDF file in current directory needs character recognition (OCR),
i.e. how many pages in PDF file have no searchable text in their layout. ...
First experiences:
- totalPages -> should the value -3 be displayed when applying the custom column, for example,
then simply press CTRL+left arrow key or CTRL+right arrow key (depending on which TC window is active).
Afterwards, the correct number of pages is updated in the other TC window.
Option: Additionally set up the field [=xpdfsearch.Number of Pages] of the xPDFSearch plugin.
- In case of a protected pdf-document for which a password has to be entered when trying to open it,
you may get the following view in the respective columns with these two plugins:
(Adobe Acrobat XI Version 11.0.23)
Code: Select all
---------------------------------------------------
[=xpdfsearch.Encrypted] | [=pdftrebaocr.password]
blank                   | No
---------------------------------------------------
worked very well for me. I have carried out several tests...
I wish you success!
Regards,
Karl
PS: The setup of the plugin with required columns should be done in 5 minutes.
PPS: Let us know if you succeeded in setting it up.
Re: distinction between searchable/non searchable PDF files
If you take the idea of this topic a little further, then I am convinced that, for example, a user (not me) with scripting skills
(e.g. AHK, etc.) can create a script that can be used with the plugin WinScript Advanced Content Plugin
to set up another column, for example.
Target:
Code: Select all
IF content of field "needOCR" [=pdftrebaocr.needOCR] = content of field "Number of pages" [=xpdfsearch.Number of Pages] *),
THEN display the value "=", for example.
(With programming you could control which value, e.g. >50%, etc. you want to display in the new column).
*) [=xpdfsearch.Number of Pages] determines the total number of pages more reliably than "totalPages" [=pdftrebaocr.totalPages],
which often shows the value: -3 after the first application of the "Custom columns - View".
(A solution for this problem has already been mentioned above).
Purpose:
This is used to determine if ALL PAGES of pdf documents are NOT SEARCHABLE.
(Note: If, for example, in a document with 76 pages, only the value: 2 (=2 pages) is displayed in the column "needOCR",
then one can actually assume that it is a searchable PDF document, in which, for example, only 2 non-searchable images are present).
It's just a pity that the plugin "pdfOCR" generally takes a long time for the evaluation... (as already said above).
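The comparison described under "Target" could be sketched like this in Python. This is only an illustration of the logic, not a ready-made script for the WinScript plugin: the function name `all_pages_marker`, the `threshold` parameter and the "?" marker for unusable values are assumptions of mine, and the two numbers would have to come from the needOCR and "Number of Pages" fields.

```python
def all_pages_marker(need_ocr: int, total_pages: int, threshold: float = 1.0) -> str:
    """Return "=" when the share of pages needing OCR reaches `threshold`.

    threshold=1.0 reproduces the rule above (needOCR equals the page count);
    threshold=0.5 would flag files where at least half of the pages need OCR.
    """
    if total_pages <= 0 or need_ocr < 0:
        # Assumption: show "?" for values that cannot be compared,
        # e.g. a page count of -3 that has not been refreshed yet.
        return "?"
    return "=" if need_ocr / total_pages >= threshold else ""


# The 76-page example from the note: 2 pages without text do not trigger the marker.
print(repr(all_pages_marker(2, 76)))
print(repr(all_pages_marker(76, 76)))
```

With the default threshold only fully non-searchable documents get the "=" marker; lowering the threshold is one way to express the ">50%" idea mentioned above.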
Re: distinction between searchable/non searchable PDF files
tuska wrote: 2021-03-08, 09:44 UTC ...
It's just a pity that the plugin "pdfOCR" generally takes a long time for the evaluation... (as already said above).
A script which uses pdftotext from xpdf-Tools takes only about 18 seconds to scan 220 PDF files to find out whether they are searchable or not.
Not a bad time, and no one runs such a search more than once for the same files.
The script is currently running in my other file manager, XYplorer.
Re: distinction between searchable/non searchable PDF files
Horst.Epp wrote: 2021-03-08, 10:36 UTC A script which uses pdftotext from xpdf-Tools takes only about 18 seconds to scan 220 PDF files to find if they are searchable or not.
[...]
This solution should then be interesting for mgroen: "I have lots of PDF files".
I usually have only a few pdf files and manage well with the plugin "pdfOCR".
A test showed me that for a folder with 41 pdf files (average file size per file: approx. 23 MB)
the plugin "pdfOCR" takes about 1 minute until you can work "fluently" in this folder again.
Re: distinction between searchable/non searchable PDF files
tuska wrote: 2021-03-08, 14:19 UTC A test showed me that for a folder with 41 pdf files (average file size per file: approx. 23 MB)
the plugin "pdfOCR" takes about 1 minute until you can work "fluently" in this folder again.
Ouch!
Re: distinction between searchable/non searchable PDF files
The pain is not so great if, for example, you only have 37 pdf files (average file size per file: approx. 0.3 MB),
because the plugin "pdfOCR" then only needs about 10 seconds until you can work "smoothly" in this folder again.
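To put the timings reported in this thread side by side, here is a small Python calculation of seconds per file. It is only rough arithmetic: "about 1 minute" is taken as 60 seconds, the figures come from different folders and machines, and the labels in `reports` are my own shorthand.

```python
# (files, seconds) as reported in the thread; "about 1 minute" is taken as 60 s
reports = {
    "pdfOCR, 41 files of ~23 MB": (41, 60),
    "pdfOCR, 37 files of ~0.3 MB": (37, 10),
    "pdftotext script, 220 files": (220, 18),
}


def seconds_per_file(files: int, seconds: float) -> float:
    """Average time spent on one file."""
    return seconds / files


per_file = {name: round(seconds_per_file(*fs), 2) for name, fs in reports.items()}
for name, value in per_file.items():
    print(f"{name}: {value} s/file")
```

This gives roughly 1.46 s per file for the large PDFs and 0.27 s for the small ones with pdfOCR, against about 0.08 s per file for the pdftotext script, though the file sizes for the script run are not known.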

Re: distinction between searchable/non searchable PDF files
tuska wrote: 2021-03-08, 14:19 UTC ...
This solution should then be interesting for mgroen: "I have lots of PDF files".
We gave him this solution in the XY forum, and the current XYplorer can even do it now without external tools.
But he wasn't really doing or testing anything, and his thread was closed
because of complaining without reacting at all to the suggestions.