xPDFSearch 1.11 - Content plugin to search text in PDF files
What doesn't work?
Already done about three years ago.
I tested a simple PDF and all the functions I tried work; here are a few screenshots:
http://i.imgur.com/R2RV3kp.png
http://i.imgur.com/kNfRFbz.png
Both custom fields and custom search support Cyrillic. Unicode is not needed for Cyrillic, although the default font needs to have Cyrillic glyphs. I'm using Windows 7, but as I upgraded from XP my default font for most of TC is Tahoma, which has Cyrillic glyphs; it might be an issue with your font.
Edit: I just had a thought. If the PDF was generated with some old transliterated fonts (7-bit fonts that had only Cyrillic characters), then there is no way for Xpdf to display them properly; that's a font limitation. I remember someone sending me a PDF with such fonts that weren't embedded, and the PDF looked strange. The only way around that is to recreate the PDF with proper 8-bit ASCII or Unicode fonts.
iana wrote: Unicode is not needed for Cyrillic...
Quite a vague statement, isn't it?
Unicode is required! If your system isn't set to Cyrillic you can't search for these characters,
especially if you're on some remote Workstation where you're not allowed to switch system settings.
I have tons of technical documents where information/text is stored in CJK characters.
There is no way I can search them with xPDFSearch the way it is now.
I don't know what the special case with Ovg's system is,
but independently of that, it is time to finally have a Unicode variant for ft_fulltext.
iana wrote: I just had a thought, if the pdf was generated with some old transliterated fonts (7bit fonts that had only cyr characters) then there is no way for xpdf to display them properly that's a font limitation
This has nothing to do with PDF-embedded or system font.
The PDF is simply decoded with Xpdf - no fonts required because it's not displayed or rendered -
and the decoded data is transferred to TC (in portions), and is being searched.
TC plugins: PCREsearch and RegXtract
I am all for full Unicode support, but that does not change the fact that Unicode is not needed for Cyrillic support. There are around 2*35 (70+) Cyrillic characters, and most 8-bit fonts with 256 symbols include not just Latin but also Greek and Cyrillic. I was replying to Ovg, who said xpdf didn't work with Cyrillic; it does. Those old transliterated fonts were popular in the early '90s (they replace the Latin symbols with Cyrillic ones). Most people have stopped using them, but there are old documents that are, in general, badly generated PDFs; Xpdf will display, and has displayed, those characters as Latin even if you have the fonts installed or embedded.
Quote: This has nothing to do with PDF-embedded or system font. The PDF is simply decoded with Xpdf - no fonts required because it's not displayed or rendered - and the decoded data is transferred to TC (in portions), and is being searched.
But if you have that font set as the default TC font, Xpdf will display it correctly; that's why I said it's a font issue. For example, this is a popular font in my country (it has no Latin glyphs; the Cyrillic glyphs are encoded at the lower code numbers of their Latin cousins):
http://www.fonts2u.com/mac-c-times.font
A lot of documents were generated using it. There is no way TC via Xpdf would display the content of such a document properly; you would need to set that font as the default TC font or regenerate the PDF. As the font used for metadata cannot be changed, most of the information content for those old documents would be Latin.
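To make the transliterated-font case concrete, here is a small Python sketch. It assumes the file stores ordinary Latin character codes and only the font draws them as Cyrillic. The table `GLYPH_AT_LATIN_CODE` and both helper functions are made up for illustration; the table is not the real layout of the font linked above or of any other font.

```python
# Hypothetical glyph table: which Cyrillic glyph the font draws at each Latin code.
# This is an illustrative assumption, not the layout of a real font.
GLYPH_AT_LATIN_CODE = {
    "P": "П", "r": "р", "i": "и", "v": "в", "e": "е", "t": "т",
}

def rendered(stored: str) -> str:
    """What a viewer would show if the transliterated font were used for display."""
    return "".join(GLYPH_AT_LATIN_CODE.get(ch, ch) for ch in stored)

def stored_form(term: str) -> str:
    """Map a Cyrillic search term back to the Latin codes that sit in the file."""
    reverse = {cyr: lat for lat, cyr in GLYPH_AT_LATIN_CODE.items()}
    return "".join(reverse.get(ch, ch) for ch in term)

stored = "Privet"                        # the codes actually stored in the PDF text
print(rendered(stored))                  # Привет
print("Привет" in stored)                # False: extracted text is Latin
print(stored_form("Привет") in stored)   # True: search with the stored codes
```

Under these assumptions a text search only succeeds if the search term is typed (or converted) in the Latin form, whatever font is used for display.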
iana wrote: but that does not change the fact that unicode is not needed for Cyrillic support...
Again, this is just wrong, or at least vague, depending on your system.
iana wrote: there are around 2*35 (70+) Cyrillic characters, most 8-bit fonts with 256 symbols do include not just Latin but Greek and Cyrillic support
What link is there between fonts and character recoding?
We're talking about a text search here, no display at all for that purpose.
You're entering characters in TC's text box for xPDFSearch. If you enter Cyrillic characters, TC recodes them to the system's ANSI page.
My system page is 1252, so there are no Cyrillic characters there, and I'll get a replacement character (question mark) for each of them.
Now, when TC searches the raw characters that are streamed from xPDFSearch, it's extremely unlikely that you'll get a match that way.
iana wrote: that's why I said it's a font issue
Xpdf, just like most other programs, relies on the system ANSI page.
So it's not a font issue, but a system-setting issue.
If this page can't map the TC input to the xPDFSearch output, there is no match, just like I said above (1252 and similar pages have no Cyrillic).
It just doesn't matter whether Xpdf correctly recodes the characters; we can't match them the way it is now.
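A rough Python sketch of the recoding problem described in this post. Assumptions: Python's `cp1252` and `cp1251` codecs stand in for the Windows ANSI pages, `to_ansi` is a made-up helper, and the text coming from the plugin is pretended to be cp1251 bytes. None of this is the actual TC or xPDFSearch code.

```python
def to_ansi(term: str, codepage: str) -> bytes:
    """Encode a search term to a code page; unmappable characters become '?'."""
    return term.encode(codepage, errors="replace")

term = "привет"
# Assumption for illustration: the extracted PDF text arrives as cp1251 bytes.
pdf_text = "текст: привет мир".encode("cp1251")

western = to_ansi(term, "cp1252")    # cp1252 has no Cyrillic letters
cyrillic = to_ansi(term, "cp1251")   # cp1251 does

print(western)               # b'??????'
print(western in pdf_text)   # False
print(cyrillic in pdf_text)  # True
```

On a 1252 system the search term has already turned into question marks before any comparison happens, so the original characters are gone no matter how well the PDF itself was decoded. A Unicode search path would avoid the lossy step altogether.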
Is a MuPDF version/fork possible?
The SumatraPDF guys have a command-line app (available if you build it yourself) that uses MuPDF and dumps not only PDF info but also epub/mobi/cbz/cbr... info:
https://code.google.com/p/sumatrapdf/source/browse/trunk/src/EngineDump.cpp
http://mupdf.com/docs/overview
I believe MuPDF is more actively developed than Xpdf, and MuPDF is available as a native Windows library.
Hi,
It seems to be kernel32.dll. The Fileinfo plugin shows a yellow exclamation mark without an hourglass next to it.
When I expand the tree further, no other critical dependencies are listed.
Regards, Iowa