Searching for the word "web" in docx files always succeeds

spamme1 · Post by *spamme1 » 2016-01-12, 15:37 UTC

In a collection of 20 Word files (docx) that I have written, I was looking for one where I had written something about the "web". So I searched for all files containing the word "web" in the text, and to my surprise all the docx documents appeared in the result pane. When I opened the first two and searched for the word "web" in the text, Word didn't find it. Somehow Total Commander finds the word "web" in all docx documents, although it is not part of the document content.

ZoSTeR · Post by *ZoSTeR » 2016-01-12, 17:14 UTC

Word Docx files are zipped XML files (and others) which contain multiple references to "webSettings".

You could try the plugin Office2007wdx and search using the plugin field "content".

Post by *ghisler(Author) » 2016-01-14, 11:21 UTC

TC opens the docx files (they are renamed ZIP files) and then searches the XML files inside for the word you are trying to find. Apparently "web" appears there even if it doesn't appear in the text.

You can look inside a docx with Ctrl+PageDown. When I look at [Content_Types].xml, I find for example this:
Override PartName="/word/webSettings.xml"
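
The effect described above can be reproduced outside TC with a short Python sketch. It does not show how TC implements its search; it only illustrates a plain byte search over every member of the ZIP container. The sample archive is built in memory and its content is a heavily simplified assumption, not a valid Word document.

```python
import io
import zipfile
from typing import List


def make_sample_docx() -> bytes:
    """Build a tiny docx-like ZIP in memory (hypothetical, simplified content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "[Content_Types].xml",
            '<Types><Override PartName="/word/webSettings.xml"/></Types>',
        )
        zf.writestr("word/webSettings.xml", "<w:webSettings/>")
        zf.writestr(
            "word/document.xml",
            "<w:document><w:body><w:p><w:r><w:t>Hello world</w:t></w:r></w:p>"
            "</w:body></w:document>",
        )
    return buf.getvalue()


def members_containing(docx_bytes: bytes, term: str) -> List[str]:
    """Return the names of all ZIP members whose raw bytes contain the term."""
    needle = term.lower().encode("utf-8")
    hits = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        for name in zf.namelist():
            # Case-insensitive search in the raw XML, markup included.
            if needle in zf.read(name).lower():
                hits.append(name)
    return hits


if __name__ == "__main__":
    sample = make_sample_docx()
    print(members_containing(sample, "web"))
    print(members_containing(sample, "hello"))
```

In this sample the term "web" is reported for two members even though the visible text is only "Hello world", because the match comes from markup rather than from the document body.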

Lefteous · Post by *Lefteous » 2016-01-14, 11:25 UTC

2ghisler(Author)
At least for Word files (*.docx) the words contained in the document are stored in 'document.xml' only. Searching only there still isn't a docx->txt conversion, but it's better than searching all XML files in the ZIP.
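
A minimal sketch of that idea in Python, assuming the main body sits at the conventional path word/document.xml inside the archive. The helper names and the in-memory sample are made up for illustration. Because this is still a raw search in XML, tag names inside document.xml can match too, as the last call shows.

```python
import io
import zipfile

# Assumption: conventional location of the main document part in a docx.
MAIN_PART = "word/document.xml"


def build_sample(body_text: str) -> bytes:
    """Create a simplified docx-like ZIP in memory (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/webSettings.xml", "<w:webSettings/>")
        zf.writestr(
            MAIN_PART,
            "<w:document><w:body><w:p><w:r><w:t>%s</w:t></w:r></w:p>"
            "</w:body></w:document>" % body_text,
        )
    return buf.getvalue()


def main_part_contains(docx_bytes: bytes, term: str) -> bool:
    """Search only the main document part, ignoring all other ZIP members."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        if MAIN_PART not in zf.namelist():
            return False
        data = zf.read(MAIN_PART).decode("utf-8", errors="replace")
    return term.lower() in data.lower()


if __name__ == "__main__":
    sample = build_sample("Hello world")
    print(main_part_contains(sample, "web"))    # other members are ignored
    print(main_part_contains(sample, "hello"))  # found in the body text
    print(main_part_contains(sample, "body"))   # matches the tag name w:body
```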

Post by *ghisler(Author) » 2016-01-14, 11:27 UTC

Well, there are many xml office files out there, e.g. docx, xlsx, openoffice, libreoffice etc. How would I know which of the xml files I need to search?

Lefteous · Post by *Lefteous » 2016-01-14, 11:30 UTC

2ghisler(Author)

ghisler(Author) wrote:Well, there are many xml office files out there, e.g. docx, xlsx, openoffice, libreoffice etc. How would I know which of the xml files I need to search?

All these formats are officially documented, but for a pure text search it's not rocket science to research them. A good strategy is to create files containing weird words and then perform a search. You'll find the right files, or even the right values in the XML elements, quite fast.

I think the integrated function to search such files should work well. Otherwise it would be better to leave it to plugin authors.

MVV · Post by *MVV » 2016-01-14, 18:30 UTC

I agree that a separate 'search in office documents' function should work differently than plain 'search in archives'. But of course it isn't easy to detect which files should be ignored in Office documents... The documentation is open, yes, but it is HUGE!

spamme1 · Post by *spamme1 » 2016-01-15, 01:18 UTC

ghisler(Author) wrote:Well, there are many xml office files out there, e.g. docx, xlsx, openoffice, libreoffice etc. How would I know which of the xml files I need to search?

Yes, you are right. I just checked the docx and saw that all the document content is XML as well. Clearly, to search only in the text content, the file has to be parsed and the XML document has to be built. I suppose that Total Commander just searches the XML files as plain text files.

Post by *ghisler(Author) » 2016-01-18, 10:35 UTC

Yes it does. Office xml files are some of the most complex files in existence, not even OpenOffice/LibreOffice is able to import everything correctly. It would be far beyond the possibilities of a simple file manager to parse them correctly.

MVV · Post by *MVV » 2016-01-18, 10:47 UTC

Maybe you should just strip the tags like you do in Lister for HTML files? This will exclude all tag names from search results and leave only pure text. I think this difference from regular search in archives should be acceptable for Office documents.
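
A rough Python sketch of the tag-stripping idea. It is not Lister's actual algorithm, just a regular-expression approximation with hypothetical helper names. Each tag is replaced by a space here; that keeps words from adjacent paragraphs apart, but it would also split a word that Word stored across two formatting runs.

```python
import html
import re

# Anything between '<' and '>' is treated as a tag (simplification).
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(xml_text: str) -> str:
    """Remove markup and decode entities such as &amp; so only text remains."""
    text = TAG_RE.sub(" ", xml_text)
    return html.unescape(text)


def text_contains(xml_text: str, term: str) -> bool:
    """Case-insensitive search in the tag-free text."""
    return term.lower() in strip_tags(xml_text).lower()


if __name__ == "__main__":
    xml = "<w:webSettings/><w:p><w:r><w:t>Tom &amp; Jerry</w:t></w:r></w:p>"
    print(text_contains(xml, "web"))          # tag names no longer match
    print(text_contains(xml, "tom & jerry"))  # entity decoded before searching
```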

Horst.Epp · Post by *Horst.Epp » 2016-01-18, 11:24 UTC

I simply use Windows Search for that.
On my Windows 10 system it finds the contents of doc and docx files via the IFilters without any problems.
I can then jump directly to the path of the found file in TC via the clipboard.
In TC I do that with a button as follows:
cmd=c:\tools\NirSoft\nircmd.exe exec show %COMMANDER_EXE% /O /S /A /L="~$clipboard$
I get the path in the Windows search results from the context menu
using the ShimExt tool.

Lefteous · Post by *Lefteous » 2016-01-18, 12:29 UTC

The plugin TextSearch does the above-mentioned tasks (it really converts files to text).

The question is what is the right way to go?
- Okay to have 'basic' solution in TC, use plugin if built-in is not enough
- Full implementation required in TC, basic function is awkward
- Basic function is awkward so better remove it, use plugin

MVV · Post by *MVV » 2016-01-18, 15:59 UTC

I think a basic embedded solution would be enough; a complete parser may be done as a plugin.

Horst.Epp · Post by *Horst.Epp » 2016-01-18, 16:43 UTC

MVV wrote:I think basic embedded solution would be enough, complete parser may be done as a plugin.

There is a way to use the Windows Desktop Search results.
I suggest doing this instead of complicated TC solutions or plugins.
It's implemented in xplorer² ultimate, for example.
http://zabkat.com/tour3.htm

milo1012 · Post by *milo1012 » 2016-01-18, 18:41 UTC

I think the OT problem is rare.
99 percent of different words can be found w/o triggering ambiguous results.
But even if not: you can still use a RegEx search or mask to find your term only if it's not inside a tag, or only if it's near other words, etc.
So I think the built-in search is fine for now, and for a refined search you can use plug-ins.
But IMO we should see a hint for the OT problem in the help file.
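
One way such a RegEx could look, sketched in Python. The pattern uses a negative lookahead, so it only helps in engines that support lookahead (PCRE-style engines do; whether a given search dialog's engine does is an assumption to verify). It also assumes that no raw '>' character appears inside attribute values.

```python
import re

# Match "web" only when it is NOT followed by a run of non-bracket
# characters and a closing '>', i.e. when it is not inside a tag.
OUTSIDE_TAG = re.compile(r"web(?![^<>]*>)", re.IGNORECASE)

if __name__ == "__main__":
    markup_only = '<Override PartName="/word/webSettings.xml"/>'
    with_text = "<w:webSettings/><w:t>the web is big</w:t>"
    print(OUTSIDE_TAG.search(markup_only))   # no match: "web" is inside a tag
    print(OUTSIDE_TAG.findall(with_text))    # only the occurrence in the text
```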

BTW, I'm currently working on supporting Oracle OiT Content/Text Access for PCREsearch, and enabling full wdx text search for it,
especially since Christian confirmed Unicode wdx text search for TC 9.
It would give us some reliable and stable text content access (of course, it has its limits too).

It would be nice if all wdx fields with full text search would appear in the main search dialog's tab, for easier access, and for showing the user some alternate engine to use.