Finding the word "web" in docx files always succeeds
In a collection of 20 Word files (docx) that I have written, I was looking for one in which I had written something about the "web". So I searched for all files containing the word "web" in the text. To my surprise, all the docx documents appeared in the result pane. When I opened the first two and searched for the word "web" in the text, it was not found. Somehow Total Commander finds the word "web" in all docx documents, although it is not part of the document content.
Word Docx files are zipped XML files (and others) which contain multiple references to "webSettings".
You could try the plugin Office2007wdx and search using the plugin field "content".
- ghisler(Author)
- Site Admin
- Posts: 50386
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
TC opens the docx files (they are renamed ZIP) and then searches in the XML files for the word you try to find. Apparently "web" appears there even if it doesn't appear in the text.
You can look inside a docx with Ctrl+PageDown. When I look at [Content_Types].xml, I find for example this:
Override PartName="/word/webSettings.xml"
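The effect described above can be reproduced with a minimal sketch. It builds a tiny docx-like ZIP in memory (not a complete Word file; `make_docx_like` and `raw_hits` are hypothetical helper names) and counts plain-text hits per archive member, which is roughly what a raw search over the unpacked XML would see:

```python
import io
import zipfile


def make_docx_like(body_text):
    """Build a tiny docx-like ZIP in memory (illustrative, not a valid Word file)."""
    content_types = '<Types><Override PartName="/word/webSettings.xml"/></Types>'
    document = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        f"<w:body><w:p><w:r><w:t>{body_text}</w:t></w:r></w:p></w:body>"
        "</w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", content_types)
        zf.writestr("word/document.xml", document)
    return buf.getvalue()


def raw_hits(docx_bytes, word):
    """Count case-insensitive plain-text occurrences of `word` in each member."""
    needle = word.lower().encode("utf-8")
    hits = {}
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        for name in zf.namelist():
            count = zf.read(name).lower().count(needle)
            if count:
                hits[name] = count
    return hits


sample = make_docx_like("Nothing about the internet here.")
print(raw_hits(sample, "web"))  # the only hit is in [Content_Types].xml
```

The document text never mentions "web", yet the archive still matches because of the `webSettings.xml` part name.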
Author of Total Commander
https://www.ghisler.com
- ghisler(Author)
Well, there are many xml office files out there, e.g. docx, xlsx, openoffice, libreoffice etc. How would I know which of the xml files I need to search?
2ghisler(Author)
I think the integrated function to search such files should work well. Otherwise it would be better to leave it to plugin authors.
All these formats are officially documented, but for a pure text search it's not rocket science to research these formats yourself. A good strategy is to create files containing weird words and then perform a search. You'll find the right files, or even the right values in an XML element, quite fast.
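The "weird word" strategy can be sketched as follows. This is only an illustration under stated assumptions: `locate_marker` is a hypothetical helper, and the sample archive is a hand-built, ODT-like ZIP rather than a file saved by a real office suite. It reports which archive member and which XML element carry the marker:

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def locate_marker(archive_bytes, marker):
    """Return (member name, element tag) pairs whose element text contains `marker`."""
    found = []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith(".xml"):
                continue  # skip non-XML members such as "mimetype"
            try:
                root = ET.fromstring(zf.read(name))
            except ET.ParseError:
                continue  # not well-formed XML, ignore it
            for el in root.iter():
                if el.text and marker in el.text:
                    found.append((name, el.tag))
    return found


# Hand-built ODT-like archive containing the weird marker word "zxqv".
content = (
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    "<office:body><office:text><text:p>zxqv marker</text:p></office:text></office:body>"
    "</office:document-content>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    zf.writestr("meta.xml", "<meta>no marker here</meta>")
    zf.writestr("content.xml", content)
odt_like = buf.getvalue()

for member, tag in locate_marker(odt_like, "zxqv"):
    print(member, tag)
```

Run against a real file that contains a marker word, the same loop would show where that format keeps its text. Text stored in an element's tail (after a child element) is not checked in this sketch.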
[quote="ghisler(Author)"]Well, there are many xml office files out there, e.g. docx, xlsx, openoffice, libreoffice etc. How would I know which of the xml files I need to search?[/quote]
Yes, you are right. I just checked the docx and saw that all the document content is also XML content. Clearly, to search only in the content of the XML, the file has to be parsed and the XML document has to be built. I suppose that Total Commander just searches the XML files as plain text files.
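For comparison, here is a minimal sketch of the parsing approach, assuming the body text lives in `w:t` elements of `word/document.xml`. It is not how Total Commander or any plugin works; `docx_body_text`, `contains_word` and `make_sample` are hypothetical names, and headers, footnotes and other parts are ignored:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_body_text(docx_bytes):
    """Return the text of word/document.xml, one line per paragraph."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    lines = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A word can be split across several <w:t> runs, so join them first.
        lines.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return "\n".join(lines)


def contains_word(docx_bytes, word):
    """Case-insensitive search in the extracted body text only."""
    return word.lower() in docx_body_text(docx_bytes).lower()


def make_sample(paragraphs):
    """Build a tiny docx-like ZIP; each paragraph is a list of run texts."""
    body = "".join(
        "<w:p>" + "".join(f"<w:r><w:t>{run}</w:t></w:r>" for run in para) + "</w:p>"
        for para in paragraphs
    )
    document = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "[Content_Types].xml",
            '<Types><Override PartName="/word/webSettings.xml"/></Types>',
        )
        zf.writestr("word/document.xml", document)
    return buf.getvalue()


sample = make_sample([["Quarterly ", "re", "port"], ["Budget"]])
print(contains_word(sample, "web"))     # False
print(contains_word(sample, "report"))  # True
```

Joining the runs of each paragraph matters because Word may split a single word over several runs, which a plain-text search over the raw XML would miss.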
- ghisler(Author)
Yes, it does. Office XML files are some of the most complex files in existence; not even OpenOffice/LibreOffice is able to import everything correctly. Parsing them correctly would be far beyond the possibilities of a simple file manager.
In that case I simply use Windows Search.
On my Windows 10 system it finds the contents of doc and docx files without any problems via the IFilters.
I can then jump directly to the path of the found file in TC via the clipboard.
I do this in TC with a button like this:
cmd=c:\tools\NirSoft\nircmd.exe exec show %COMMANDER_EXE% /O /S /A /L="~$clipboard$
I get the path from the Windows search results via the context menu,
using the tool ShimExt.
The TextSearch plugin does the tasks mentioned above (it really converts files to text).
The question is what is the right way to go?
- Okay to have 'basic' solution in TC, use plugin if built-in is not enough
- Full implementation required in TC, basic function is awkward
- Basic function is awkward so better remove it, use plugin
[quote="MVV"]I think a basic embedded solution would be enough; a complete parser may be done as a plugin.[/quote]
There is a way to use the Windows Desktop search results.
I suggest doing this instead of complicated TC solutions or plugins.
It is implemented in xplorer² ultimate, for example.
http://zabkat.com/tour3.htm
I think the OT problem is rare.
99 percent of different words can be found w/o triggering ambiguous results.
But even if not: you can still use some RegEx search or mask, to find your term only if it's not in between some tag, or only if it's near different words, etc.
So I think the built-in search is fine for now, and for a refined search you can use plug-ins.
But IMO we should see a hint for the OT problem in the help file.
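As a rough illustration of the "not in between some tag" idea, here is a sketch using Python's `re` syntax. Whether a particular search engine supports lookahead is not assumed here, and `outside_tag_pattern` is a hypothetical helper. The negative lookahead rejects a match when the next angle bracket after it is a closing `>`, which means the match sits inside a tag:

```python
import re


def outside_tag_pattern(word):
    """Match `word` only when it is not inside <...> markup."""
    # (?![^<>]*>) fails if a '>' comes before any other angle bracket,
    # i.e. if the match lies inside a tag name or attribute value.
    return re.compile(re.escape(word) + r"(?![^<>]*>)", re.IGNORECASE)


markup_only = '<Override PartName="/word/webSettings.xml"/>'
with_text = markup_only + "<w:t>The web is big</w:t>"

pattern = outside_tag_pattern("web")
print(bool(pattern.search(markup_only)))  # False
print(pattern.findall(with_text))         # ['web']
```

This heuristic still treats the XML as plain text, so it can miss a word that is split across several elements, and it can wrongly reject a hit when the text itself contains an unescaped `>`.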
BTW, I'm currently working on supporting Oracle OiT Content/Text Access for PCREsearch, and enabling full wdx text search for it,
especially since Christian confirmed Unicode wdx text search for TC 9.
It would give us reliable and stable access to text content (of course, it has its limits too).
It would be nice if all wdx fields with full text search would appear in the main search dialog's tab, for easier access, and for showing the user some alternate engine to use.
TC plugins: PCREsearch and RegXtract