Preliminary information about Unicode support (TC7.5)
- ghisler(Author)
- Site Admin
- Posts: 50390
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
- Contact:
Yes it is:
#define ft_fulltextw 12
ft_fulltextw New in 2.11: Same as ft_fulltext, but with UTF-16 encoding. May be returned instead of ft_fulltext.
Check for version 2.11 in function ContentSetDefaultParams before returning ft_fulltextw. It should work to mix ft_fulltext and ft_fulltextw: Return ft_fulltext for the field type, and then return either ft_fulltext or ft_fulltextw for the files TC tries to search.
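A minimal sketch of the version check described above, not taken from the plugin SDK itself. The struct below is a stand-in for ContentDefaultParamStruct, with field names as I recall them from contplug.h; the flag and helper names are hypothetical. It assumes interface version 2.11 is reported as Hi = 2, Low = 11, and that ft_fulltext has the value 9 (only ft_fulltextw = 12 is stated in this thread), so please verify both against your header.

```c
#include <stdint.h>

/* Stand-in for ContentDefaultParamStruct (assumed layout, verify against
   your copy of contplug.h). */
typedef struct {
    int      size;
    uint32_t PluginInterfaceVersionLow;
    uint32_t PluginInterfaceVersionHi;
    char     DefaultIniName[260];
} DefaultParamsSketch;

/* ft_fulltextw = 12 is from the post above; ft_fulltext = 9 is an assumption. */
enum { FT_FULLTEXT = 9, FT_FULLTEXTW = 12 };

/* Hypothetical plugin-global flag, set once when TC reports its version. */
static int g_host_has_fulltextw = 0;

/* Intended to be called from ContentSetDefaultParams. */
static void remember_host_version(const DefaultParamsSketch *dps)
{
    uint32_t hi = dps->PluginInterfaceVersionHi;
    uint32_t lo = dps->PluginInterfaceVersionLow;
    /* True for interface version 2.11 or newer. */
    g_host_has_fulltextw = (hi > 2) || (hi == 2 && lo >= 11);
}

/* Type to return from ContentGetValue for full-text data. */
static int fulltext_type_for_host(void)
{
    return g_host_has_fulltextw ? FT_FULLTEXTW : FT_FULLTEXT;
}
```

With this split, the field type can stay ft_fulltext while the per-file return value falls back to ANSI on older hosts.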
Author of Total Commander
https://www.ghisler.com
I did some early tests implementing ft_fulltextw for my APK-wdx plugin and experienced crashes too, just like
http://www.ghisler.ch/board/viewtopic.php?p=309435#309435
Questions:
The old wdx plugin guide stated:
"It writes the first block of maxlen-1 bytes to FieldValue and returns ft_fulltext. The data written must be a 0-terminated string!"
1)
I assume I still have to treat maxlen as the number of bytes, not as the number of UTF-16 characters, meaning I can effectively return only half this number of UTF-16 chars?
2)
Because for a zero-terminated UTF-16 string the terminating zero is a UTF-16 char (2 bytes) too, I assume we need to return maxlen-2 bytes of string data and set the last two bytes to zero to make the string null-terminated?
2a)
If this is the case, maxlen should actually be an even number, because otherwise you'd need to cut between the two bytes of a UTF-16 character. But this doesn't seem to be the case: maxlen is an odd number (see post below).
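One conservative reading of these questions, sketched in C. It is an assumption, not a confirmed rule: treat maxlen as a byte count, ignore the odd trailing byte, and reserve one full 2-byte code unit for the terminator. The function name is hypothetical and uint16_t stands in for WCHAR. This reading reproduces the byte counts reported later in the thread (maxlen 2047 gives 2044 payload bytes).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies as many whole UTF-16 code units as fit into maxlen bytes while
   leaving room for a 2-byte terminator. Returns the number of payload
   bytes written (terminator not counted). Does not check for split
   surrogate pairs. */
static size_t copy_utf16_chunk(const uint16_t *src, size_t src_units,
                               void *field_value, int maxlen)
{
    if (maxlen < 2)
        return 0;                         /* no room even for the terminator */

    size_t capacity_units = (size_t)maxlen / 2;  /* an odd last byte stays unused */
    size_t payload_units = capacity_units - 1;   /* one unit reserved for 0x0000 */
    if (payload_units > src_units)
        payload_units = src_units;

    const uint16_t zero = 0;
    memcpy(field_value, src, payload_units * 2);
    memcpy((unsigned char *)field_value + payload_units * 2, &zero, 2);
    return payload_units * 2;
}
```

Using memcpy for the terminator avoids assuming that the FieldValue pointer is 2-byte aligned.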
Last edited by milo1012 on 2016-06-15, 16:15 UTC, edited 2 times in total.
TC plugins: PCREsearch and RegXtract
I did some further tests, and it seems that there is some bug in the fulltext search procedure anyway, independent of the Unicode implementation.
For example, I stream a 14939 byte text to TC.
TC will call ContentGetValue eight times, with maxlen set to:
2047
2035
2035
2035
2035
2035
2035
2035
I returned ft_fulltext each time, and each time the string was correctly appended with a binary zero (i.e. returning maxlen-1 bytes of the text string at most).
So I actually streamed:
2046
2034
2034
2034
2034
2034
2034
689
bytes for each call.
For the 9th call of ContentGetValue I returned ft_fieldempty.
Now, I can find my search terms on any spot, with one exception: the last 689 bytes of the stream.
No matter what I try, TC won't find any single character from this area, although it was obviously correctly received.
It seems that the last text portion is just ignored or skipped in the search, although being zero-terminated and still returning ft_fulltext.
I tested this behavior to be the same in TC 8.x and TC 9.
Is there something wrong in my implementation, or is this an actual bug?
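The chunking described above can be sketched as follows. This is an illustration under assumptions, not the plugin's actual code: the function name is hypothetical, the constant values (ft_fulltext = 9, ft_fieldempty = -3) are as I recall them from contplug.h, and UnitIndex is assumed to be the byte offset of the next chunk, as the call traces later in the thread suggest.

```c
#include <stddef.h>
#include <string.h>

/* Assumed values; verify against contplug.h. */
enum { FT_FIELDEMPTY = -3, FT_FULLTEXT = 9 };

/* Writes the next zero-terminated ANSI chunk (at most maxlen-1 bytes)
   starting at byte offset unit_index. Returns FT_FULLTEXT while data
   remains and FT_FIELDEMPTY once the text is exhausted or on the
   cleanup call (unit_index == -1). */
static int next_ansi_chunk(const char *text, size_t text_len, int unit_index,
                           char *field_value, int maxlen)
{
    if (unit_index < 0 || maxlen < 2)
        return FT_FIELDEMPTY;
    if ((size_t)unit_index >= text_len)
        return FT_FIELDEMPTY;

    size_t n = text_len - (size_t)unit_index;   /* bytes still unsent */
    if (n > (size_t)maxlen - 1)
        n = (size_t)maxlen - 1;                 /* keep one byte for the 0 */

    memcpy(field_value, text + unit_index, n);
    field_value[n] = '\0';
    return FT_FULLTEXT;
}
```

Fed the maxlen values listed above, a 14939-byte text comes out as 2046 + 6 × 2034 + 689 bytes, and the ninth call returns the empty-field code.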
TC plugins: PCREsearch and RegXtract
- ghisler(Author)
Please try beta 2 for ft_fulltextw, I fixed a few bugs with the help of Lefteous' PDF plugin.
For ft_fulltext, did you return all string 0-terminated? And the one with 689 bytes needs to return ft_fulltext too!
Author of Total Commander
https://www.ghisler.com
ghisler(Author) wrote: For ft_fulltext, did you return all string 0-terminated? And the one with 689 bytes needs to return ft_fulltext too!
Yes, I terminated it correctly and I returned ft_fulltext for the last portion, like I said above.
ghisler(Author) wrote: Please try beta 2 for ft_fulltextw, I fixed a few bugs with the help of Lefteous' PDF plugin.
Indeed, Beta 2 seems to fix things, but:
The problem with the last string portion remains when I return ft_fulltext (ANSI), but it seems to work correctly when I return ft_fulltextw (UTF-16).
So whatever you did to fix the latter, you probably should apply it to the ANSI function as well.
I still need to investigate whether Beta 2 really finds all Unicode strings. What I found so far:
There seem to be differences between using the front page plugin search and the classic "plugin tab" search.
The latter finds strings that definitely do not exist in the text!
The front page plugin search looks good so far.
BTW, I realized that you have to tick "Unicode UTF-16" to make ft_fulltextw working on the new TC 9 front page plugin search.
Not sure if this makes sense for the average user, because you know the encoding of the text for this file "in advance", so you probably should search in it even if this option is not enabled.
On top of that, could you clarify what I asked above:
- using a single or double byte terminating zero for Unicode fulltext?
- still treat maxlen as number of bytes?
- if using double byte zero and maxlen representing the number of bytes: how do you expect to treat maxlen with an odd number?
(and in any case, these things should be stated clearly in the next wdx plugin guide version)
TC plugins: PCREsearch and RegXtract
- ghisler(Author)
OK, I will check it!
"Unicode UTF-16" shouldn't be necessary for ft_fulltextw...
Author of Total Commander
https://www.ghisler.com
Update: the problem with the last text portion seems to emerge only when using the classic "plugin tab" search, the front page plugin search seems to work okay (for ANSI and UTF-16).
This means this bug was also present in pre TC 9.
TC plugins: PCREsearch and RegXtract
- ghisler(Author)
milo1012 wrote: BTW, I realized that you have to tick "Unicode UTF-16" to make ft_fulltextw working on the new TC 9 front page plugin search.
Do you still have that problem? I cannot reproduce it here.
Author of Total Commander
https://www.ghisler.com
ghisler(Author) wrote: (quoting: "BTW, I realized that you have to tick "Unicode UTF-16" to make ft_fulltextw working on the new TC 9 front page plugin search.") Do you still have that problem? I cannot reproduce it here.
I can't remember the actual strings I searched for, but it definitely was some CJK string that was only found with that option enabled.
Maybe I can repeat it somehow.
I can now do some repeatable tests anyway, with my APK-wdx 2.1 plugin.
Just additionally register the plugin as a Lister plugin, view some apk file with Lister and search some non-ASCII string from that Lister output (the String pool) with TC 9 fulltext front page search.
Ah, I found an example with my plugin (use 32-bit TC):
Take the newest TC tcandroid272.apk
In there you have this Chinese string at the very end:
Code: Select all
Totalcmd 已不正常關閉...
With "Unicode UTF-16" disabled: TC does not find it.
Seems to happen with 32-bit only, the x64 seems to work as expected.
Update: The x64 version is bugged too, as I said above: it finds strings that are not there.
Modify the above string to:
Code: Select all
Totalcmd 已不正常關...
The same happens with some string from the "middle", e.g. take
Code: Select all
Chinese traditional (繁體中文)
and shorten it to:
Code: Select all
Chinese traditional (繁中文)
TC plugins: PCREsearch and RegXtract
- ghisler(Author)
So TC is finding the text with the UTF-16 search method, not with the plugin. It must be present in the file as plain text Unicode...
Author of Total Commander
https://www.ghisler.com
ghisler(Author) wrote: So TC is finding the text with the UTF-16 search method, not with the plugin. It must be present in the file as plain text Unicode...
If you're referring to the 2nd problem: No.
First of all: why does the x32 not find them? And 2nd:
While there are some visible UTF-16 strings, probably resulting from not compressing (just storing) some app files in the zip archive,
Code: Select all
Totalcmd 已不正常關...
Code: Select all
Chinese traditional (繁中文)
(the ellipses in the first string are part of the text!)
Also I can confirm the described behavior with some different apk files.
Concerning the first problem: This seems to happen with the last text portion only. All other strings work fine, including those not found as plain UTF-16 text in the apk file.
TC plugins: PCREsearch and RegXtract
- ghisler(Author)
I need a test plugin and test apk to verify that, I cannot confirm it with the plugins I tried, sorry.
Author of Total Commander
https://www.ghisler.com
Alright, I did some further tests, and:
Forget about the last string portion bug; it seems it was my fault, due to a wrong loop variable in my plugin. Unfortunately it was masked in x64 by the bug I describe below in detail.
The bug concerning checking "Unicode UTF-16" to make ft_fulltextw work was also related to my last string problem, as I was mostly testing with strings from the very end (which can also be found by TC's raw text search).
Which reminds me: it would be nice to know from which source the result came, i.e. whether a plugin search led to finding a specific file, or the normal TC raw text search.
Nevertheless, my tests confirmed the other bug in the x64 version, finding strings that are definitely not there:
ghisler(Author) wrote: I need a test plugin and test apk to verify that, I cannot confirm it with the plugins I tried, sorry.
It works with your very own TC for Android apk file and the plugin, both of which I already linked above:
http://tcce.s3.amazonaws.com/tcandroid272.apk
http://wincmd.ru/files/9924355/wdx_APK-wdx_2_1.rar
The bug seems to be triggered by spaces in the search string. The type of characters apparently doesn't matter.
Search for
Code: Select all
An unexpected error occurred
(which is one of the first strings in the apk after the filenames)
(standard search, no RegEx)
1st call: UnitIndex = 0, maxlen = 2047, I set everything up and copy the first 2044 bytes, return ft_fulltextw
2nd call: UnitIndex = 2044, maxlen = 2019, I copy 2016 bytes, return ft_fulltextw
TC stops after that, without an additional cleanup call with UnitIndex = -1, and shows the apk as containing the text, although it only searched the first 4k bytes and the string doesn't appear until ~39k bytes!
Slightly different but still wrong:
Search for
Code: Select all
unexpected error
1st call: UnitIndex = 0, maxlen = 2047, I set everything up and copy the first 2044 bytes, return ft_fulltextw
2nd call: UnitIndex = 2044, maxlen = 2031, I copy 2028 bytes, return ft_fulltextw
3rd call: UnitIndex = -1, I clean everything up and return ft_fieldempty
So TC again stops and shows the apk as containing the text, although it only searched the first 4k bytes, but at least it made a cleanup call!
Searching for any single word string, like for
Code: Select all
unexpected
in the above example, works as expected.
This is repeatable with any search string that contains one or two spaces, with two spaces triggering no cleanup call.
Also it seems that the words separated by spaces need to have some minimum length in order to trigger it (6 characters?).
I tested this with a clean TC 9.0 beta3 install in a new/clean virtual machine environment -> same result
So you definitely should be able to verify it for x64, and also verify that 32-bit TC does not show this behavior.
Like I said: in debug mode I was able to confirm that I send the correct string portions to TC. Even doing a simple MessagBoxW() on the FieldValue pointer (after copying the bytes) is enough to confirm this.
Strangely I wasn't able to trigger this behavior with Lefteous' xPDFSearch beta and the same text saved as a PDF file, though I don't know how he coded his plugin.
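For reference, the call sequence reported in this post (setup at UnitIndex = 0, continuation at a byte offset, cleanup at UnitIndex = -1) could be handled roughly like this. It is a sketch under assumptions, not the APK-wdx code: the session struct and all function names are hypothetical, text extraction is replaced by copying a caller-supplied buffer, uint16_t stands in for WCHAR, and ft_fieldempty = -3 is my recollection of contplug.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { FT_FIELDEMPTY = -3, FT_FULLTEXTW = 12 };  /* -3 is an assumed value */

typedef struct {
    uint16_t *text;   /* extracted UTF-16 text, owned by the session */
    size_t    bytes;  /* length of text in bytes */
} FulltextSession;

static void session_close(FulltextSession *s)
{
    free(s->text);
    s->text = NULL;
    s->bytes = 0;
}

/* Placeholder for real extraction: duplicates the given buffer. */
static int session_open(FulltextSession *s, const uint16_t *src, size_t units)
{
    if (units == 0)
        return 0;
    s->text = malloc(units * 2);
    if (s->text == NULL)
        return 0;
    memcpy(s->text, src, units * 2);
    s->bytes = units * 2;
    return 1;
}

/* One ContentGetValue-style step for a full-text field. */
static int fulltextw_step(FulltextSession *s, const uint16_t *src, size_t units,
                          int unit_index, void *field_value, int maxlen)
{
    if (unit_index == -1) {               /* cleanup call from the host */
        session_close(s);
        return FT_FIELDEMPTY;
    }
    if (unit_index == 0) {                /* first call: (re)start extraction */
        session_close(s);
        if (!session_open(s, src, units))
            return FT_FIELDEMPTY;
    }
    if (s->text == NULL || unit_index < 0 || (unit_index & 1) ||
        (size_t)unit_index >= s->bytes || maxlen < 4) {
        session_close(s);                 /* end of text or unusable request */
        return FT_FIELDEMPTY;
    }
    size_t room = ((size_t)maxlen / 2 - 1) * 2;  /* whole units minus terminator */
    size_t n = s->bytes - (size_t)unit_index;
    if (n > room)
        n = room;
    memcpy(field_value, (const unsigned char *)s->text + unit_index, n);
    memset((unsigned char *)field_value + n, 0, 2);
    return FT_FULLTEXTW;
}
```

Because the session frees its buffer both on the cleanup call and when the text runs out, a missing UnitIndex = -1 call only leaks until the next search starts at UnitIndex = 0.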
TC plugins: PCREsearch and RegXtract
This bug is still present in TC 9 beta 5 x64.
Now that I have an (optional Oracle OiT) fulltext search in PCREsearch, the same behavior as with the APK file shows.
Just use e.g. this odt file (which contains the very same text from the APK) and search with the x64 plug-in [=pcresearch.Oracle Outside In fulltext search]
for the same sample strings that I used above:
Code: Select all
Totalcmd 已不正常關...
Chinese traditional (繁中文)
TC finds them, although they are not present in the file at all!
And like I said: TC 32-bit works fine.
TC plugins: PCREsearch and RegXtract
- ghisler(Author)
Sorry, where do I find this oracle plugin? I do not have this problem with my own plugins, or PDF plugin.
Author of Total Commander
https://www.ghisler.com