Preliminary information about Unicode support (TC7.5)

Post by *ghisler(Author) » 2016-06-10, 18:58 UTC

Yes it is:
#define ft_fulltextw 12

ft_fulltextw New in 2.11: Same as ft_fulltext, but with UTF-16 encoding. May be returned instead of ft_fulltext.

Check for version 2.11 in function ContentSetDefaultParams before returning ft_fulltextw. It should work to mix ft_fulltext and ft_fulltextw: Return ft_fulltext for the field type, and then return either ft_fulltext or ft_fulltextw for the files TC tries to search.
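A minimal sketch of that version check in C. It assumes TC reports the interface version as a high/low pair (2 and 11 for version 2.11) through the struct passed to ContentSetDefaultParams. VersionInfo is a simplified stand-in for that struct, and remember_version, host_supports_fulltextw and fulltext_type_for_file are hypothetical helper names, not part of the plugin interface. The value used for ft_fulltext is also an assumption and should be checked against the SDK header.

```c
#include <assert.h>
#include <stddef.h>

#define ft_fulltext   9   /* assumed value; verify against the SDK header */
#define ft_fulltextw 12   /* value quoted in the post above */

/* Simplified stand-in for the version fields TC passes to
   ContentSetDefaultParams (assumption: hi = 2, lo = 11 for 2.11). */
typedef struct {
    unsigned long version_hi;
    unsigned long version_lo;
} VersionInfo;

static VersionInfo g_version = { 0, 0 };

/* Hypothetical helper: call it from ContentSetDefaultParams. */
void remember_version(const VersionInfo *v)
{
    assert(v != NULL);
    g_version = *v;
}

/* True if the host interface is 2.11 or newer. */
int host_supports_fulltextw(void)
{
    if (g_version.version_hi != 2)
        return g_version.version_hi > 2;
    return g_version.version_lo >= 11;
}

/* Field type to return for one file: UTF-16 only if the host knows it. */
int fulltext_type_for_file(int text_is_utf16)
{
    if (text_is_utf16 && host_supports_fulltextw())
        return ft_fulltextw;
    return ft_fulltext;
}
```

With an older host the helper always falls back to ft_fulltext, so the plugin would have to convert its text to ANSI in that case.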

Post by *milo1012 » 2016-06-14, 17:48 UTC

I did some early tests implementing ft_fulltextw for my APK-wdx plugin and experienced crashes too, just like those reported here:
http://www.ghisler.ch/board/viewtopic.php?p=309435#309435

Questions:
The old wdx plugin guide stated:
"It writes the first block of maxlen-1 bytes to FieldValue and returns ft_fulltext. The data written must be a 0-terminated string!"

1)
I assume I still have to treat maxlen as the number of bytes, not as the number of UTF-16 characters, meaning I can effectively return only half this number of UTF-16 chars?

2)
Because for a zero-terminated UTF-16 string the terminating zero is a UTF-16 char (2 bytes) too, I assume we need to return maxlen-2 bytes of string data and set the last two bytes to zero to make the string null-terminated?
2a)
If this is the case, maxlen should actually be an even number, or otherwise you'd need to cut between the two bytes of a UTF-16 character. But this doesn't seem to be the case: maxlen is an odd number (see post below).
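The assumptions in questions 1) to 2a) can be written down as a small C sketch. It is not confirmed behavior: it assumes maxlen counts bytes, that the terminator is one 16-bit zero, and that an odd maxlen is simply rounded down to the next even value. copy_utf16_chunk is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies as many UTF-16 code units as fit into a buffer of maxlen BYTES,
   leaving room for a 2-byte terminator. Returns the units copied.
   Assumption: maxlen is a byte count; an odd value is rounded down. */
size_t copy_utf16_chunk(void *field_value, int maxlen,
                        const uint16_t *src, size_t src_units)
{
    const uint16_t zero = 0;
    size_t capacity_units, n;

    assert(field_value != NULL && src != NULL);
    if (maxlen < 2)
        return 0;                        /* no room even for the terminator */

    capacity_units = (size_t)maxlen / 2 - 1;
    n = src_units < capacity_units ? src_units : capacity_units;

    /* memcpy avoids alignment assumptions about the target buffer */
    memcpy(field_value, src, n * sizeof(uint16_t));
    memcpy((unsigned char *)field_value + n * sizeof(uint16_t),
           &zero, sizeof zero);
    return n;
}
```

Under these assumptions a call with maxlen = 2047 carries 1022 code units (2044 bytes) plus the terminator. The sketch does not try to avoid cutting a surrogate pair in half at the chunk boundary.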

Post by *milo1012 » 2016-06-15, 14:30 UTC

I did some further tests, and it seems that there is a bug in the fulltext search procedure anyway, independent of the Unicode implementation.
For example, I stream a 14939 byte text to TC.
TC will call ContentGetValue eight times, with maxlen set to:
2047
2035
2035
2035
2035
2035
2035
2035
I returned ft_fulltext each time, and each time the string was correctly terminated with a binary zero (i.e. I returned at most maxlen-1 bytes of the text string).
So I actually streamed:
2046
2034
2034
2034
2034
2034
2034
689
bytes for each call.
For the 9th call of ContentGetValue I returned ft_fieldempty.
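The chunk sizes listed above follow from the maxlen-1 rule, as this small C replay shows. It is only a sketch of the arithmetic, not plugin code; ansi_chunk_len and replay_calls are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Payload bytes for one ft_fulltext call: at most maxlen-1 bytes,
   so that one byte is left for the terminating zero. */
size_t ansi_chunk_len(size_t total, size_t offset, int maxlen)
{
    size_t room, left;

    assert(maxlen >= 1 && offset <= total);
    room = (size_t)maxlen - 1;
    left = total - offset;
    return left < room ? left : room;
}

/* Replays a list of maxlen values against a text of `total` bytes and
   stores the size of every chunk that carries data. Returns the number
   of such calls; the next call would return ft_fieldempty. */
int replay_calls(size_t total, const int *maxlens, int ncalls,
                 size_t *chunks)
{
    size_t offset = 0;
    int i, used = 0;

    for (i = 0; i < ncalls; i++) {
        size_t n = ansi_chunk_len(total, offset, maxlens[i]);
        if (n == 0)
            break;                       /* nothing left to stream */
        chunks[used++] = n;
        offset += n;
    }
    return used;
}
```

For 14939 bytes and maxlen values of 2047 followed by 2035, this yields 2046, six times 2034, and finally 689 bytes, so the ninth call has nothing left to return.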

Now, I can find my search terms anywhere in the text, with one exception: the last 689 bytes of the stream.
No matter what I try, TC won't find any single character from this area, although it was obviously correctly received.
It seems that the last text portion is just ignored or skipped in the search, although being zero-terminated and still returning ft_fulltext.

I tested this behavior to be the same in TC 8.x and TC 9.
Is there something wrong in my implementation, or is this an actual bug?

Post by *ghisler(Author) » 2016-06-15, 21:14 UTC

Please try beta 2 for ft_fulltextw, I fixed a few bugs with the help of Lefteous' PDF plugin.

For ft_fulltext, did you return all strings 0-terminated? And the one with 689 bytes needs to return ft_fulltext too!

Post by *milo1012 » 2016-06-15, 21:59 UTC

ghisler(Author) wrote: For ft_fulltext, did you return all strings 0-terminated? And the one with 689 bytes needs to return ft_fulltext too!

Yes, I terminated it correctly and I returned ft_fulltext for the last portion, like I said above.

ghisler(Author) wrote: Please try beta 2 for ft_fulltextw, I fixed a few bugs with the help of Lefteous' PDF plugin.

Indeed, Beta 2 seems to fix things, but:

The problem with the last string portion remains when I return ft_fulltext (ANSI), but it seems to work correctly when I return ft_fulltextw (UTF-16).
So whatever you did to fix the latter, you probably should apply it to the ANSI function as well.

I still need to check whether Beta 2 really finds all Unicode strings. What I found so far:
There seem to be differences between using the front page plugin search and the classic "plugin tab" search.
The latter finds strings that definitely do not exist in the text!
The front page plugin search looks good so far.

BTW, I realized that you have to tick "Unicode UTF-16" to make ft_fulltextw work in the new TC 9 front page plugin search.
I'm not sure this makes sense for the average user: TC knows the encoding of the text for this file "in advance", so it should probably search it even if this option is not enabled.

On top of that, could you clarify what I asked above:
- using a single or double byte terminating zero for Unicode fulltext?
- still treat maxlen as number of bytes?
- if using a double byte zero and maxlen represents the number of bytes: how should an odd maxlen be treated?
(and in any case, these things should be stated clearly in the next wdx plugin guide version)

Post by *ghisler(Author) » 2016-06-15, 22:32 UTC

OK, I will check it!

"Unicode UTF-16" shouldn't be necessary for ft_fulltextw...

Post by *milo1012 » 2016-06-15, 22:40 UTC

Update: the problem with the last text portion seems to emerge only when using the classic "plugin tab" search, the front page plugin search seems to work okay (for ANSI and UTF-16).
This means the bug was already present in versions before TC 9.

Post by *ghisler(Author) » 2016-06-27, 15:23 UTC

milo1012 wrote: BTW, I realized that you have to tick "Unicode UTF-16" to make ft_fulltextw working on the new TC 9 front page plugin search.

Do you still have that problem? I cannot reproduce it here.

Post by *milo1012 » 2016-06-27, 17:58 UTC

ghisler(Author) wrote:
BTW, I realized that you have to tick "Unicode UTF-16" to make ft_fulltextw working on the new TC 9 front page plugin search.
Do you still have that problem? I cannot reproduce it here.

I can't remember the actual strings I searched for, but it definitely was some CJK string that was only found with that option enabled.
Maybe I can repeat it somehow.

I can now do some repeatable tests anyway, with my APK-wdx 2.1 plugin.
Just additionally register the plugin as a Lister plugin, view some apk file with Lister, and search for some non-ASCII string from that Lister output (the String pool) with the TC 9 fulltext front page search.

Ah, I found an example with my plugin (use TC 32-bit):
Take the newest TC for Android package, tcandroid272.apk.
In there you have this Chinese string at the very end:

Totalcmd 已不正常關閉...

Try to find it with UTF-16 enabled: works.
With UTF-16 disabled: does not work.
This seems to happen with 32-bit only; the x64 version seems to work as expected.

Update: The x64 version is bugged too, as I said above: it finds strings that are not there.
Modify the above string to:

Totalcmd 已不正常關...

This string is not actually in the String pool, but TC finds it in the x64 version anyway, whether UTF-16 is enabled or not!

The same happens with some string from the "middle", e.g. take

Chinese traditional (繁體中文)

modify it to

Chinese traditional (繁中文)

-> TC x64 still finds it!
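For comparison, a plain exact-match search over UTF-16 code units treats the modified string as absent. The C sketch below is only a reference check under that assumption (exact, case-sensitive matching) and says nothing about how TC searches internally. find_utf16 and the two arrays are hypothetical names; the arrays hold the bracketed part of the two strings above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Naive exact search over UTF-16 code units.
   Returns the index of the first match, or -1 if there is none. */
long find_utf16(const uint16_t *hay, size_t hay_len,
                const uint16_t *needle, size_t needle_len)
{
    size_t i, j;

    assert(hay != NULL && needle != NULL);
    if (needle_len == 0)
        return 0;
    if (needle_len > hay_len)
        return -1;

    for (i = 0; i + needle_len <= hay_len; i++) {
        for (j = 0; j < needle_len; j++)
            if (hay[i + j] != needle[j])
                break;
        if (j == needle_len)
            return (long)i;
    }
    return -1;
}

/* "(繁體中文)" as stored, and "(繁中文)" with one character removed */
const uint16_t text_original[] = { '(', 0x7E41, 0x9AD4, 0x4E2D, 0x6587, ')' };
const uint16_t text_modified[] = { '(', 0x7E41, 0x4E2D, 0x6587, ')' };
```

Searching text_original for text_modified returns -1 here, which is the result one would expect from the plugin search as well.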

Post by *ghisler(Author) » 2016-06-28, 20:24 UTC

So TC is finding the text with the UTF-16 search method, not with the plugin. It must be present in the file as plain text Unicode...

Post by *milo1012 » 2016-06-28, 21:15 UTC

ghisler(Author) wrote: So TC is finding the text with the UTF-16 search method, not with the plugin. It must be present in the file as plain text Unicode...

If you're referring to the 2nd problem: No.
First of all: why does the 32-bit version not find them? And second:
While there are some visible UTF-16 strings, probably resulting from not compressing (just storing) some app files in the zip archive,

Totalcmd 已不正常關...

and

Chinese traditional (繁中文)

are simply not present when I treat the apk file as UTF-16, which is easy to see with TC's Lister after switching to UTF-16 view ("6").
(the ellipses in the first string are part of the text!)

Also I can confirm the described behavior with some different apk files.

Concerning the first problem: This seems to happen with the last text portion only. All other strings work fine, including those not found as plain UTF-16 text in the apk file.

Post by *ghisler(Author) » 2016-06-28, 21:26 UTC

I need a test plugin and test apk to verify that, I cannot confirm it with the plugins I tried, sorry.

Post by *milo1012 » 2016-06-29, 00:42 UTC

Alright, I did some further tests, and:
Forget about the last string portion bug, it seems it was my fault due to a wrong loop variable in my plugin. Unfortunately it was masked in x64 by the bug I describe below in detail.
The problem of having to check "Unicode UTF-16" to make ft_fulltextw work was also related to my last string problem, as I was mostly testing with strings from the very end (which can also be found by TC's raw text search).
Which reminds me: it would be nice to know from which source the result came, i.e. if a plugin search led to finding a specific file, or the normal TC raw text search.

Nevertheless, my tests confirmed the other bug in the x64 version, finding strings that are definitely not there:

ghisler(Author) wrote: I need a test plugin and test apk to verify that, I cannot confirm it with the plugins I tried, sorry.

It works with your very own TC for Android apk file and the plugin, both of which I already mentioned above:
http://tcce.s3.amazonaws.com/tcandroid272.apk
http://wincmd.ru/files/9924355/wdx_APK-wdx_2_1.rar

The bug seems to be triggered by spaces in the search string. The type of characters apparently doesn't matter.
Search for

An unexpected error occurred

(which is one of the first strings in the apk after the filenames)
(standard search, no RegEx)
1st call: UnitIndex = 0, maxlen = 2047, I set everything up and copy the first 2044 bytes, return ft_fulltextw
2nd call: UnitIndex = 2044, maxlen = 2019, I copy 2016 bytes, return ft_fulltextw
TC stops after that without an additional cleanup call (UnitIndex = -1) and reports the apk as containing the text, although it only searched the first 4k bytes and the string doesn't appear until ~39k bytes!

Slightly different but still wrong:
Search for

unexpected error

1st call: UnitIndex = 0, maxlen = 2047, I set everything up and copy the first 2044 bytes, return ft_fulltextw
2nd call: UnitIndex = 2044, maxlen = 2031, I copy 2028 bytes, return ft_fulltextw
3rd call: UnitIndex = -1, I clean everything up and return ft_fieldempty
So TC again stops and shows the apk as containing the text, although it only searched the first 4k bytes, but at least it made a cleanup call!
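The two call sequences above differ only in whether the cleanup call arrives. A small C tracker like the following could be used to log that from inside a plugin. It is a sketch that assumes the UnitIndex meaning seen in the traces (0 on the first call, a byte offset on later calls, -1 for cleanup); FulltextTrace, classify_unit_index and trace_call are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CALL_FIRST, CALL_CONTINUE, CALL_CLEANUP } CallKind;

typedef struct {
    int open;           /* 1 while a search still holds plugin resources */
    int data_calls;     /* calls that asked for text */
    int cleanup_calls;  /* calls with UnitIndex = -1 */
    int leaked;         /* searches that never received a cleanup call */
} FulltextTrace;

/* Maps UnitIndex to the phase seen in the traces above (assumption). */
CallKind classify_unit_index(int unit_index)
{
    if (unit_index == -1)
        return CALL_CLEANUP;
    if (unit_index == 0)
        return CALL_FIRST;
    return CALL_CONTINUE;
}

/* Records one ContentGetValue call for a fulltext field. */
void trace_call(FulltextTrace *t, int unit_index)
{
    assert(t != NULL);
    switch (classify_unit_index(unit_index)) {
    case CALL_FIRST:
        if (t->open)
            t->leaked++;    /* previous search ended without cleanup */
        t->open = 1;
        t->data_calls++;
        break;
    case CALL_CONTINUE:
        t->data_calls++;
        break;
    case CALL_CLEANUP:
        t->open = 0;
        t->cleanup_calls++;
        break;
    }
}
```

Feeding it the first sequence (0, 2044, then 0 for the next file) leaves leaked at 1, while the second sequence (0, 2044, -1) ends with open at 0 and nothing leaked.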

Searching for any single word string, like for

unexpected

in the above example, works as expected.

This is repeatable with any search string that contains one or two spaces; with two spaces, TC makes no cleanup call.
Also it seems that the words separated by spaces need to have some minimum length in order to trigger it (6 characters?).

I tested this with a clean TC 9.0 beta3 install in a new/clean virtual machine environment -> same result
So you definitely should be able to verify it for x64, and also verify that 32-bit TC does not show this behavior.
Like I said: in debug mode I was able to confirm that I send the correct string portions to TC. Even doing a simple MessagBoxW() on the FieldValue pointer (after copying the bytes) is enough to confirm this.

Strangely I wasn't able to trigger this behavior with Lefteous' xPDFSearch beta and the same text saved as a PDF file, though I don't know how he coded his plugin.

Post by *milo1012 » 2016-07-13, 20:48 UTC

This bug is still present in TC 9 beta 5 x64.

Now that I have an (optional Oracle OiT) fulltext search in PCREsearch, the same behavior as with the APK file shows up.
Just use e.g. this odt file (which contains the very same text from the APK) and search with the x64 plug-in [=pcresearch.Oracle Outside In fulltext search]
for the same sample strings that I used above:

Totalcmd 已不正常關...
Chinese traditional (繁中文)

TC finds them, although they are not present in the file at all!
And like I said: TC 32-bit works fine.

Post by *ghisler(Author) » 2016-07-13, 21:15 UTC

Sorry, where do I find this oracle plugin? I do not have this problem with my own plugins, or PDF plugin.