[9.0 b14 x64] TC - file rename tool and diacritic

stusek · Post by *stusek » 2016-09-15, 13:07 UTC

In this version and the previous one, renaming files with search and replace doesn't work correctly. I can't replace characters with diacritics: letters with diacritics aren't found and replaced in file names...

Post by *ghisler(Author) » 2016-09-16, 08:26 UTC

Not confirmed. Please give me more details:
1. The name of the file you try to change
2. The "Search for" string
3. The "Replace with" string
4. Your language and country settings in Windows (e.g. English, USA)

stusek · Post by *stusek » 2016-09-16, 09:15 UTC

1.
Požadavek na ukončení zaměstnance.xls
Požadavek na nového zaměstnance.xls

2. ž and other characters (ěčí)
3. z (eci)
4. My language is Czech, the file names are Czech, and Windows uses the Czech localization

After applying the replacement, the file name is not changed and the characters in the file name are shown the same as before (the left and right sides of the preview window show the same name, although the find input contains ž and the replace input contains z).

Then I renamed the files manually. When I pressed backspace to delete the character "ž" from the file name, the character was deleted, but a "z" was shown in its place. The "z" was not shown before I deleted the "ž".

The same issue occurred with the other diacritic characters.

Maybe the search/replace dialog adds the non-diacritic variant to the file name at the correct positions, but it is not shown and the diacritic characters are still displayed.

I tested this character replacement a second time, and then renamed the files manually in the classic Windows Explorer, with the same behaviour. After deleting "ž", a "z" is shown in the same position...

I received these files by e-mail. When I create a new blank txt file on my PC with the same file name, renaming works correctly.

I examined the files at hand and found that these Office files were created in Microsoft Office for Macintosh.

I don't know whether you can correct this problem in TC, or whether the issue is in the Windows Explorer functions (bad character encoding from the Mac).

Thank you, and sorry for my bad English...

Post by *ghisler(Author) » 2016-09-16, 09:51 UTC

The problem is that the ž in the name is not the same as the ž you are using for search+replace:

The first is a z followed by a combining caron (a reversed ^ character): Unicode code points U+007A and U+030C.

The second is a single precomposed character with Unicode code point U+017E.

The former is mainly used on MacOS, the latter on Windows.

What you can do is create a search+replace rule with both types:
Search for: ž|ž
Replace with: z|z
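
The difference between the two spellings can also be checked outside TC. The following is only a minimal Python sketch using the standard `unicodedata` module; the helper name `replace_z_caron` is hypothetical and not part of TC. It shows that the two forms of ž are different strings and applies the "both types" replacement described above:

```python
import unicodedata

decomposed = "z\u030c"   # z + COMBINING CARON (the form mainly used on MacOS)
precomposed = "\u017e"   # single character LATIN SMALL LETTER Z WITH CARON

def replace_z_caron(name: str) -> str:
    """Replace either spelling of ž with a plain z (hypothetical helper)."""
    return name.replace(decomposed, "z").replace(precomposed, "z")

# The two strings render alike but do not compare equal.
print(decomposed == precomposed)
# Code points of the decomposed form.
print([f"U+{ord(c):04X}" for c in decomposed])
# Composing the decomposed form (NFC) yields the precomposed character.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(replace_z_caron("Po" + decomposed + "adavek"))
```

A search for only one of the two spellings misses the other, which matches the behaviour reported at the start of the thread.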

milo1012 · Post by *milo1012 » 2016-09-16, 16:15 UTC

We should call this character encoding problem by its name:
Unicode normalization

One solution is to occasionally scan all file names on disk for the NFD form with my NFCname plug-in and use MRT to convert all such names to the NFC form before doing any other rename operation.
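
To illustrate the idea of scanning for names that are not in NFC and converting them, here is a rough Python sketch. It is not the NFCname plug-in and not MRT; the function names `names_needing_nfc` and `rename_to_nfc` are made up for this example, and it assumes that the standard `unicodedata` module is an acceptable stand-in for the normalization step:

```python
import os
import unicodedata

def names_needing_nfc(names):
    """Return (old, new) pairs for names that are not already in NFC."""
    pairs = []
    for name in names:
        nfc = unicodedata.normalize("NFC", name)
        if nfc != name:
            pairs.append((name, nfc))
    return pairs

def rename_to_nfc(directory):
    """Rename entries of one directory to their NFC form (hypothetical helper)."""
    for old, new in names_needing_nfc(os.listdir(directory)):
        target = os.path.join(directory, new)
        if os.path.exists(target):
            continue  # do not overwrite an existing NFC-named file
        os.rename(os.path.join(directory, old), target)
```

Running `names_needing_nfc` first and reviewing the pairs before renaming anything is probably the safer way to use such a script.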

Post by *ghisler(Author) » 2016-09-18, 10:31 UTC

OK, I will add a new placeholder to convert all composite Unicode characters (e.g. separate a and ^) to precomposed characters (â, with accent). The user will have to write:
[N]
instead of
[N]
for this conversion.

milo1012 · Post by *milo1012 » 2016-09-18, 19:44 UTC

ghisler(Author) wrote: OK, I will add a new placeholder to convert all composite Unicode characters (e.g. separate a and ^) to precomposed characters...

Good to hear.
But I'm curious: what function or lib do you want to use?
Originally I wanted to use a big static lookup table of character replacements, but I couldn't find one - at least not one that covers the complete Unicode range - and some people said that this isn't possible anyway, due to the number of possible combinations, or when a wild mixture of different normalization forms is used in the file name. Additionally, such tables might need an update whenever a newer Unicode standard adds new characters.
You can see that even converter tools like
http://www.w3.org/International/charlint/
don't use simple lookup tables.

When I started my plug-in, I used IsNormalizedString and NormalizeString, but these functions exist on Vista and higher only.
So I switched to the official ICU lib (International Components for Unicode), but it adds quite a bunch of code, so the plug-in is nearly one MB in size.
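
One reason a simple replacement table falls short can be shown with a small Python sketch. The table `PAIRS` and the function `naive_compose` are invented for this illustration (they are not taken from any plug-in or library); the comparison uses the standard `unicodedata.normalize`. With two combining marks on one base letter, a left-to-right pair lookup depends on the order of the marks, while real normalization reorders them first:

```python
import unicodedata

# Hypothetical, deliberately tiny table: (base, combining mark) -> precomposed.
PAIRS = {
    ("a", "\u0302"): "\u00e2",       # a + circumflex -> â
    ("a", "\u0323"): "\u1ea1",       # a + dot below  -> ạ
    ("\u1ea1", "\u0302"): "\u1ead",  # ạ + circumflex -> ậ
}

def naive_compose(text: str) -> str:
    """Combine adjacent pairs left to right using only the table above."""
    out = []
    for ch in text:
        if out and (out[-1], ch) in PAIRS:
            out[-1] = PAIRS[(out[-1], ch)]
        else:
            out.append(ch)
    return "".join(out)

canonical = "a\u0323\u0302"  # dot below first, then circumflex
swapped = "a\u0302\u0323"    # same marks in the other order

# The table lookup matches NFC only when the marks arrive in canonical order.
print(naive_compose(canonical) == unicodedata.normalize("NFC", canonical))
print(naive_compose(swapped) == unicodedata.normalize("NFC", swapped))
```

Covering such cases with a table alone would need extra entries or a reordering step, which is roughly what a full normalization algorithm does.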

Post by *ghisler(Author) » 2016-09-21, 19:41 UTC

I'm using FoldString with the option MAP_PRECOMPOSED. It's available on NT-based systems only, but I'm loading it dynamically - and I don't think that this has any relevance on Windows 9x/ME.
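
For readers who want to try the same Windows call, here is a hedged Python sketch via `ctypes`. It is not TC's code: the wrapper name `precompose` is hypothetical, the flag value 0x20 for MAP_PRECOMPOSED is taken from the Windows SDK headers, and on non-Windows systems the sketch falls back to `unicodedata.normalize("NFC", ...)`, which is an assumption of this example and may not give identical results for every character:

```python
import sys
import unicodedata

MAP_PRECOMPOSED = 0x00000020  # flag value as defined in the Windows SDK (winnls.h)

def precompose(name: str) -> str:
    """Map base character + combining mark to a precomposed character."""
    if sys.platform != "win32":
        # Fallback for this sketch only; not the same implementation.
        return unicodedata.normalize("NFC", name)

    import ctypes
    from ctypes import wintypes

    # Load kernel32 at run time, loosely mirroring the dynamic loading above.
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    fold = kernel32.FoldStringW
    fold.argtypes = [wintypes.DWORD, wintypes.LPCWSTR, ctypes.c_int,
                     wintypes.LPWSTR, ctypes.c_int]
    fold.restype = ctypes.c_int

    # First call: ask for the required buffer size (cchSrc = -1 means the
    # source is null-terminated; the count includes the terminator).
    needed = fold(MAP_PRECOMPOSED, name, -1, None, 0)
    if needed == 0:
        raise ctypes.WinError(ctypes.get_last_error())

    buf = ctypes.create_unicode_buffer(needed)
    if fold(MAP_PRECOMPOSED, name, -1, buf, needed) == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buf.value

print(precompose("Poz\u030cadavek.xls") == "Po\u017eadavek.xls")
```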

milo1012 · Post by *milo1012 » 2016-09-21, 22:03 UTC

Thx for the info.
If I read it correctly, this function provides the full normalization only on Vista and later, so on XP/2000 we're probably stuck with Unicode < 4.0. So it's not fully portable (in terms of functionality) between different OSes.

Still, it's working as it should and is probably good enough for most basic diacritics (but maybe not for CJK characters).

Post by *ghisler(Author) » 2016-09-23, 19:31 UTC

Well, it's the best I could find. And since XP is end of life, there isn't really much to complain about...

redfox · Post by *redfox » 2016-09-24, 09:07 UTC

XP will be among us for a very long time yet. Only MS needed to force new Win versions, which are all worse except for Win7.

Horst.Epp · Post by *Horst.Epp » 2016-09-24, 10:36 UTC

redfox wrote: XP will be among us for a very long time yet. Only MS needed to force new Win versions, which are all worse except for Win7.

You are right for some poor people, but that should no longer drive design decisions.

j7n · Post by *j7n » 2016-09-24, 16:56 UTC

On Windows 2K/XP, the caron and other combining diacritics appear misaligned and can be selected as separate symbols (and may show up as a box character in older or less complete fonts), so the nature of the problem is immediately clear. I've only encountered such filenames in the last few years. Mac "thinks differently"...