Multi-rename tool, to replace non-English characters.

JimmyTheBroker
Member
Posts: 179
Joined: 2017-06-07, 05:22 UTC

Post by *JimmyTheBroker »

Hi guys,


I have a few files with non-English characters in their names, e.g.:
キャッ】39みゅトフードーじっく!【ドーじっオリ


I have no idea what it means, but they appear throughout some of the filenames.
Any idea how I can remove them? (Using the Multi-Rename Tool, dealing with a few hundred files.)

thanks,
Jimmy 8)
I finally get notifications from emails again!!!
So happy!
ghisler(Author)
Site Admin
Posts: 48021
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

Currently you can't - there are too many for simple search+replace. A content plugin would be nice.
Author of Total Commander
https://www.ghisler.com
Stefan2
Power Member
Posts: 4133
Joined: 2007-09-13, 22:20 UTC
Location: Europa

Post by *Stefan2 »

I don't know if that really works with "non-English characters",
but you can search and replace many different characters at once with the MRT.


Press the F1-key while in the MRT and read:
Example: Replace Umlauts+Accents:
Search for: ä|ö|ü|é|è|ê|à
Replace with: ae|oe|ue|e|e|e|a

Maybe that works for you too?
Search for: キ|ャ|ッ|】|3|9|...
Replace with: _|_|_|_|_|_|...


- - -

Oh, too late, Mr. Ghisler already answered.
But maybe you can just run that search&replace a few times with different chars....
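For anyone who wants to try the paired search/replace idea outside the MRT, here is a minimal Python sketch. It is not from the thread: the function name `replace_pairs` is made up, and it simply assumes the pairs are applied one after another from left to right, which may or may not match what the MRT does internally.

```python
def replace_pairs(name: str, search: str, replace: str) -> str:
    """Apply a '|'-separated search list against a '|'-separated replace list.

    Assumption: pairs are applied sequentially, left to right.
    """
    targets = search.split("|")
    substitutes = replace.split("|")
    if len(targets) != len(substitutes):
        # A mismatch would silently shift every later pair, so refuse it.
        raise ValueError("search and replace lists must have the same length")
    for target, substitute in zip(targets, substitutes):
        name = name.replace(target, substitute)
    return name


print(replace_pairs("Müller café.txt", "ä|ö|ü|é|è|ê|à", "ae|oe|ue|e|e|e|a"))
# prints: Mueller cafe.txt
```

The length check is there because a search list that is one item longer than the replace list is an easy mistake to make when the lists get long.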



 
ghisler(Author)
Site Admin
Posts: 48021
Joined: 2003-02-04, 09:46 UTC
Location: Switzerland

Post by *ghisler(Author) »

The characters are Japanese phonetic characters called Katakana. There are half width and full width forms. Here are the charts from Unicode.org:
https://www.unicode.org/charts/PDF/U30A0.pdf
https://www.unicode.org/charts/PDF/UFF00.pdf

There are 96 different character codes for full width Katakana alone (including those with " and ° modifiers). You could do Search/Replace for these, but it would take a while to create. A small content plugin would be better.

And then there is the second phonetic alphabet, Hiragana, with about the same number of characters.

These could both be converted to Latin quite easily. But the ideograms (Chinese characters) also used in Japanese can't: they are read differently depending on where they are used.
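To make the code ranges concrete, here is a small Python sketch (not from the thread) that tells which kind of kana a character is. The block boundaries are taken from the Unicode code charts; the name `kana_kind` is made up for illustration, and no conversion to Latin is attempted here.

```python
from typing import Optional

# Unicode block boundaries (inclusive), as listed in the Unicode code charts.
KANA_RANGES = {
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "halfwidth katakana": (0xFF65, 0xFF9F),
}


def kana_kind(ch: str) -> Optional[str]:
    """Return the kana block a single character belongs to, or None."""
    code = ord(ch)
    for kind, (low, high) in KANA_RANGES.items():
        if low <= code <= high:
            return kind
    return None


for ch in "キみ3【":
    print(ch, kana_kind(ch))
```

Characters such as 【 or plain digits come back as None, so a rename script could treat them separately from the kana.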
gdpr deleted 6
Power Member
Posts: 872
Joined: 2013-09-04, 14:07 UTC

Post by *gdpr deleted 6 »

Keep in mind that if you want to reduce your file names to only the ASCII characters, you might end up with the same (ASCII) name for different files in the same directory.

While you could manually take care of such an issue in the MRT, this could become burdensome if "hundreds" of files are involved in a MRT operation.

For so many files, it probably is better to write a simple, tailor-made script (Powershell/.NET or any other preferred scripting language of your choice) that does the renaming while automatically handling possible file name collisions the way you want.

In such a script, each file name would be stripped of any character outside the ASCII range (code point >= 0x80, i.e., non-English characters) or, if you prefer, outside the "extended ASCII" range (code point >= 0x100). Before renaming a file, the script would check whether the new name collides with an existing file and handle that situation according to your requirements.

Just because you have Total Commander running does not mean you _have_ to use Total Commander to do _any_ task. Choose the right tool for the job. You have more than one hammer in your toolkit.
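A rough Python sketch of such a script, under a few assumptions that are not from the thread: characters at 0x80 and above are simply dropped, collisions are resolved by appending " (1)", " (2)" and so on, and a name that ends up empty becomes "unnamed". All function names are made up.

```python
from pathlib import Path


def ascii_only(text: str) -> str:
    """Drop every character whose code point is 0x80 or above."""
    return "".join(ch for ch in text if ord(ch) < 0x80)


def unique_target(directory: Path, stem: str, suffix: str) -> Path:
    """Return a path that does not exist yet, adding ' (n)' if needed."""
    candidate = directory / f"{stem}{suffix}"
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem} ({counter}){suffix}"
        counter += 1
    return candidate


def rename_to_ascii(directory: Path) -> list:
    """Rename all files in `directory`; return (old name, new name) pairs."""
    renamed = []
    # sorted() takes a snapshot, so renames do not disturb the iteration.
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        stem = ascii_only(path.stem).strip() or "unnamed"  # assumed fallback
        suffix = ascii_only(path.suffix)
        if stem + suffix == path.name:
            continue  # already pure ASCII, nothing to do
        target = unique_target(directory, stem, suffix)
        path.rename(target)
        renamed.append((path.name, target.name))
    return renamed


# Example call (hypothetical folder, try it on a copy of your files first):
# rename_to_ascii(Path("some_folder"))
```

Changing the `< 0x80` test to `< 0x100` would keep the "extended ASCII" characters such as umlauts and accents.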
JimmyTheBroker
Member
Posts: 179
Joined: 2017-06-07, 05:22 UTC

Post by *JimmyTheBroker »

I appreciate all the help.

I converted all the symbols I do want to keep to non-symbols (a placeholder such as PPP) and then got rid of the other characters using \W, see below.
(I did it over a few steps, but I think if I had wanted to do it in one, I could have used the following.)

Search for: -|\W|PPP
Replace with: PPP||-

thanks guys,
Jimmy 8)
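The protect, strip, restore steps described above can be sketched in Python like this. This is only an illustration, not the MRT itself: the function name is made up, "PPP" is assumed not to occur in any real file name, and the `re.ASCII` flag is needed because Python's `\W` would otherwise treat Japanese letters as word characters and leave them in place.

```python
import re

PLACEHOLDER = "PPP"  # assumption: never appears in a real file name


def strip_symbols_keep_hyphen(stem: str) -> str:
    """Remove everything except ASCII letters, digits, '_' and '-'."""
    # Step 1: hide the one symbol we want to keep behind a placeholder.
    protected = stem.replace("-", PLACEHOLDER)
    # Step 2: drop every character that is not [a-zA-Z0-9_].
    stripped = re.sub(r"\W", "", protected, flags=re.ASCII)
    # Step 3: bring the hidden symbol back.
    return stripped.replace(PLACEHOLDER, "-")


print(strip_symbols_keep_hyphen("キャッ-39 song"))
# prints: -39song
```

Note that spaces and dots are removed too, so this should be run on the name without its extension.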
JimmyTheBroker
Member
Posts: 179
Joined: 2017-06-07, 05:22 UTC

Post by *JimmyTheBroker »

elgonzo wrote: [...] For so many files, it probably is better to write a simple, tailor-made script (Powershell/.NET or any other preferred scripting language of your choice) that does the renaming while automatically handling possible file name collisions the way you want. [...]

Good advice. I like it a lot, but thank goodness I didn't need to go that deep.


I did run into multiple files with the same name, but the MRT renames them with (1), (2), and then I was able to add some other metadata to distinguish them.
gdpr deleted 6
Power Member
Posts: 872
Joined: 2013-09-04, 14:07 UTC

Post by *gdpr deleted 6 »

JimmyTheBroker wrote:I did run into the multiple files with the same name, but MRT renames them with (1),(2) [...]
Hehe, yeah, you are right. This is one of the new features in TC 9.xx that completely flew under my radar... :lol:
milo1012
Power Member
Posts: 1158
Joined: 2012-02-02, 19:23 UTC

Post by *milo1012 »

Why so complicated?
Just use RegEx and let TC find all characters in a certain range.
Example: keep ASCII, Basic Latin and most European characters, but remove all "upper" code points:
Search for: [\x{0250}-\x{FFFF}]
Replace with: _
[x] RegEx


(It seems that TC's RegEx engine doesn't allow code points > U+FFFF, so everything from U+10000 up to U+10FFFD isn't searchable with such an expression.)
Last edited by milo1012 on 2018-02-08, 19:37 UTC, edited 1 time in total.
TC plugins: PCREsearch and RegXtract
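For comparison, here is the same range idea as a Python sketch (not TC's RegEx engine, so the escape syntax differs: Python writes `\u0250` where TC uses `\x{0250}`). The pattern and function names are made up. Python also accepts code points above U+FFFF via `\U0010FFFF`, which is shown as a second pattern.

```python
import re

# Same range as in the post: U+0250 up to U+FFFF.
UPPER_BMP = re.compile(r"[\u0250-\uFFFF]")
# Extended variant that also covers everything above U+FFFF.
UPPER_ALL = re.compile(r"[\u0250-\U0010FFFF]")


def replace_upper(name: str, filler: str = "_") -> str:
    """Replace every character in U+0250..U+FFFF with `filler`."""
    return UPPER_BMP.sub(filler, name)


print(replace_upper("キャッ39 café.txt"))
# prints: ___39 café.txt
```

The é survives because U+00E9 lies below U+0250, which is exactly the "keep most European characters" behaviour described above.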
gdpr deleted 6
Power Member
Posts: 872
Joined: 2013-09-04, 14:07 UTC

Post by *gdpr deleted 6 »

milo1012 wrote: Why so complicated?
Just use RegEx and let TC find all characters in a certain range. [...]

Okay, you convinced me that I need a vacation... ;)
JimmyTheBroker
Member
Posts: 179
Joined: 2017-06-07, 05:22 UTC

Post by *JimmyTheBroker »

[\x{0250}-\x{FFFF}]
Oh awesome, thanks mate!


But I've got another question now, about hexadecimal.
milo1012 wrote:(it seems that TC's RegEx engine doesn't allow codepoints > U+FFFF - so everything up to U+10FFFD isn't searchable with such an expression)
I understand that the largest four-digit hexadecimal number is FFFF (65535 in decimal).

I'm assuming U+10FFFD means FFFFFFFFFD (9 F's and a D),
which would be 1099511627773 in base 10.
That's over 1 trillion possible characters, which must be totally wrong.
I've googled a bit but can't seem to find the answer. Sorry for the basic question.
gdpr deleted 6
Power Member
Posts: 872
Joined: 2013-09-04, 14:07 UTC

Post by *gdpr deleted 6 »

JimmyTheBroker wrote:I'm assuming U+10FFFD means FFFFFFFFFD (9 F's and a D)
which would be 1099511627773 in base10.
No :)
10FFFD is just hexadecimal 0x10FFFD, which is 1114109 decimal.
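If you want to check the arithmetic yourself, here is a short Python sketch. The helper `hex_to_int` is made up for illustration and does by hand what the built-in `int(text, 16)` does: each digit is worth 16 times more than the one to its right.

```python
def hex_to_int(digits: str) -> int:
    """Convert a hexadecimal digit string to an integer, digit by digit."""
    value = 0
    for digit in digits:
        # Shift what we have one hex place to the left, then add the digit.
        value = value * 16 + int(digit, 16)
    return value


print(hex_to_int("FFFF"))        # 65535
print(hex_to_int("10FFFD"))      # 1114109
print(hex_to_int("FFFFFFFFFD"))  # 1099511627773
```

So the "10" at the front of 10FFFD is simply two more hex digits (a 1 and a 0), not a repeat count for the F's.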
JimmyTheBroker
Member
Posts: 179
Joined: 2017-06-07, 05:22 UTC

Post by *JimmyTheBroker »

I think I get it.

So "U+" and "0x" are the same thing, both indicating that it's a hex number.

When is U+ used and when is 0x (for which languages)?

thanks
gdpr deleted 6
Power Member
Posts: 872
Joined: 2013-09-04, 14:07 UTC

Post by *gdpr deleted 6 »

No. U+ specifically denotes a Unicode code point. The actual code point value is then given as a hexadecimal number directly following U+.

"0x" is just a prefix commonly used to denote a hexadecimal number. By the way, "0x" is not the only way to denote a hex number, but it is by far the most common. Other ways used here and there to express a hexadecimal number like 0x1234 would be for example: 1234h, &H1234, $1234 (there are plenty more, but 3 examples are enough) - all denoting the same hexadecimal number 1234 (= 4660 decimal).

The programming language/software you are using (or documentation standard/style guide you are following) will dictate which notation you will have to use...
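As one concrete case, Python uses the 0x prefix for hex numbers and has no U+ literal, so going between a character and its U+ notation takes a small conversion. A minimal sketch, with made-up helper names:

```python
def to_u_notation(ch: str) -> str:
    """Format a single character as U+XXXX (at least four hex digits)."""
    return f"U+{ord(ch):04X}"


def from_u_notation(text: str) -> str:
    """Turn a string like 'U+30AD' back into the character it names."""
    if not text.upper().startswith("U+"):
        raise ValueError("expected a string starting with 'U+'")
    return chr(int(text[2:], 16))


print(to_u_notation("キ"))        # U+30AD
print(from_u_notation("U+30AD"))  # キ
print(0x1234)                     # 4660
```

The number after U+ and the number after 0x are written in the same base; only the meaning attached to the number differs.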
JimmyTheBroker
Member
Posts: 179
Joined: 2017-06-07, 05:22 UTC

Post by *JimmyTheBroker »

thanks!