Button Bar: UTF-8 list file w/o BOM
I'm not sure if this has already been suggested (a quick search didn't find anything),
but I would like to have a BOM-less UTF-8 list file for the Button Bar (i.e. besides the %UL and %UF we have now).
I already mentioned it here (in German) some time ago,
which demonstrates the use with cmd.exe where the BOM hinders batch processing.
I'm writing my own programs which get started by the Button Bar and a list file,
where I have to filter the BOM manually every time.
The BOM makes sense for UTF-16 files, but for UTF-8 it's not that common to include it,
since nearly all modern parsers recognize UTF-8 sequences w/o BOM (this includes TC's compare by content!)
I think it would be trivial to add this feature; the only major task would be to find another parameter name.
(Suggestion: %uL, %uF - lower case u)
I don't think it is a good idea to parse the whole list every time you need to detect whether it is UTF-8 or ANSI encoded, and it isn't always possible to detect UTF-8 (e.g. in case of short names), so the BOM is useful here.
It is not hard at all to check whether the first three bytes of the file match the BOM, advance the pointer by 3, and continue processing as if there were no BOM (if converting to UTF-16 is not required; however, there are no standard functions or APIs that accept UTF-8 w/o conversion).
And I doubt that cmd.exe will process UTF-8 encoded lists w/o BOM, because most programs accept ANSI or UTF-16 parameters, but not UTF-8...
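The check described above (compare the first three bytes against the BOM, then skip them) can be sketched as follows. This is only a minimal illustration in Python, not code from any poster; the names strip_utf8_bom and read_list_file are hypothetical, and it assumes the list file holds one UTF-8 file name per line.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def read_list_file(raw: bytes) -> list:
    """Decode a list file (assumed UTF-8, one name per line) into file names."""
    text = strip_utf8_bom(raw).decode("utf-8")
    # Split on LF and drop a trailing CR so both CRLF and LF files work.
    names = [line.rstrip("\r") for line in text.split("\n")]
    return [name for name in names if name]


# Example input: the same list with and without a BOM gives the same names.
with_bom = codecs.BOM_UTF8 + "a.txt\r\nfile1!.txt\r\n".encode("utf-8")
without_bom = "a.txt\r\nfile1!.txt\r\n".encode("utf-8")
names_with_bom = read_list_file(with_bom)
names_without_bom = read_list_file(without_bom)
```

Python's built-in "utf-8-sig" codec performs the same BOM handling while decoding, so raw.decode("utf-8-sig") would be a shorter alternative in that language.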
I appreciate your comments, but all you say is: you don't need this.
I don't think we need to discuss why it should be implemented.
I have a lot of experience when it comes to handling UTF files, and it makes a difference whether you need to handle all sorts of files
or can just assume UTF-8 file names separated by newlines.
MVV wrote: it isn't always possible to detect UTF-8 (e.g. in case of short names), so the BOM is useful here.
It is easily possible to detect UTF-8 files, especially when we're talking about a list file which usually is way < 1 MiB.
I'm already doing this in my plugins. The Scintilla-based projects have also been doing this for years.
MVV wrote: It is not hard at all to check whether the first three bytes of the file match the BOM, advance the pointer by 3, and continue processing as if there were no BOM (if converting to UTF-16 is not required; however, there are no standard functions or APIs that accept UTF-8 w/o conversion).
You don't need to tell me how to program things.
As already mentioned, I have my reasons, e.g. Java handles UTF-8 and conversions quite well.
MVV wrote: And I doubt that cmd.exe will process UTF-8 encoded lists w/o BOM
It does, have you even tried? It was just an example BTW.
MVV wrote: because most programs accept ANSI or UTF-16 parameters, but not UTF-8...
That's just your opinion, I've seen some programs handling them.
Also we're not talking about parameters, but a list file, which is read separately, not the command line.
I say it again: I have my own programs and reasons for it,
and I'm sure a lot of other people do too.
And to quote Wikipedia:
"The Unicode Standard permits the BOM in UTF-8, but does not require nor recommend its use... Byte order has no meaning in UTF-8..."
Have you ever worked with complex PHP scripts and XML?
You rarely see BOMs there, because they hinder parsing in the same way I mentioned earlier; modern parsers don't need a BOM.
So, I agree that the WinAPI basically just knows ANSI and UTF-16
(which is not exactly true, MS did support MBCS until some years ago),
but that doesn't mean there is no use for the UTF-8 variant,
otherwise we could just drop that option according to your point.
All I say is that the UTF-8 list file needs an optional BOM-less variant, because it's less trouble, that's all.
It neither hurts nor does it affect the already existing parameters.
Actually I'm not against your idea, I find it useful, both %W and %U should be processed this way, not only %U.
milo1012 wrote: It is easily possible to detect UTF-8 files, especially when we're talking about a list file which usually is way < 1 MiB. I'm already doing this in my plugins. The Scintilla based projects also do this since years.
milo1012 wrote: Have you ever worked with complex PHP scripts and XML? You rarely see BOMs there, because they hinder parsing the same way as I mentioned earlier, modern parsers don't need a BOM.
There are cases where it is impossible to detect encoding w/o any markers or tags. What if you have one short filename with sequence that may be a correct part of UTF-8?
milo1012 wrote: It does, have you even tried?
Of course cmd.exe will read UTF-8 files, but it will treat them as ASCII and won't be able to process files itself, only pass names to third-party programs that must expect UTF-8 paths.
MVV wrote: There are cases where it is impossible to detect encoding w/o any markers or tags
Rarely. For big files and insufficient buffers maybe.
You just need to check for bytes > 0x7F, and if you have some, check whether they form valid UTF-8 sequences. That's it.
(I'm just talking about UTF-8 of course, UTF-16 without BOM is something very different)
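The check described here (look for bytes above 0x7F and verify that they form UTF-8 sequences) might look roughly like the Python sketch below. The function name looks_like_utf8 is hypothetical, and the check is deliberately structural only: it verifies lead and continuation bytes but does not reject overlong 3/4-byte forms or surrogates, which a strict decoder would.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Structural check: pure ASCII, or high bytes that form UTF-8 sequences."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:              # plain ASCII byte
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:     # lead byte of a 2-byte sequence
            need = 1
        elif 0xE0 <= b <= 0xEF:   # lead byte of a 3-byte sequence
            need = 2
        elif 0xF0 <= b <= 0xF4:   # lead byte of a 4-byte sequence
            need = 3
        else:                     # stray continuation byte or invalid lead
            return False
        if i + need >= n:         # sequence is cut off at the end of the data
            return False
        for c in data[i + 1:i + 1 + need]:
            if not 0x80 <= c <= 0xBF:   # must be a continuation byte
                return False
        i += 1 + need
    return True


# The same name encoded as UTF-8 passes, encoded as cp1252 (ANSI) it fails.
utf8_name = "Grüße.txt".encode("utf-8")
ansi_name = "Grüße.txt".encode("cp1252")
```

With this, looks_like_utf8(utf8_name) is True and looks_like_utf8(ansi_name) is False, because the cp1252 byte 0xFC is not a valid UTF-8 lead byte.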
MVV wrote: What if you have one short filename with sequence that may be a correct part of UTF-8
Short file names? You mean 8.3 names like FILENA~1.TXT? Why should this be a problem?
If you haven't detected bytes >= 0x80, you send it to the WinAPI function GetFullPathName or similar (and are limited to ANSI and MAX_PATH of course).
If you have, convert it to UTF-16 and use GetFullPathNameW.
But seriously, why would I still do this and use short names nowadays?
A program that can't handle long names is very likely not able to handle Unicode in any form either.
MVV wrote: Of course cmd.exe will read UTF-8 files, but it will treat them as ASCII and won't be able to process files itself, only pass names to third-party programs that must expect UTF-8 paths.
Sure, but that's the whole point of list files in most situations.
Either let some scripting mechanism split the file for you and loop a program call, or simply pass the list file path to your desired program and let it parse the file on its own.
milo1012 wrote: Short file names? You mean 8.3 names like FILENA~1.TXT? Why should this be a problem? If you haven't detected bytes >= 0x80 you send it to WinAPI function GetFullPathName or similar (and are limited to ANSI and MAX_PATH of course).
I don't mean DOS names, I mean normal short filenames that contain characters with codes 0x80-0xFF. Correct UTF-8 sequences don't guarantee that the file is in UTF-8; moreover, short names contain fewer sequences.
MVV wrote: ...short filenames that contain characters with codes 0x80-0xFF
Now I see what you mean.
But do you really care about that tiny probability of having ANSI bytes like
110xxxxx 10xxxxxx
accidentally forming a valid UTF-8 code?
That's why you check not just a few lines, but a large portion.
I agree, when having maybe just one file name/path with just one occurrence like above, it may be possible.
But seriously, the chances are thin.
(I could calculate them, for a 2 byte code point we're probably around 1 in 65535...more or less)
As soon as you have a second code point that does not form a valid UTF-8,
the detection fails and the file is correctly recognized as ANSI.
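To put a rough number on the accidental-match case, the following Python sketch counts how many pairs of high bytes (0x80-0xFF) happen to form a valid 2-byte UTF-8 sequence. It assumes uniformly random bytes, which real ANSI file names are not, so it is only an order-of-magnitude illustration; the helper name is_valid_utf8 is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# All pairs of two high bytes, as they could appear in an ANSI file name.
high = range(0x80, 0x100)
valid_pairs = sum(is_valid_utf8(bytes([a, b])) for a in high for b in high)
total_pairs = len(high) ** 2
ratio = valid_pairs / total_pairs

# A concrete ambiguous case: the two cp1252 characters "Ã©" are stored as
# the bytes C3 A9, which are also the UTF-8 encoding of "é".
ambiguous = "Ã©".encode("cp1252")
```

Under that uniform assumption, 1920 of the 16384 high-byte pairs are valid UTF-8 (about 11.7%), so a single isolated pair is ambiguous more often than the rough guess above suggests. The main argument is unaffected: each further high-byte sequence that is not valid UTF-8 rules the file out immediately.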
Anyway, for such few-byte-situations you can always expect trouble (Bush hid the facts).
The point was that my program expects UTF-8, where I don't want or need a BOM, so the automatic detection was just optional.
I agree that the chances are thin, but they exist. Also I don't like the idea of reading up to the entire file (e.g. if it doesn't contain 0x80-0xFF) in order to detect the encoding. Of course it is not a problem if you pass the list in the only encoding that the program expects (or pass the encoding as a parameter, like for 7z.exe).
BTW it is easy to cut BOM with batch file:
Code:
@echo off
setlocal enabledelayedexpansion
for /f "usebackq" %%f in (`type %1`) do (
set curname=%%~f
echo "!curname:п»ї=!"
)
pause
MVV wrote: I don't like the idea to read up to entire file
If you're going to read/parse the file with your program anyway (afterwards),
the file is cached and the performance penalty is virtually non-existent,
especially with today's processing power and assuming the list file size is way below the file caching limit, even on legacy systems.
MVV wrote: BTW it is easy to cut BOM with batch file
Well, "easy" is a subjective term.
I don't like that sticky string processing and delayed expansion in batch files; it's even more complicated than simply calling my BOM remover.
It seems you posted the BOM as it appears in your system code page 1251, so pasting the code on non-1251 systems (like mine) doesn't work.
Should be ï»¿ on 1252 - or - whatever 0xEFBBBF looks like on your system ANSI page.
It's working, but not for files containing spaces in them (because of missing delimiter and type).
Also, why remove the surrounding quotes? The list file never has them.
Actually I didn't want to continue my example, but I swiftly modified it now to see if it works without flaws....
It does so far, even with CJK symbols (due to chcp). (Button Bar parameters: %UF "%P")
The only thing that is messed up are files ending with an exclamation mark.
It's removed for some reason.
( file1!.txt becomes file1.txt )
Any idea why and how it can be prevented?
Code:
@echo off
chcp 65001
setlocal enabledelayedexpansion
for /F "usebackq delims=" %%f in ("%~1") do (
set curname=%%f
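rem The search string is the raw UTF-8 BOM (bytes EF BB BF), shown here as it looks when the batch file is saved as ANSI code page 1252.
set curname=!curname:ï»¿=!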
"D:\tools\mkvtools\mkvextract.exe" tracks "%~2!curname!" 2:"%~2!curname! track1.ogg"
)
pause
milo1012 wrote: Well, "easy" is a subjective term.
Agree, but it is compact, and it is better than nothing.
milo1012 wrote: Should be ï»¿ on 1252 - or - whatever 0xEFBBBF looks like on your system ANSI page.
You're right, my locale is 1251; after saving the BOM to a file it will work with any locale, however when I paste it to the forum, Unicode prevents it from being converted correctly.
milo1012 wrote: Also, why remove the surrounding quotes? The list file never has them.
I usually use %~ because it is a generic way that works with strings that may be quoted too.
Quoting the loop variable should help to process files with spaces in their names.
I don't like delayed expansion either, but sometimes it is useful. Exclamation marks are special characters when delayed expansion is used (I've stumbled over this recently); they must be escaped (with ^ as usual for batch; more than once if used outside of quoted strings). Yes, there are problematic characters for the batch processor, however there are such characters for programs too (e.g. some programs treat \" in a quoted string as just a quote, so they can't parse paths like "C:\").
You know, the problem may be solved w/o delayed expansion, using calls:
Code:
@echo off
for /f "usebackq" %%f in (`type %1`) do call :loop_body "%%~f" %2 || goto after_loop
:after_loop
pause
goto :EOF
:loop_body
set "curname=%~1"
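rem The search string is the raw UTF-8 BOM (bytes EF BB BF), shown here as it looks in ANSI code page 1252.
set "curname=%curname:ï»¿=%"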
echo "D:\tools\mkvtools\mkvextract.exe" tracks "%~2%curname%" 2:"%~2%curname% track1.ogg"
rem Use exit /b 1 to break loop
exit /b 0