Button Bar: UTF-8 list file w/o BOM

milo1012 · Post by *milo1012 » 2014-04-04, 14:53 UTC

I'm not sure if this has already been suggested (a quick search didn't find anything),
but I would like to have a BOM-less UTF-8 list file for the Button Bar (i.e. besides the %UL and %UF we have now).

I already mentioned it here (in German) some time ago,
which demonstrates the use with cmd.exe where the BOM hinders batch processing.
I'm writing my own programs which get started by the Button Bar and a list file,
where I have to filter the BOM manually every time.

The BOM for UTF-16 files makes sense, but for UTF-8 it's not that common to include one,
since nearly all modern parsers recognize UTF-8 sequences w/o BOM (this includes TC's compare by content!).

I think it would be trivial to add this feature; the only real task would be to find another parameter name.
(Suggestion: %uL, %uF - lower case u)

MVV · Post by *MVV » 2014-04-04, 15:25 UTC

I don't think it is a good idea to parse the whole list every time you need to detect whether it is UTF-8 or ANSI encoded, and it isn't always possible to detect UTF-8 (e.g. in case of short names), so BOM is useful here.

It is not hard at all to check whether the first three bytes of the file match the BOM, advance the pointer by 3, and continue processing as if there were no BOM (if converting to UTF-16 is not required; however, there are no standard functions or APIs that accept UTF-8 w/o conversion).
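
The check described above can be sketched as follows. This is only an illustration in Python, not code from either poster; the function name `strip_utf8_bom` and the sample file names are made up for the example.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]  # skip exactly three bytes
    return data

# Simulated %UL/%UF-style list file: BOM, then two names separated by CRLF.
raw = codecs.BOM_UTF8 + "file1.txt\r\nÜbung.txt\r\n".encode("utf-8")
names = strip_utf8_bom(raw).decode("utf-8").splitlines()
print(names)
```

The same function leaves BOM-less input untouched, so a program written this way would accept both variants of the list file.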

And I doubt that cmd.exe will process UTF-8 encoded lists w/o BOM because most programs accept ANSI or UTF-16 parameters, but not UTF-8...

milo1012 · Post by *milo1012 » 2014-04-04, 17:45 UTC

I appreciate your comments, but all you say is: you don't need this.
I don't think we need to discuss why it should be implemented.
I have a lot of experience when it comes to handling UTF files, and there is a difference between having to handle all sorts of files
and simply assuming UTF-8 file names separated by newlines.

MVV wrote:it isn't always possible to detect UTF-8 (e.g. in case of short names), so BOM is useful here.

It is easily possible to detect UTF-8 files, especially when we're talking about a list file which usually is way < 1 MiB.
I'm already doing this in my plugins. The Scintilla-based projects have also done this for years.

MVV wrote: It is not hard at all to check whether the first three bytes of the file match the BOM, advance the pointer by 3, and continue processing as if there were no BOM (if converting to UTF-16 is not required; however, there are no standard functions or APIs that accept UTF-8 w/o conversion).

You don't need to tell me how to program things.
As already mentioned, I have my reasons, e.g. Java handles UTF-8 and conversions quite well.

MVV wrote: And I doubt that cmd.exe will process UTF-8 encoded lists w/o BOM

It does, have you even tried? It was just an example BTW.

MVV wrote:because most programs accept ANSI or UTF-16 parameters, but not UTF-8...

That's just your opinion, I've seen some programs handling them.
Also we're not talking about parameters, but a list file, which is read separately, not the command line.
I'll say it again: I have my own programs and reasons for it,
and I'm sure a lot of other people do too.
And to quote Wikipedia:

The Unicode Standard permits the BOM in UTF-8, but does not require nor recommend its use...Byte order has no meaning in UTF-8...

Have you ever worked with complex PHP scripts and XML?
You rarely see BOMs there, because they hinder parsing in the same way I mentioned earlier; modern parsers don't need a BOM.

So, I agree that the WinAPI basically just knows ANSI and UTF-16
(which is not exactly true, MS did support MBCS until some years ago),
but that doesn't mean there is no use for the UTF-8 variant,
otherwise we could just drop that option according to your point.

All I say is that the UTF-8 list file needs an optional BOM-less variant, because it's less trouble, that's all.

It neither hurts nor does it affect the already existing parameters.

MVV · Post by *MVV » 2014-04-05, 07:38 UTC

Actually I'm not against your idea; I find it useful. Both %W and %U should be handled this way, not only %U.

milo1012 wrote:It is easily possible to detect UTF-8 files, especially when we're talking about a list file which usually is way < 1 MiB.
I'm already doing this in my plugins. The Scintilla-based projects have also done this for years.

milo1012 wrote:Have you ever worked with complex PHP scripts and XML?
You rarely see BOMs there, because they hinder parsing in the same way I mentioned earlier; modern parsers don't need a BOM.

There are cases where it is impossible to detect encoding w/o any markers or tags. What if you have one short filename with a sequence that may be a correct part of UTF-8?

milo1012 wrote:It does, have you even tried?

Of course cmd.exe will read UTF-8 files, but it will treat them as ASCII and won't be able to process files itself, only pass names to third-party programs that must expect UTF-8 paths.

milo1012 · Post by *milo1012 » 2014-04-05, 13:51 UTC

MVV wrote:There are cases where it is impossible to detect encoding w/o any markers or tags

Rarely. For big files and insufficient buffers maybe.
You just need to check for bytes > 0x7F, and if you have some, check whether they form valid UTF-8 sequences. That's it.
(I'm just talking about UTF-8 of course, UTF-16 without BOM is something very different)
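
The two-step check described above can be sketched like this. It is a minimal illustration, not the code from the plugins mentioned in the thread; the function name `classify_list_file` and the sample names are hypothetical, and Python's strict UTF-8 decoder stands in for a hand-written sequence check.

```python
def classify_list_file(data: bytes) -> str:
    """Classify raw list-file bytes as 'ascii', 'utf-8' or 'ansi'."""
    if all(b <= 0x7F for b in data):
        return "ascii"   # no high bytes: same bytes in ASCII, ANSI and UTF-8
    try:
        data.decode("utf-8")   # strict: rejects malformed or overlong sequences
    except UnicodeDecodeError:
        return "ansi"    # high bytes present, but not valid UTF-8
    return "utf-8"

print(classify_list_file(b"readme.txt"))                   # ascii
print(classify_list_file("Müller.txt".encode("utf-8")))    # utf-8
print(classify_list_file("Müller.txt".encode("cp1252")))   # ansi
print(classify_list_file("Ã©.txt".encode("cp1252")))       # utf-8 (false positive)
```

The last call shows the limitation MVV raises: a cp1252 name whose high bytes happen to be C3 A9 passes the check, because those two bytes are also the UTF-8 encoding of "é".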

MVV wrote:What if you have one short filename with a sequence that may be a correct part of UTF-8

Short file names? You mean 8.3 names like FILENA~1.TXT ? Why should this be a problem?
If you haven't detected bytes >= 0x80 you send it to WinAPI function GetFullPathName or similar (and are limited to ANSI and MAX_PATH of course).
If you have, convert it to UTF-16 and use GetFullPathNameW.
But seriously, why would I still do this and use short names nowadays?
A program that can't handle long names is very likely not able to handle Unicode in any form either.

MVV wrote:Of course cmd.exe will read UTF-8 files, but it will treat them as ASCII and won't be able to process files itself, only pass names to third-party programs that must expect UTF-8 paths.

Sure, but that's the whole point of list files in most situations.
Either let some scripting mechanism split the file for you and loop a program call, or simply pass the list file path to your desired program and let it parse on its own.

MVV · Post by *MVV » 2014-04-05, 15:11 UTC

milo1012 wrote:Short file names? You mean 8.3 names like FILENA~1.TXT ? Why should this be a problem?
If you haven't detected bytes >= 0x80 you send it to WinAPI function GetFullPathName or similar (and are limited to ANSI and MAX_PATH of course).

I don't mean DOS names, I mean normal short filenames that contain characters with codes 0x80-0xFF. Valid UTF-8 sequences don't guarantee that the file is in UTF-8; moreover, short names contain fewer sequences.

milo1012 · Post by *milo1012 » 2014-04-05, 17:09 UTC

MVV wrote:...short filenames that contain characters with codes 0x80-0xFF

Now I see what you mean.
But do you really care about that tiny probability of having ANSI bytes like
110xxxxx 10xxxxxx
accidentally forming a valid UTF-8 code?

That's why you check not just a few lines, but a large portion.
I agree that with maybe just one file name/path containing a single occurrence like the one above, a false match is possible.
But seriously, the chances are thin.
(I could calculate them, for a 2 byte code point we're probably around 1 in 65535...more or less)
As soon as you have a second high-byte sequence that does not form valid UTF-8,
the detection fails and the file is correctly recognized as ANSI.
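
The odds for a single pair can be counted by brute force. The sketch below is not from the thread; it assumes every high byte 0x80-0xFF is equally likely, which real ANSI file names are not, and simply counts how many pairs of high bytes form one complete, valid two-byte UTF-8 sequence.

```python
HIGH = range(0x80, 0x100)  # the 128 byte values above ASCII

def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A pair is valid only with a lead byte C2..DF and a continuation byte 80..BF.
valid_pairs = sum(is_valid_utf8(bytes([a, b])) for a in HIGH for b in HIGH)
total_pairs = len(HIGH) ** 2
print(valid_pairs, total_pairs, valid_pairs / total_pairs)
# prints: 1920 16384 0.1171875
```

Under that uniform assumption roughly one pair of high bytes in nine is valid UTF-8, which is far more than the rough guess above. Real names are not uniform, so the actual rate depends on the code page and the language, and every additional high-byte run in the list has to pass the same test, which is why checking a large portion of the file helps.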

Anyway, for such few-byte-situations you can always expect trouble (Bush hid the facts).

The point was that my program expects UTF-8, where I don't want or need a BOM, so the automatic detection was just optional.

MVV · Post by *MVV » 2014-04-05, 19:12 UTC

I agree that chances are thin, but they exist. Also I don't like the idea of reading up to the entire file (e.g. if it doesn't contain 0x80-0xFF) in order to detect the encoding. Of course it is not a problem if you pass the list in the only encoding that the program expects (or pass the encoding as a parameter like for 7z.exe).

BTW it is easy to cut BOM with batch file:

Code:

@echo off
setlocal enabledelayedexpansion
for /f "usebackq" %%f in (`type %1`) do (
	set curname=%%~f
	echo "!curname:п»ї=!"
)
pause

milo1012 · Post by *milo1012 » 2014-04-07, 00:51 UTC

MVV wrote:I don't like the idea of reading up to the entire file

If you're going to read/parse the file with your program anyway (afterwards),
the file is cached and the performance penalty is virtually non-existent,
especially with today's processing power and assuming the list file size is way below file caching limit, even on legacy systems.

MVV wrote:BTW it is easy to cut BOM with batch file

Well, "easy" is a subjective term.

I don't like that sticky string processing and delayed expansion on batch files, it's even more complicated than simply calling my BOM remover.
It seems you posted the BOM from your System code page 1251, so pasting the code on non-1251 systems (like mine) doesn't work.
Should be ï»¿ on 1252 - or - whatever 0xEFBBBF looks like on your system ANSI page.
It works, but not for file names containing spaces (because of the missing delims= option and the use of type).
Also, why remove surrounding quotes? The list file never has them.
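
The code page remark above is easy to reproduce. A small sketch (not from the thread) that decodes the three BOM bytes EF BB BF with the two ANSI code pages mentioned:

```python
import codecs

bom = codecs.BOM_UTF8  # the three bytes EF BB BF

# The same bytes show up as different characters depending on the ANSI code page.
for page in ("cp1251", "cp1252"):
    print(page, bom.decode(page))
# prints:
# cp1251 п»ї
# cp1252 ï»¿
```

This is why a batch file that contains the BOM as pasted text only matches on systems using the same ANSI code page as the one it was typed on.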

Actually I didn't want to continue my example, but I swiftly modified it now to see if it works without flaws....
It does so far, even with CJK symbols (due to chcp). (Button Bar parameters: %UF "%P")

The only thing that gets messed up is file names with an exclamation mark.
It's removed for some reason.
( file1!.txt becomes file1.txt )
Any idea why and how it can be prevented?

Code:

@echo off
chcp 65001
setlocal enabledelayedexpansion
for /F "usebackq delims=" %%f in ("%~1") do (
  set curname=%%f
  set curname=!curname:ï»¿=!
  "D:\tools\mkvtools\mkvextract.exe" tracks "%~2!curname!" 2:"%~2!curname! track1.ogg"
)
pause

Anyway, I think we all can see that there is still quite some effort involved to deal with that BOM, no matter what you do in the batch.
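
For comparison, here is how the same loop might look outside of batch. This is a hedged sketch, not milo1012's actual BOM remover: the function names `read_list_file` and `build_commands` and the folder `C:\video\` are hypothetical, the mkvextract arguments are copied from the batch call above, and nothing is executed. It relies on Python's `utf-8-sig` codec, which accepts UTF-8 input with or without a BOM.

```python
import io

def read_list_file(stream):
    """Yield file names from a decoded list file, one name per line."""
    for line in stream:
        name = line.rstrip("\r\n")
        if name:
            yield name

def build_commands(names, folder, tool=r"D:\tools\mkvtools\mkvextract.exe"):
    """Build one argument list per name, mirroring the batch call above."""
    return [[tool, "tracks", folder + name, "2:" + folder + name + " track1.ogg"]
            for name in names]

# Simulated list file: BOM, then two names, one containing '!'.
raw = b"\xef\xbb\xbf" + "file1!.txt\r\nмузыка.mkv\r\n".encode("utf-8")
stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig", newline="")
commands = build_commands(read_list_file(stream), "C:\\video\\")
for cmd in commands:
    print(cmd)
```

With a real file, `open(path, encoding="utf-8-sig")` would replace the simulated stream. The exclamation mark survives here because no delayed expansion is involved.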

MVV · Post by *MVV » 2014-04-07, 06:45 UTC

milo1012 wrote:Well, "easy" is a subjective term.

Agree, but it is compact, and it is better than nothing.

milo1012 wrote:Should be ï»¿ on 1252 - or - whatever 0xEFBBBF looks like on your system ANSI page.

You're right, my locale is 1251. After saving the BOM to a file it will work with any locale; however, when I paste it into the forum, Unicode prevents correct conversion.

milo1012 wrote:Also, why remove surrounding quotes? The list file never has them.

I usually use %~ because it is a generic way that works with strings that may be quoted too.

Quoting the loop variable should help to process files with spaces in their names.

I don't like delayed expansion either, but sometimes it is useful. Exclamation marks are special characters when delayed expansion is enabled (I stumbled over this recently); they must be escaped (with ^ as usual for batch; more than once if used outside of quoted strings). Yes, there are problematic characters for the batch processor, but programs have such characters too (e.g. some programs treat \" in a quoted string as a literal quote and so can't parse paths like "C:\").

You know, the problem can be solved w/o delayed expansion, using calls:

Code:

@echo off
rem "delims=" keeps names with spaces intact (the default delimiters are space and tab)
for /f "usebackq delims=" %%f in (`type %1`) do call :loop_body "%%~f" %2 || goto after_loop
:after_loop
pause
goto :EOF

:loop_body
set "curname=%~1"
set "curname=%curname:ï»¿=%"
echo "D:\tools\mkvtools\mkvextract.exe" tracks "%~2%curname%" 2:"%~2%curname% track1.ogg"
rem Use exit /b 1 to break loop
exit /b 0