AntonyD wrote: 2023-11-16, 18:31 UTC
> It is understood that the BOM marker is ONLY related to UTF-8 encoded files!!!
It's not that.
LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
Moderators: Hacker, petermad, Stefan2, white
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
Overquoting is evil! 👎
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
Thanks. Have included that in the .lng file and will upload a new archive once the issue about the HTML UTF-8 is clear.
AntonyD wrote: 2023-11-16, 18:27 UTC
> As a matter of fact - that’s why I asked you to provide at least one example of such a file - where you\your plug would find these commixed line breaks.
Well, as I said, you can create such files yourself by combining files with different line break types into one file. To make things easier and because the ApacheHaus.com domain is down right now (domain name can't be properly resolved by most DNS servers), I've uploaded a test file here.
Regarding the BOM the plugin works like this:
- It checks the first two bytes if they're equal to a UTF-16 LE/BE BOM. If any of them match, either "UTF-16 LE" or "UTF-16 BE" is returned.
- Next it checks the first three bytes if they're equal to the UTF-8 BOM. If so, it returns "UTF-8".
- Everything else is considered as having no BOM and the plugin returns "None". This also applies to UTF-8 encoded files without a BOM. And as I said, encoding of characters is not checked and irrelevant.
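The three steps above can be sketched roughly as follows. This is only an illustration of the described check order, not the plugin's actual source; the function name `detectBom` and its signature are assumptions made for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not taken from the plugin): returns the BOM label for
// the leading bytes of a file, checking in the order described above.
std::string detectBom(const unsigned char* data, std::size_t size) {
    // Step 1: the first two bytes against the UTF-16 BOMs.
    if (size >= 2) {
        if (data[0] == 0xFF && data[1] == 0xFE) return "UTF-16 LE";
        if (data[0] == 0xFE && data[1] == 0xFF) return "UTF-16 BE";
    }
    // Step 2: the first three bytes against the UTF-8 BOM (EF BB BF).
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return "UTF-8";
    }
    // Step 3: everything else, including UTF-8 text without a BOM.
    return "None";
}
```

Because only the first two bytes are compared in step 1, a file starting with FF FE 00 00 is reported as "UTF-16 LE" by this sketch.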
Regards
Dalai
#101164 Personal licence
Ryzen 5 2600, 16 GiB RAM, ASUS Prime X370-A, Win7 x64
Plugins: Services2, Startups, CertificateInfo, SignatureInfo, LineBreakInfo - Download-Mirror
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
Why don't you want to universalize the plugin for rarer encodings?
Thanks for the work, by the way.
Overquoting is evil! 👎
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
We'll see if I expand the BOM detection in the future. UTF-16 detection was important because of the difference in bytes per line-break character compared to ANSI/UTF-8. I had UTF-32 detection in there for a short period of time, but there are quite a lot of combinations possible of where a CR might be at the end of the buffer. And since it doesn't have much practical relevance I removed it.
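To illustrate why the code unit width matters for counting line breaks, here is a minimal sketch. It assumes the whole file content (after any BOM) is already in one buffer, so it sidesteps the buffer-boundary problem mentioned above. The names `BreakCounts`, `countLineBreaks` and `classify` are made up for the example, and the plugin's real counting rules may differ.

```cpp
#include <cstddef>
#include <string>

struct BreakCounts { std::size_t cr = 0, lf = 0, crlf = 0; };

// unitSize: 1 for ANSI/UTF-8, 2 for UTF-16.
// littleEndian is only used when unitSize is 2.
BreakCounts countLineBreaks(const unsigned char* data, std::size_t size,
                            std::size_t unitSize, bool littleEndian) {
    BreakCounts counts;
    // Decode the code unit that starts at byte offset i.
    auto unitAt = [&](std::size_t i) -> unsigned {
        if (unitSize == 1) return data[i];
        return littleEndian ? (data[i] | (data[i + 1] << 8))
                            : ((data[i] << 8) | data[i + 1]);
    };
    for (std::size_t i = 0; i + unitSize <= size; i += unitSize) {
        unsigned u = unitAt(i);
        if (u == 0x0D) {
            bool followedByLf = i + 2 * unitSize <= size &&
                                unitAt(i + unitSize) == 0x0A;
            if (followedByLf) { ++counts.crlf; i += unitSize; } // skip the LF
            else ++counts.cr;
        } else if (u == 0x0A) {
            ++counts.lf;
        }
    }
    return counts;
}

// Map the counts to one of the labels used in the language file.
std::string classify(const BreakCounts& c) {
    int kinds = (c.cr > 0) + (c.lf > 0) + (c.crlf > 0);
    if (kinds == 0) return "N/A";
    if (kinds > 1) return "Mixed";
    if (c.crlf > 0) return "CRLF";
    return c.cr > 0 ? "CR" : "LF";
}
```

With `unitSize` 1, a UTF-16 file would be scanned byte by byte and its null bytes would get in the way of pairing CR with LF, which is why the BOM has to be known first.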
Regards
Dalai
Last edited by Dalai on 2023-11-17, 09:25 UTC, edited 1 time in total.
#101164 Personal licence
Ryzen 5 2600, 16 GiB RAM, ASUS Prime X370-A, Win7 x64
Plugins: Services2, Startups, CertificateInfo, SignatureInfo, LineBreakInfo - Download-Mirror
- oOoZEUSoOo
- Junior Member
- Posts: 60
- Joined: 2021-07-09, 18:26 UTC
- Location: France
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
2Dalai,
Thank you for your work ^^
Here is a French translation in two variants: the first uses complete words but takes up a lot of space, the second uses contractions and takes up less space.
I'm using the second one.
Code: Select all
[fra]
;--- Line Breaks
Line Breaks=Sauts de ligne
N/A|Binary|LF|CR|CRLF|Mixed=Aucun|Binaire|LF|CR|CRLF|Mixte
;--- BOM Type
BOM Type=Type de BOM (Byte Order Mark)
None|UTF-8|UTF-16 LE|UTF-16 BE=Aucun|UTF-8|UTF-16 LE|UTF-16 BE
Binary Count=Nombre de caractères binaires
CR Count=Nombre de CR
LF Count=Nombre de LF
CRLF Count=Nombre de CRLF
FF Count=Nombre de FF
VT Count=Nombre de VT
Bytes Read=Octets Lus
Code: Select all
[fra]
;--- Line Breaks
Line Breaks=Sauts de ligne
N/A|Binary|LF|CR|CRLF|Mixed=Aucun|Binaire|LF|CR|CRLF|Mixte
;--- BOM Type
BOM Type=Type de BOM (Byte Order Mark)
None|UTF-8|UTF-16 LE|UTF-16 BE=Aucun|UTF-8|UTF-16 LE|UTF-16 BE
Binary Count=Nbre Car binaires
CR Count=Nbre CR
LF Count=Nbre LF
CRLF Count=Nbre CRLF
FF Count=Nbre FF
VT Count=Nbre VT
Bytes Read=Octets Lus
Best regards.
Registered User. Total Commander : The best file manager...
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
> I had UTF-32 detection in there for a short period of time, but there are quite a lot of combinations possible of where a CR might be at the end of the buffer. And since it doesn't have much practical relevance I removed it.
And it looks like that is why I got these results:
file_UTF-32BE.txt = Line breaks(Binary file), BOM-marker(NONE)
file_UTF-32LE.txt = Line breaks(Mixed types), BOM-marker(UTF-16LE)
These files were in fact exactly the same text file with pseudo-content and proper Windows line breaks (CRLF).
For the UTF-8, UTF-16LE and UTF-16BE encodings, detection was also only partially correct:
Line breaks (CRLF) were detected properly for all 3 files, but the BOM marker was "strange":
UTF-8 was detected as NONE. Only UTF-16LE and UTF-16BE were detected properly.
Last edited by AntonyD on 2023-11-17, 08:16 UTC, edited 1 time in total.
#146217 personal license
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
Ouch, it looks like Lister is not able to detect and correctly render the content of UTF-32LE/BE encoded files!
IMHO it's a bug!
#146217 personal license
- ghisler(Author)
- Site Admin
- Posts: 50386
- Joined: 2003-02-04, 09:46 UTC
- Location: Switzerland
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
Lister does not support UTF-16 without BOM.
Author of Total Commander
https://www.ghisler.com
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
2white
Thanks for moving the posts.
2oOoZEUSoOo
Thanks. I think I'm going to include both translations - the shorter one as [fra] and the longer one as [fra2] - so users can switch to either of them if they really want to.
AntonyD wrote: 2023-11-17, 08:09 UTC
> These files in fact were absolutely the same txt file with pseudo-content with proper WIN line breaks (CRLF)
That may be the case, but UTF-32 uses four bytes per character (or code point), and a lot of them are probably null bytes. Hence UTF-32 BE files are detected as binary. The BOMs of UTF-16 LE and UTF-32 LE are ambiguous in the first two bytes (they're identical), thus UTF-32 LE files are detected as UTF-16 LE.
> UTF-8 - was detected as NONE.
Then the file has no BOM. Please keep in mind that Lister and a lot of editors check the file's encoding, and may choose to use UTF-8 based on their detection. But UTF-8 is not the same as UTF-8 with BOM. And this plugin only cares about the BOM, as I said several times now.
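For illustration only: the two-byte ambiguity described above goes away if the longer BOMs are compared before the shorter ones. The sketch below is not what the plugin does (it has no UTF-32 support); the names `BomEntry` and `detectBomLongestFirst` are invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct BomEntry {
    std::string label;
    std::vector<unsigned char> bytes;
};

// Hypothetical variant: compares the 4-byte UTF-32 BOMs before the shorter
// ones, so FF FE 00 00 is not cut short to the UTF-16 LE BOM FF FE.
std::string detectBomLongestFirst(const unsigned char* data, std::size_t size) {
    static const std::vector<BomEntry> table = {
        {"UTF-32 LE", {0xFF, 0xFE, 0x00, 0x00}},
        {"UTF-32 BE", {0x00, 0x00, 0xFE, 0xFF}},
        {"UTF-8",     {0xEF, 0xBB, 0xBF}},
        {"UTF-16 LE", {0xFF, 0xFE}},
        {"UTF-16 BE", {0xFE, 0xFF}},
    };
    for (const BomEntry& entry : table) {
        if (size >= entry.bytes.size() &&
            std::equal(entry.bytes.begin(), entry.bytes.end(), data)) {
            return entry.label;
        }
    }
    return "None";
}
```

One caveat remains: a UTF-16 LE file whose first character is U+0000 also starts with FF FE 00 00, so a BOM-only check cannot separate the two with complete certainty.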
Regards
Dalai
#101164 Personal licence
Ryzen 5 2600, 16 GiB RAM, ASUS Prime X370-A, Win7 x64
Plugins: Services2, Startups, CertificateInfo, SignatureInfo, LineBreakInfo - Download-Mirror
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
2ghisler(Author)
> Lister does not support UTF-16 without BOM.
What was this explanation in response to? I did not write anything strange about the UTF-16 encoding )))
The problem was found only for the UTF-32 variants!
But after your clarification I checked, and yes: Lister does not display UTF-16BE w/o BOM correctly.
BUT it displays UTF-16LE w/o BOM perfectly well.
So we have 5 problems with Lister displaying text files?
For UTF-16BE w/o BOM, UTF-32LE, UTF-32BE, UTF-32LE w/o BOM and UTF-32BE w/o BOM.
How do you plan to fix that?
#146217 personal license
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
2Dalai
> Then the file has no BOM.
It has it.
https://pixeldrain.com/u/ZNKpj5D3
https://ibb.co/8sRSS33
That's what you should also get during the testing phase.
#146217 personal license
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
AntonyD wrote: 2023-11-17, 11:55 UTC
> It has it.
> https://pixeldrain.com/u/ZNKpj5D3
> https://ibb.co/8sRSS33
> that's what you should get also during the testing phase.
I get the same results, and all of them are as I expect them. Files without a BOM are detected as having no BOM - which is correct. UTF-32 BE files are detected as binary and UTF-32 LE files as UTF-16 LE and I explained the reasons above.
So which of these results is wrong in your opinion?
Regards
Dalai
#101164 Personal licence
Ryzen 5 2600, 16 GiB RAM, ASUS Prime X370-A, Win7 x64
Plugins: Services2, Startups, CertificateInfo, SignatureInfo, LineBreakInfo - Download-Mirror
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
If I read your explanation correctly, it turns out that the point of the plugin is largely lost((((
Because any discrepancy with expectations simply receives a justification, but that does not bring the result any closer to what is expected.
For example:
> The BOMs of UTF16 LE and UTF-32 LE are ambiguous in the first two bytes (they're identical), thus UTF-32 LE files are detected as UTF-16 LE.
A very simple explanation. But what about the fact that I expect to see ACCURATE data?
After all, full-fledged text editors were able to determine this data (when I opened the same files), and did so accurately!
If you don't check the encoding, then what's the point of fields with half-correct information?
In that case, the plugin should only report the count and type of line breaks, IMHO.
Well, what's the use of displaying UTF-16 LE for what is in fact a UTF-32 LE file?
#146217 personal license
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
AntonyD wrote: 2023-11-17, 13:36 UTC
> very simple explanation - But what about the fact that I expect to see the ACCURATE data?
Let me ask again: Which of these results is wrong or inaccurate in your opinion? And let's leave out UTF-32 as that isn't supported and not of much relevance anyway.
> After all, full-fledged text editors were able to determine (when I opened the same files) their data - and do that accurate!
Yes, as I said Lister and proper editors detect the file's encoding based on their content in addition to the BOM, but my plugin only does the latter. I don't intend to implement an encoding check because there is already a plugin for that: EncInfo.
> If you don't control the definition of the encoding, then what's the point of fields with half-correct information?
Which of the provided information is only half correct? Again, excluding UTF-32.
Regards
Dalai
#101164 Personal licence
Ryzen 5 2600, 16 GiB RAM, ASUS Prime X370-A, Win7 x64
Plugins: Services2, Startups, CertificateInfo, SignatureInfo, LineBreakInfo - Download-Mirror
Re: LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences
> Which of these results is wrong or inaccurate in your opinion? And let's leave out UTF-32 as that isn't supported and not of much relevance anyway.
This sentence is exactly where my expectations collide with reality.
I.e. EITHER the plugin correctly processes all test files, including the information on encodings and markers,
OR the plugin is tailored to something very valuable and unique.
For example, ONLY to line breaks, and it doesn't even try to determine anything related to encodings.
> Which of the provided information is only half correct?
UTF-16BE-noBOM; UTF-16LE-noBOM
These 2 TEXT files were detected as binary.
Well, okay. I won't bother you anymore.
But lastly: what about this strategy for UTF-16 LE and UTF-32 LE:
Code: Select all
#include <fstream>
#include <string>

// Returns true if the file starts with the UTF-32 LE BOM (FF FE 00 00) and
// every following 4-byte unit looks like a valid code point (<= 0x10FFFF).
bool isUTF32LE(const std::string& filePath) {
    std::ifstream file(filePath, std::ios::binary);
    unsigned char bom[4] = {};
    if (!file.read(reinterpret_cast<char*>(bom), 4)) {
        return false; // file missing or shorter than the BOM
    }
    if (bom[0] != 0xFF || bom[1] != 0xFE || bom[2] != 0x00 || bom[3] != 0x00) {
        return false;
    }
    // A UTF-16 LE file beginning with U+0000 has the same first four bytes,
    // so check the content too: in UTF-32 LE the top byte of every unit is 00
    // and the third byte is at most 0x10.
    unsigned char unit[4];
    while (file.read(reinterpret_cast<char*>(unit), 4)) {
        if (unit[3] != 0x00 || unit[2] > 0x10) {
            return false;
        }
    }
    return file.gcount() == 0; // length must be a multiple of four bytes
}

// Returns true if the file starts with the UTF-16 LE BOM (FF FE)
// and was not recognized as UTF-32 LE.
bool isUTF16LE(const std::string& filePath) {
    std::ifstream file(filePath, std::ios::binary);
    unsigned char bom[2] = {};
    if (!file.read(reinterpret_cast<char*>(bom), 2)) {
        return false;
    }
    return bom[0] == 0xFF && bom[1] == 0xFE && !isUTF32LE(filePath);
}

// Check UTF-32 LE first, because its BOM begins with the UTF-16 LE BOM.
const char* detectLittleEndianBom(const std::string& filePath) {
    if (isUTF32LE(filePath)) {
        return "UTF-32 LE";
    } else if (isUTF16LE(filePath)) {
        return "UTF-16 LE";
    } else {
        return "Other"; // other cases are not handled here
    }
}
#146217 personal license