LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences

Fla$her · Post by *Fla$her » 2023-11-16, 20:16 UTC

AntonyD wrote: 2023-11-16, 18:31 UTC It is understood that the BOM marker is ONLY related to UTF-8 encoded files!!!

Dalai · Post by *Dalai » 2023-11-16, 21:24 UTC

AntonyD wrote: 2023-11-16, 18:25 UTC Here is my translation into Russian ;)

Thanks. I have included it in the .lng file and will upload a new archive once the HTML UTF-8 issue is cleared up.

AntonyD wrote: 2023-11-16, 18:27 UTC As a matter of fact, that's why I asked you to provide at least one example of such a file, where you or your plugin would find these mixed line breaks.

Well, as I said, you can create such files yourself by combining files with different line break types into one file. To make things easier and because the ApacheHaus.com domain is down right now (domain name can't be properly resolved by most DNS servers), I've uploaded a test file here.

Regarding the BOM the plugin works like this:

First it checks whether the first two bytes equal the UTF-16 LE or UTF-16 BE BOM. If either matches, "UTF-16 LE" or "UTF-16 BE" is returned accordingly.
Next it checks whether the first three bytes equal the UTF-8 BOM. If so, it returns "UTF-8".
Everything else is considered to have no BOM, and the plugin returns "None". This also applies to UTF-8 encoded files without a BOM. And as I said, the character encoding is not checked and is irrelevant.
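
For readers who want to see the three steps above in code, here is a minimal C++ sketch. It is not the plugin's source: the function name `bomType` and its buffer-based signature are assumptions made for illustration, and only the check order and the returned strings come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not the plugin's code): classifies a buffer by its BOM only.
// The character encoding of the remaining content is never inspected.
std::string bomType(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);

    // Step 1: the first two bytes against the UTF-16 BOMs.
    if (size >= 2) {
        if (data[0] == 0xFF && data[1] == 0xFE) return "UTF-16 LE";
        if (data[0] == 0xFE && data[1] == 0xFF) return "UTF-16 BE";
    }

    // Step 2: the first three bytes against the UTF-8 BOM (EF BB BF).
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return "UTF-8";
    }

    // Step 3: everything else, including UTF-8 text without a BOM.
    return "None";
}
```

With this order, a buffer that starts with FF FE 00 00 (the UTF-32 LE BOM) already matches in step 1 and is reported as "UTF-16 LE".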

There are several examples of files having a "BOM Type" other than "None" in my screenshot. Maybe the most known among them are .reg files for Windows systems; they're usually UTF-16 LE.

Regards
Dalai

Fla$her · Post by *Fla$her » 2023-11-16, 22:06 UTC

Dalai wrote: 2023-11-16, 21:24 UTC Regarding the BOM the plugin works like this:

Why don't you want to universalize the plugin for rarer encodings?

Thanks for the work, by the way.

Dalai · Post by *Dalai » 2023-11-16, 22:40 UTC

We'll see if I expand the BOM detection in the future. UTF-16 detection was important because of the difference in bytes per line-break character compared to ANSI/UTF-8. I had UTF-32 detection in there for a short period of time, but there are quite a lot of combinations possible of where a CR might be at the end of the buffer. And since it doesn't have much practical relevance I removed it.
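
To illustrate why the width of a line-break character matters, here is a rough C++ sketch that counts CR, LF and CRLF with either one byte or two bytes per code unit. The names `BreakCounts`, `Unit`, `unitAt` and `countBreaks` are made up for this example, and the sketch assumes the whole file sits in one buffer; it is not how the plugin is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical types for illustration only.
struct BreakCounts { std::size_t cr = 0, lf = 0, crlf = 0; };
enum class Unit { Byte, Utf16LE, Utf16BE };

// Reads one code unit starting at byte offset pos.
static std::uint16_t unitAt(const unsigned char* d, std::size_t pos, Unit u) {
    switch (u) {
        case Unit::Utf16LE: return static_cast<std::uint16_t>(d[pos] | (d[pos + 1] << 8));
        case Unit::Utf16BE: return static_cast<std::uint16_t>((d[pos] << 8) | d[pos + 1]);
        default:            return d[pos];
    }
}

BreakCounts countBreaks(const unsigned char* d, std::size_t size, Unit u) {
    assert(d != nullptr || size == 0);
    const std::size_t w = (u == Unit::Byte) ? 1 : 2;  // bytes per code unit
    BreakCounts c;
    for (std::size_t pos = 0; pos + w <= size; pos += w) {
        const std::uint16_t ch = unitAt(d, pos, u);
        if (ch == 0x000D) {
            // A CR counts as CRLF only if a complete LF unit follows it.
            const bool lfFollows = pos + 2 * w <= size && unitAt(d, pos + w, u) == 0x000A;
            if (lfFollows) { ++c.crlf; pos += w; } else { ++c.cr; }
        } else if (ch == 0x000A) {
            ++c.lf;
        }
    }
    return c;
}
```

A reader that works through the file in chunks would additionally have to remember a CR at the very end of one chunk and decide on CR versus CRLF only after seeing the next chunk. With four-byte units the number of ways a unit can be split across that boundary grows accordingly.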

Regards
Dalai

oOoZEUSoOo · Post by *oOoZEUSoOo » 2023-11-17, 02:56 UTC

2Dalai,

Thank you for your work ^^

Here is a French translation:

With complete words, but they take up a lot of space...

Code:

[fra]
;--- Line Breaks
Line Breaks=Sauts de ligne
N/A|Binary|LF|CR|CRLF|Mixed=Aucun|Binaire|LF|CR|CRLF|Mixte

;--- BOM Type
BOM Type=Type de BOM (Byte Order Mark)
None|UTF-8|UTF-16 LE|UTF-16 BE=Aucun|UTF-8|UTF-16 LE|UTF-16 BE

Binary Count=Nombre de caractères binaires
CR Count=Nombre de CR
LF Count=Nombre de LF
CRLF Count=Nombre de CRLF
FF Count=Nombre de FF
VT Count=Nombre de VT
Bytes Read=Octets Lus

With abbreviated words, taking less space...

Code:

[fra]
;--- Line Breaks
Line Breaks=Sauts de ligne
N/A|Binary|LF|CR|CRLF|Mixed=Aucun|Binaire|LF|CR|CRLF|Mixte

;--- BOM Type
BOM Type=Type de BOM (Byte Order Mark)
None|UTF-8|UTF-16 LE|UTF-16 BE=Aucun|UTF-8|UTF-16 LE|UTF-16 BE

Binary Count=Nbre Car binaires
CR Count=Nbre CR
LF Count=Nbre LF
CRLF Count=Nbre CRLF
FF Count=Nbre FF
VT Count=Nbre VT
Bytes Read=Octets Lus

I'm using the second one.

Best regards.

AntonyD · Post by *AntonyD » 2023-11-17, 08:09 UTC

I had UTF-32 detection in there for a short period of time, but there are quite a lot of combinations possible of where a CR might be at the end of the buffer. And since it doesn't have much practical relevance I removed it.

And it looks like that is why I got these results:
file_UTF-32BE.txt = Line breaks(Binary file), BOM-marker(NONE)
file_UTF-32LE.txt = Line breaks(Mixed types), BOM-marker(UTF-16LE)

These files in fact were absolutely the same txt file with pseudo-content with proper WIN line breaks (CRLF)
And for the encodings UTF-8, UTF-16LE and UTF-16BE, detection was also only partially correct:
Line breaks (CRLF) were detected properly for all 3 files, but then there is this "strange" BOM marker...
UTF-8 - was detected as NONE. Only UTF-16LE and UTF-16BE were detected properly.

AntonyD · Post by *AntonyD » 2023-11-17, 08:15 UTC

Ouch, it looks like Lister is not able to detect and correctly render the content of UTF-32LE/BE encoded files!
IMHO, it's a bug!

Post by *ghisler(Author) » 2023-11-17, 10:33 UTC

Lister does not support UTF-16 without BOM.

Dalai · Post by *Dalai » 2023-11-17, 10:47 UTC

2white
Thanks for moving the posts.

2oOoZEUSoOo
Thanks. I think I'm going to include both translations - the shorter one as [fra] and the longer one as [fra2] - so users can switch to either of them if they really want to.

AntonyD wrote: 2023-11-17, 08:09 UTC These files in fact were absolutely the same txt file with pseudo-content with proper WIN line breaks (CRLF)

That may be the case, but UTF-32 uses four bytes per character (or code point), and a lot of them are probably null bytes. Hence UTF-32 BE files are detected as binary. The BOMs of UTF-16 LE and UTF-32 LE are ambiguous in the first two bytes (they're identical), thus UTF-32 LE files are detected as UTF-16 LE.
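
The byte layout behind this can be made visible with a small C++ sketch that encodes ASCII text as UTF-32 with a leading BOM. The helpers `encodeUtf32` and `countNullBytes` are hypothetical and exist only for this illustration; how the plugin actually decides that a file is binary is not shown in this thread, so the link to null bytes is an assumption based on the explanation above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: encodes ASCII text as UTF-32, preceded by the BOM (U+FEFF).
std::vector<unsigned char> encodeUtf32(const std::string& ascii, bool littleEndian) {
    std::vector<unsigned char> out;
    // Appends one code point as four bytes in the requested byte order.
    auto put = [&](unsigned long cp) {
        for (int i = 0; i < 4; ++i) {
            const int shift = littleEndian ? 8 * i : 8 * (3 - i);
            out.push_back(static_cast<unsigned char>((cp >> shift) & 0xFF));
        }
    };
    put(0xFEFF);  // LE: FF FE 00 00, BE: 00 00 FE FF
    for (unsigned char ch : ascii) {
        assert(ch < 0x80);  // the sketch handles ASCII input only
        put(ch);
    }
    return out;
}

// Counts the 0x00 bytes in a buffer.
std::size_t countNullBytes(const std::vector<unsigned char>& bytes) {
    std::size_t n = 0;
    for (unsigned char b : bytes) {
        if (b == 0x00) ++n;
    }
    return n;
}
```

For ASCII text, three of every four bytes are null in either byte order. The little-endian output starts with FF FE, exactly like a UTF-16 LE BOM, while the big-endian output starts with 00 00, so a two-byte BOM check finds nothing there.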

UTF-8 - was detected as NONE.

Then the file has no BOM. Please keep in mind that Lister and a lot of editors check the file's encoding, and may choose to use UTF-8 based on their detection. But UTF-8 is not the same as UTF-8 with BOM. And this plugin only cares about the BOM, as I said several times now.

Regards
Dalai

AntonyD · Post by *AntonyD » 2023-11-17, 11:53 UTC

2ghisler(Author)

Lister does not support UTF-16 without BOM.

What was this explanation in reply to? I did not write anything strange about the UTF-16 encoding )))
The problem was found only for the UTF-32 variants!
But after your clarification I checked it, and yes, Lister shows nothing correctly for UTF-16BE w/o BOM.
BUT UTF-16LE w/o BOM it shows perfectly well.

I.e. we have 5 problems with Lister displaying text files?
For UTF-16BE w/o BOM, UTF-32LE, UTF-32BE, UTF-32LE w/o BOM, UTF-32BE w/o BOM.
How do you plan to fix that?

AntonyD · Post by *AntonyD » 2023-11-17, 11:55 UTC

2Dalai

Then the file has no BOM.

It has it.
https://pixeldrain.com/u/ZNKpj5D3

https://ibb.co/8sRSS33
that's what you should get also during the testing phase.

Dalai · Post by *Dalai » 2023-11-17, 12:19 UTC

AntonyD wrote: 2023-11-17, 11:55 UTC It has it.
https://pixeldrain.com/u/ZNKpj5D3

https://ibb.co/8sRSS33
that's what you should get also during the testing phase.

I get the same results, and all of them are as I expect them. Files without a BOM are detected as having no BOM - which is correct. UTF-32 BE files are detected as binary and UTF-32 LE files as UTF-16 LE and I explained the reasons above.

So which of these results is wrong in your opinion?

Regards
Dalai

AntonyD · Post by *AntonyD » 2023-11-17, 13:36 UTC

If I read what you explain, it turns out that the point of the plugin is pretty much lost ((((
Because every discrepancy with expectations simply gets a justification, but the result does not get any better for it.
for ex.:

The BOMs of UTF-16 LE and UTF-32 LE are ambiguous in the first two bytes (they're identical), thus UTF-32 LE files are detected as UTF-16 LE.

very simple explanation - But what about the fact that I expect to see the ACCURATE data?
After all, full-fledged text editors were able to determine (when I opened the same files) their data - and do that accurate!
If you don't control the definition of the encoding, then what's the point of fields with half-correct information?
In this case, only the calculation and type of line breaks should be done by the plugin, IMHO.
Well, what's the use of displaying UTF-16 LE for what is in fact a UTF-32 LE file?

Dalai · Post by *Dalai » 2023-11-17, 13:54 UTC

AntonyD wrote: 2023-11-17, 13:36 UTC very simple explanation - But what about the fact that I expect to see the ACCURATE data?

Let me ask again: Which of these results is wrong or inaccurate in your opinion? And let's leave out UTF-32 as that isn't supported and not of much relevance anyway.

After all, full-fledged text editors were able to determine (when I opened the same files) their data - and do that accurate!

Yes, as I said, Lister and proper editors detect the file's encoding based on its content in addition to the BOM, but my plugin only checks the BOM. I don't intend to implement an encoding check because there is already a plugin for that: EncInfo.
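
To show what a content-based check looks like, as opposed to a BOM check, here is a minimal C++ sketch of a simplified UTF-8 well-formedness test. The function name `looksLikeUtf8` is hypothetical. This is neither EncInfo's nor Lister's algorithm, and it deliberately skips some checks (overlong three- and four-byte forms, surrogates) that a real validator needs.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified check: does every byte sequence have a valid UTF-8 shape?
// It looks at the content only and ignores whether a BOM is present.
bool looksLikeUtf8(const unsigned char* d, std::size_t size) {
    assert(d != nullptr || size == 0);
    std::size_t i = 0;
    while (i < size) {
        const unsigned char b = d[i];
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80) {
            extra = 0;                          // ASCII
        } else if (b >= 0xC2 && b <= 0xDF) {
            extra = 1;                          // two-byte sequence
        } else if (b >= 0xE0 && b <= 0xEF) {
            extra = 2;                          // three-byte sequence
        } else if (b >= 0xF0 && b <= 0xF4) {
            extra = 3;                          // four-byte sequence
        } else {
            return false;                       // stray continuation or invalid lead byte
        }
        if (i + extra >= size) return false;    // sequence is cut off at the end
        for (std::size_t k = 1; k <= extra; ++k) {
            if ((d[i + k] & 0xC0) != 0x80) return false;  // must be of the form 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}
```

A file can pass a check like this and still have no BOM, which is the case where an editor shows "UTF-8" while a BOM-only check reports "None".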

If you don't control the definition of the encoding, then what's the point of fields with half-correct information?

Which of the provided information is only half correct? Again, excluding UTF-32.

Regards
Dalai

AntonyD · Post by *AntonyD » 2023-11-17, 14:50 UTC

Which of these results is wrong or inaccurate in your opinion? And let's leave out UTF-32 as that isn't supported and not of much relevance anyway.

This is where my expectations and reality collide.
I.e. EITHER the plugin correctly processes all test files, including information on encodings and markers,
OR the plugin is tailored to something very valuable and unique.
For example, ONLY to line breaks. And it doesn't even try to determine anything related to encodings.

Which of the provided information is only half correct?

UTF-16BE-noBOM; UTF-16LE-noBOM
these 2 TEXT files were detected as binary.

Well, okay. I won't bother you anymore.

But lastly: what about this strategy for UTF-16 LE and UTF-32 LE:

Code:

#include <cstddef>
#include <fstream>
#include <string>

// Reads up to the first four bytes of the file and returns how many were read
// (0 if the file could not be opened).
static std::size_t readPrefix(const std::string& filePath, unsigned char (&prefix)[4]) {
    std::ifstream file(filePath, std::ios::binary);
    if (!file.is_open()) {
        return 0;
    }
    file.read(reinterpret_cast<char*>(prefix), 4);
    return static_cast<std::size_t>(file.gcount());
}

bool isUTF16LE(const std::string& filePath) {
    unsigned char p[4] = {};
    const std::size_t n = readPrefix(filePath, p);

    // Check for the UTF-16 LE BOM (FF FE)
    if (n < 2 || p[0] != 0xFF || p[1] != 0xFE) {
        return false;
    }

    // Check for null bytes to distinguish between UTF-16 LE and UTF-32 LE:
    // FF FE followed by 00 00 is the UTF-32 LE BOM.
    // (A UTF-16 LE file whose first character is U+0000 would be misjudged here.)
    return !(n == 4 && p[2] == 0x00 && p[3] == 0x00);
}

bool isUTF32LE(const std::string& filePath) {
    unsigned char p[4] = {};

    // Check for the UTF-32 LE BOM (FF FE 00 00); all four bytes must be present
    return readPrefix(filePath, p) == 4 &&
           p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00;
}

    // Usage fragment (inside some function that has filePath):
    if (isUTF16LE(filePath)) {
        //blah-blah;
    } else if (isUTF32LE(filePath)) {
        //blah-blah;
    } else {
        //another things......
    }
