LineBreakInfo - Content plugin for information about line break type, BOM type, number of CR/LF/CRLF occurrences

Dalai · Post by *Dalai » 2023-11-17, 15:12 UTC

AntonyD wrote: 2023-11-17, 14:50 UTC Those. OR the plugin correctly processes all test files, including information on encodings and markers.

Well, it doesn't detect any file encoding and it doesn't claim to do that. Hence the field is called "BOM Type" and not "Encoding" or similar.

For example, ONLY for line breaks. And it doesn’t even try to determine anything related to encodings.

In my opinion it does exactly that. It's required to check for a UTF-16 BOM to know how to correctly interpret the following stream of bytes. If it didn't check for a BOM, UTF-16 files would be detected as binary, even though they have a BOM. When the plugin knows a file has a BOM, why not return that information in a field?

UTF-16BE-noBOM; UTF-16LE-noBOM
these 2 TEXT files were detected as binary.

Correct. There is no way to tell such files apart from binary files without more sophisticated detection algorithms. That's the thing with "checking if a file is text or binary". You have to be told how to interpret the data, which is exactly what a BOM does. If there is no BOM, there's no way of knowing. And I'm not going to implement statistical analysis or something. BTW, UTF-8 doesn't have this problem because it is byte-oriented and has no endianness at all.
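To illustrate why BOM-less UTF-16 ends up in the "binary" bucket, here is a minimal Python sketch. It assumes a naive "any NUL byte means binary" check; `looks_binary` is a hypothetical helper for illustration only and is not taken from the plugin.

```python
def looks_binary(data: bytes) -> bool:
    """Naive heuristic (an assumption, not the plugin's code):
    treat any NUL byte as a sign of binary content."""
    return b"\x00" in data

text = "Hello\r\nWorld\r\n"

utf8 = text.encode("utf-8")
utf16le_nobom = text.encode("utf-16-le")    # this codec writes no BOM
utf16le_bom = b"\xff\xfe" + utf16le_nobom   # same bytes with a UTF-16 LE BOM in front

print(looks_binary(utf8))           # plain 8-bit text contains no NUL bytes
print(looks_binary(utf16le_nobom))  # every ASCII character carries a NUL high byte
```

With the BOM in front, a reader can be told up front to treat the stream as 16-bit code units and skip a byte-level check like this one. Without it, the NUL bytes are all such a simple check has to go on.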

I have an idea how to add UTF-32 support, but I won't promise anything just yet.

Regards
Dalai

Fla$her · Post by *Fla$her » 2023-11-17, 16:51 UTC

Dalai wrote: 2023-11-16, 22:40 UTC We'll see if I expand the BOM detection in the future. UTF-16 detection was important because of the difference in bytes per line-break character compared to ANSI/UTF-8.

And why is there a line break if the quote was about BOM? The marker is located simply and quickly. So I ask, why not look for all possible BOMs, regardless of the rarity of the encodings given in my link?

Dalai · Post by *Dalai » 2023-11-17, 17:26 UTC

Though I'm aware it's not an easy topic, I'm wondering if I'm expressing myself poorly.

Let's assume that my plugin would check for the existence of UTF-7 and UTF-1 BOMs. Now what about the line break count of such files? Don't you think that users are right to assume the line break count to be correct? I can't count any line breaks if I don't know how to interpret a stream of bytes, it's that simple. It's true that detecting a BOM is fast and simple. It's also the prerequisite to counting line breaks, I think you'll agree on that. But detecting a BOM without a correct line break count would be ... silly, or even stupid IMO.
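A rough Python sketch of why the interpretation has to be known before counting. `count_line_breaks` and its parameters are made up for this example and do not reflect the plugin's actual code; the BOM is assumed to have been stripped by the caller.

```python
def count_line_breaks(data: bytes, unit: int = 1, little_endian: bool = True) -> dict:
    """Count CRLF pairs, lone CRs and lone LFs in `data`.

    `data` is read as code units of `unit` bytes each:
    1 for ANSI/UTF-8, 2 for UTF-16, 4 for UTF-32.
    """
    order = "little" if little_endian else "big"
    counts = {"CRLF": 0, "CR": 0, "LF": 0}
    prev_cr = False  # True while the previous code unit was a CR not yet counted
    for i in range(0, len(data) - unit + 1, unit):
        cu = int.from_bytes(data[i:i + unit], order)
        if cu == 0x0A:
            # An LF directly after a CR completes a CRLF pair.
            counts["CRLF" if prev_cr else "LF"] += 1
            prev_cr = False
        else:
            if prev_cr:
                counts["CR"] += 1  # the pending CR turned out to be a lone CR
            prev_cr = (cu == 0x0D)
    if prev_cr:
        counts["CR"] += 1  # CR at the very end of the data
    return counts

sample = "a\r\nb\nc\r".encode("utf-16-le")
print(count_line_breaks(sample, unit=2))  # read as UTF-16 LE code units
print(count_line_breaks(sample, unit=1))  # same bytes read as 8-bit text
```

Read as 16-bit units, the sample yields one CRLF, one CR and one LF. Read byte by byte, the NUL between CR and LF splits the pair, so the very same bytes yield two lone CRs and two lone LFs.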

And to add to that, I need to be able to verify anything I implement. Currently I can't even do that for UTF-32 because I don't have a freeware (or otherwise non-paid) editor that can deal with such an encoding. I'm going to get one, but that takes time.

Regards
Dalai

Horst.Epp · Post by *Horst.Epp » 2023-11-17, 17:31 UTC

Dalai wrote: 2023-11-17, 17:26 UTC And to add to that, I need to be able to verify anything I implement. Currently I can't even do that for UTF-32 because I don't have a freeware (or otherwise non-paid) editor that can deal with such an encoding. I'm going to get one, but that takes time.

AkelPad supports UTF32 LE and BE

I like it as an Editor, and there is even a good Lister plugin for it.

Fla$her · Post by *Fla$her » 2023-11-17, 22:13 UTC

Dalai wrote: 2023-11-17, 17:26 UTC Don't you think that users are right to assume the line break count to be correct?

Where you can't use a counter, you can return 'Undefined'.
BOM is a separate option that has nothing to do with line breaks.

Dalai wrote: 2023-11-17, 17:26 UTC It's also the prerequisite to counting line breaks, I think you'll agree on that.

Didn't quite understand what is a prerequisite?

Dalai wrote: 2023-11-17, 17:26 UTC But detecting a BOM without a correct line break count would be ... silly, or even stupid IMO.

I completely disagree. Field data can coexist within the plugin, without obliging the user to combine them in columns or tooltips.

Dalai · Post by *Dalai » 2023-11-18, 00:06 UTC

Fla$her wrote: 2023-11-17, 22:13 UTC Where you can't use a counter, you can return 'Undefined'.

If I did that, the first question that would pop up here would be "Why does it return 'Undefined' for file X?". That question would be justified, but it can be avoided which I intend to do.

BOM is a separate option that has nothing to do with line breaks.

Well, it does in the code of my plugin. And I don't want to add additional code just to handle an exotic edge case.

Didn't quite understand what is a prerequisite?

Possible synonyms: requirement, precondition.

I completely disagree. Field data can coexist within the plugin, without obliging the user to combine them in columns or tooltips.

How a user uses the data provided by my plugin is not my concern. But I intend to provide complete data for all fields it does return. I either fully support file types with a BOM and line break count, or I don't support it at all.

Regards
Dalai

Fla$her · Post by *Fla$her » 2023-11-18, 01:24 UTC

Dalai wrote: 2023-11-18, 00:06 UTC That question would be justified, but it can be avoided which I intend to do.

The truth is that you have not avoided my question. No one will get better from rearranging the questions in places.

Dalai wrote: 2023-11-18, 00:06 UTC Well, it does in the code of my plugin.

It's not about the code, it's about how you try to relate one to the other.

Dalai wrote: 2023-11-18, 00:06 UTC Possible synonyms: requirement, precondition.

I didn't ask about the meaning of the word, but about what it indicates in the preceding text.

By the way, gvim also works with UTF-32. It's free.

Dalai · Post by *Dalai » 2023-11-18, 11:36 UTC

Fla$her wrote: 2023-11-18, 01:24 UTC The truth is that you have not avoided my question. No one will get better from rearranging the questions in places.

What are you even talking about?

It's not about the code, it's about how you try to relate one to the other.

And why would it be wrong to relate BOMs to line breaks when both of them are returned for each file? Users can (and probably will) expect the values to be correct - and rightfully so. After all, the plugin isn't called BOMinfo or something but LineBreakInfo.

I didn't ask about the meaning of the word, but about what it indicates in the preceding text.

Well, read the two sentences again:

Dalai wrote:It's true that detecting a BOM is fast and simple. It's also the prerequisite to counting line breaks, I think you'll agree on that.

I don't know how to phrase that any simpler.

Regards
Dalai

Dalai · Post by *Dalai » 2023-11-18, 13:45 UTC

Horst.Epp wrote: 2023-11-17, 17:31 UTC AkelPad supports UTF32 LE and BE

Fla$her wrote: 2023-11-18, 01:24 UTC By the way, gvim also works with UTF-32. It's free.

Thank you both. I'm going with EditPad Lite for now. I did use it many years ago before I switched to Notepad++.

I've implemented UTF-32 support. Currently I'm testing it extensively, and it looks good so far.

Regards
Dalai

Fla$her · Post by *Fla$her » 2023-11-18, 19:48 UTC

Dalai wrote: 2023-11-18, 11:36 UTC What are you even talking about?

That the argument from the quote is unconvincing.

Dalai wrote: 2023-11-18, 11:36 UTC Users can (and probably will) expect the values to be correct - and rightfully so.

The values for BOM of unsupported encodings are now incorrect, which users do not expect.

Dalai wrote: 2023-11-18, 11:36 UTC After all, the plugin isn't called BOMinfo or something but LineBreakInfo.

A convenient argument, not to dispute.

Dalai wrote: 2023-11-18, 11:36 UTC I don't know how to phrase that any simpler.

I'm not asking for a reformulation. I asked you to answer the question — what exactly is a condition? Speed and simplicity? Or what?

Dalai wrote: 2023-11-18, 13:45 UTC I've implemented UTF-32 support. Currently I'm testing it extensively, and it looks good so far.

+1 good news.

Dalai · Post by *Dalai » 2023-11-18, 21:35 UTC

Fla$her wrote: 2023-11-18, 19:48 UTCThe values for BOM of unsupported encodings are now incorrect, which users do not expect.

The plugin never claimed to support any BOM types other than the ones documented - right in the plugin (it's a multiple choice field) and in the readme. What gave you any other impression? Just the fact that it shows "None" for e.g. UTF-7?

I'm not asking for a reformulation. I asked you to answer the question — what exactly is a condition? Speed and simplicity? Or what?

I don't see what these questions have to do with the original statement that "detecting BOMs is a prerequisite for counting line breaks".

Regards
Dalai

Fla$her · Post by *Fla$her » 2023-11-18, 21:51 UTC

Dalai wrote: 2023-11-18, 21:35 UTC Just the fact that it shows "None" for e.g. UTF-7?

Exactly.

Dalai wrote: 2023-11-18, 21:35 UTC I don't see what these questions have to do with the original statement that "detecting BOMs is a prerequisite for counting line breaks".

This is not an original statement. Here is the original:

Fla$her wrote: 2023-11-17, 22:13 UTC
Dalai wrote: 2023-11-17, 17:26 UTCIt's also the prerequisite to counting line breaks, I think you'll agree on that.
Didn't quite understand what is a prerequisite?

You didn't specify that it's about BOM Detection. It's not clear from the context. Finally, the answer is received.

Post by *petermad » 2023-11-19, 11:06 UTC

No. Here is the original statement:

Dalai wrote: 2023-11-17, 17:26 UTC It's true that detecting a BOM is fast and simple. It's also the prerequisite to counting line breaks

The second "it's" refers to the previous sentence, explaining why the BOM is detected.

So I read it as: It is true that detecting BOM is fast and simple AND it is the prerequisite to counting line breaks.

Under-quoting is deceiving!

Dalai · Post by *Dalai » 2023-11-19, 11:49 UTC

Fla$her wrote: 2023-11-18, 21:51 UTC
Dalai wrote: 2023-11-18, 21:35 UTC Just the fact that it shows "None" for e.g. UTF-7?
Exactly.

I can't support every exotic, obsolete or niche BOM type out there. If a new BOM pops up next month, the field will also show "None", so there always will be cases where the output is wrong. Maybe I should explain somewhere that "None" means "None supported by the plugin" or "None that is known to the plugin". I could change "None" to "None/Unknown" but that would look kind of silly, especially for ANSI and binary files. IMO common sense should tell people that not every piece of software supports every aspect of something. Well, whatever I do, it's wrong for some people...

petermad wrote: 2023-11-19, 11:06 UTC So I read it as: It is true that detecting BOM is fast and simple AND it is the prerequisite to counting line breaks.

Yes, that's what I meant.

Regards
Dalai

AntonyD · Post by *AntonyD » 2023-11-19, 14:32 UTC

2Dalai
I promised not to interfere any more, but since you have managed to implement UTF-32 support, I have to ask: will you manage to separate UTF-16 LE and UTF-32 LE recognition? Considering that piece of C++ code I suggested you try?
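For context, the UTF-32 LE BOM (FF FE 00 00) begins with the same two bytes as the UTF-16 LE BOM (FF FE), so the order of the checks matters. Below is a minimal Python sketch of a longest-first lookup. The table, the `detect_bom` name and the label strings are illustrative assumptions, not the plugin's code or its field values, and this is not the C++ code mentioned above.

```python
# BOM signatures, longest first, so that UTF-32 LE is tested before UTF-16 LE.
BOMS = [
    (b"\xff\xfe\x00\x00", "UTF-32 LE"),
    (b"\x00\x00\xfe\xff", "UTF-32 BE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16 LE"),
    (b"\xfe\xff", "UTF-16 BE"),
]

def detect_bom(data: bytes):
    """Return (label, BOM length in bytes) for the first matching signature."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return "None", 0  # no BOM known to this table

print(detect_bom(b"\xff\xfe\x00\x00h\x00\x00\x00"))  # UTF-32 LE BOM followed by "h"
print(detect_bom(b"\xff\xfeh\x00"))                  # UTF-16 LE BOM followed by "h"
```

One ambiguity remains whatever the order: a UTF-16 LE file whose first character after the BOM is U+0000 starts with the same four bytes as a UTF-32 LE BOM. Real text files rarely begin that way, but the BOM alone cannot rule it out.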
