Search a text file for duplicate text

sitealpha · Post by *sitealpha » 2025-03-03, 07:29 UTC

I have a .txt file. Each line is a url with a space between lines. Is there a way to search the file to find duplicate urls?

Dalai · Post by *Dalai » 2025-03-03, 11:27 UTC

sitealpha wrote: 2025-03-03, 07:29 UTC Each line is a url with a space between lines.

Huh? Lines are by definition separated by line breaks, usually CRLF or just LF (I'm ignoring other breaks such as FF on purpose).

Is there a way to search the file to find duplicate urls?

With TC? Not that I'm aware of, because that would require not just a simple content analysis (does a file contain something) but a more sophisticated approach, relating lines to each other.

Here's what I would do: Sort the file and then let a dedicated tool do its job; maybe both steps can be done with a single tool. On Linux this can be done with a very simple and straightforward command:

Code: Select all

sort -u somefile.txt

If one really wants to use two commands in a pipe, it can be done like this:

Code: Select all

sort somefile.txt | uniq
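Note that both commands above print the de-duplicated list rather than the duplicates themselves; piping the sorted output through `uniq -d` instead prints only the repeated lines. For anyone without those tools, the same sort-then-compare-neighbours idea could be sketched in Python roughly like this. The function name `duplicates_after_sort` is made up for illustration, and skipping blank lines is an assumption based on the file apparently having an empty line between URLs.

```python
def duplicates_after_sort(lines):
    """Return each line that occurs more than once, in sorted order."""
    # Strip line endings and skip blank lines (assumption: the file has
    # an empty line between URLs, which should not count as a duplicate).
    cleaned = sorted(line.strip() for line in lines if line.strip())
    dupes = []
    # After sorting, equal lines sit next to each other, so comparing
    # each line with its predecessor is enough.
    for previous, current in zip(cleaned, cleaned[1:]):
        if current == previous and (not dupes or dupes[-1] != current):
            dupes.append(current)
    return dupes
```

It can be fed an open file directly, e.g. `duplicates_after_sort(open("somefile.txt", encoding="utf-8"))`, where the encoding is a guess about the file.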

Windows has a Sort(.exe) command, too, but MS apparently didn't see the need to implement such a switch that prints out unique lines...

Maybe Notepad++ with the TextFX plugin can help you. I have that installed and it can sort files, but I'm not sure about unique lines.

Post by *white » 2025-03-03, 12:27 UTC

sitealpha wrote: 2025-03-03, 07:29 UTC I have a .txt file. Each line is a url with a space between lines. Is there a way to search the file to find duplicate urls?

What's the purpose? Do you only want to search for duplicate lines, or do you want to edit the file and remove duplicates?

If you want to use TC's Lister to search for duplicate lines, that's not possible without a Lister plugin. You could for example use CudaLister and search using a regular expression. You can't do the same with TC's internal Lister because regular expression search in TC is within each line.

If you want to edit the file, I suggest using an editor. Notepad++, for example, has a built-in function to remove duplicate lines, and you can use various regular expressions to search for duplicate lines, depending on your needs.

Sample regular expressions that work in Notepad++ and probably other editors:

Find first occurrence of a line which has a duplicate:

Code: Select all

(?-s)^(.+)$\R(?s)(?=.*?^\1$)

Same, but match the second occurrence:

Code: Select all

(?-s)^(.+)$(?s).*?\K^\1$\R?

Be careful with this last one. Know what you are doing or you could miss duplicate lines. This one is not suitable for a global replace or find all.
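For trying the first expression outside an editor, here is a rough Python adaptation. It is a sketch, not the same pattern: Python's built-in `re` module supports neither `\R` nor `\K`, so `\n` stands in for `\R` (after normalising CRLF) and the scoped flag `(?s:...)` stands in for the `(?-s)`/`(?s)` switches. The names `FIRST_OF_DUPLICATE` and `first_occurrences` are made up for illustration.

```python
import re

# Assumed adaptation of the "first occurrence" expression:
#   ^(.+)$\n         one whole non-empty line plus its line break
#   (?=(?s:.*?)^\1$) followed, anywhere later, by an identical line
FIRST_OF_DUPLICATE = re.compile(r"^(.+)$\n(?=(?s:.*?)^\1$)", re.MULTILINE)


def first_occurrences(text):
    """Return every line that appears again later in the text."""
    # Normalise Windows line endings so "\n" is the only line break.
    text = text.replace("\r\n", "\n")
    return [match.group(1) for match in FIRST_OF_DUPLICATE.finditer(text)]
```

If a line occurs three times, this reports it twice (every occurrence except the last one), and blank lines are never reported because `.+` needs at least one character. The lookahead rescans the rest of the text for each line, so this is only practical for small files.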

ZoSTeR · Post by *ZoSTeR » 2025-03-03, 15:17 UTC

I'm using NotePad4 for such tasks.

It offers the following options (Edit -> Lines -> Sort Lines..):

Sort / Don't Sort
Merge dupes, Remove dupes, Remove uniques
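
To make the difference between those three options concrete, here is a small Python sketch of what they appear to mean. This is only a reading of the option names, not taken from the Notepad4 documentation, and the function names are made up to mirror the menu labels.

```python
from collections import Counter


def merge_dupes(lines):
    """Keep the first copy of every line, preserving order (assumed meaning)."""
    return list(dict.fromkeys(lines))


def remove_dupes(lines):
    """Drop every line that occurs more than once (assumed meaning)."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] == 1]


def remove_uniques(lines):
    """Keep only the lines that occur more than once (assumed meaning)."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] > 1]
```

Under that reading, "Remove uniques" is the option that shows which URLs are duplicated, while "Merge dupes" is the one that cleans the list.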

NotNull · Post by *NotNull » 2025-03-03, 22:48 UTC

Dalai wrote: 2025-03-03, 11:27 UTC Huh?

Had to read it 5 times, but I think this is the layout:

Code: Select all

abc

def

ghi
[...]

Windows has a Sort(.exe) command, too, but MS apparently didn't see the need to implement such a switch that prints out unique lines...

They did implement it (IIRC in Win7), but didn't bother to document it:

Code: Select all

sort /unique somefile.txt

white wrote: 2025-03-03, 12:27 UTC
Code: Select all
\R

Nice! Better than the \r?\n I used. Thanks!
FWIW: my solution to find the first double: (?s)(^|\R)(.+)\R.*?\R\2(\R|$)
(I don't know if this works in TC, as it uses a quirky regex dialect.)

sitealpha wrote: 2025-03-03, 07:29 UTC Is there a way to search the file to find duplicate urls?

I would not use TC for this, but PowerShell:

Code: Select all

gc ".\yourfile.txt" | group | where {$_.Count -gt 1} | Select Count,Name
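
The same group-and-count idea can be sketched in Python for anyone not on Windows. The function names are made up for illustration and the UTF-8 encoding is an assumption about the file. Unlike the one-liner above, this skips blank lines, since with an empty line between URLs the empty line would presumably show up as the biggest "duplicate".

```python
from collections import Counter
from pathlib import Path


def count_duplicates(lines):
    """Map each line that occurs more than once to its number of occurrences."""
    # Strip line endings and ignore blank lines before counting.
    counts = Counter(line.strip() for line in lines if line.strip())
    return {line: n for line, n in counts.items() if n > 1}


def count_duplicates_in_file(path):
    """Same as count_duplicates, reading the lines from a text file."""
    # Assumption: the file is UTF-8 encoded.
    with Path(path).open(encoding="utf-8") as handle:
        return count_duplicates(handle)
```

Calling `count_duplicates_in_file("yourfile.txt")` returns a dictionary of duplicated URLs and how often each one occurs.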

Post by *white » 2025-03-03, 23:54 UTC

NotNull wrote: 2025-03-03, 22:48 UTC (don't know if this works in TC as it uses some quirky regex-dialect.

Like I said, regular expression search in TC is within each line. As noted in the help:

The other modificators are not relevant for Total Commander, because the program only supports searching within one line.

Dalai · Post by *Dalai » 2025-03-04, 01:02 UTC

NotNull wrote: 2025-03-03, 22:48 UTC Had to read it 5 times, but I think this is the layout:
Code: Select all
abc

def

ghi
[...]

Ah, that would indeed fit the description.

They did implement it (IIRC in Win7), but didn't bother to document it:
Code: Select all
sort /unique somefile.txt

It looks like there's a lot MS doesn't document. This switch doesn't work on Win7 and 8.1 (prints "Invalid Option"), but it does work on Win10. Sort's documentation mentions a lot of abbreviations for its options (/L = /Locale, /R = /Reverse, etc.), but /U is not the same as /unique. The (full) documentation on SS64.com says that /u prints Unicode characters and /uniq is the same as /unique. Man, this software world is getting weirder every day...

Thanks anyway for mentioning this switch.

sitealpha · Post by *sitealpha » 2025-03-04, 06:07 UTC

Thank you all for your help. I think I am going to start with Notepad++ and try the merge.
I have never used the sort command either. I'm sure one of these will get rid of all the duplicate lines. Thanks again.
