Search a text file for duplicate text

English support forum

Moderators: Hacker, petermad, Stefan2, white

sitealpha
Junior Member
Posts: 3
Joined: 2022-05-21, 12:33 UTC

Search a text file for duplicate text

Post by *sitealpha »

I have a .txt file. Each line is a url with a space between lines. Is there a way to search the file to find duplicate urls?
Dalai
Power Member
Posts: 9943
Joined: 2005-01-28, 22:17 UTC
Location: Meiningen (Südthüringen)

Re: Search a text file for duplicate text

Post by *Dalai »

sitealpha wrote: 2025-03-03, 07:29 UTC Each line is a url with a space between lines.
Huh? Lines are by definition separated by line breaks, usually CRLF or just LF (I'm ignoring other breaks such as FF on purpose).
sitealpha wrote: 2025-03-03, 07:29 UTC Is there a way to search the file to find duplicate urls?
With TC? Not that I'm aware of, because that would require not just a simple content analysis (does a file contain something) but a more sophisticated approach, relating lines to each other.

Here's what I would do: Sort the file and then let a dedicated tool do its job; maybe both steps can be done with a single tool. On Linux this can be done with a very simple and straightforward command:

Code: Select all

sort -u somefile.txt
If one really wants to use two commands in a pipe, it can be done like this:

Code: Select all

sort somefile.txt | uniq
Windows has a Sort(.exe) command, too, but MS apparently didn't see the need to implement such a switch that prints out unique lines...

Maybe Notepad++ with the TextFX plugin can help you. I have that installed and it can sort files, but I'm not sure about unique lines.
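
The sort-then-compare idea above can also be turned around to report the duplicated lines instead of dropping them, comparable in spirit to `sort somefile.txt | uniq -d`. What follows is only a minimal Python sketch, not something posted in the thread. The function name `duplicated_lines` and the sample URLs are made up for illustration, and it assumes blank lines are separators that should be ignored.

```python
from itertools import groupby

def duplicated_lines(text):
    """Return each line that occurs more than once, in sorted order."""
    # Drop blank separator lines and surrounding whitespace before comparing.
    lines = sorted(line.strip() for line in text.splitlines() if line.strip())
    # After sorting, identical lines are adjacent, so groupby collects each run.
    return [line for line, run in groupby(lines) if len(list(run)) > 1]

# Hypothetical sample: one URL per line with a blank line in between.
sample = "http://a.example/\n\nhttp://b.example/\n\nhttp://a.example/\n"
print(duplicated_lines(sample))
```

Because `splitlines()` accepts both CRLF and LF, the sketch should not care which line break the file uses.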
#101164 Personal licence
Ryzen 5 2600, 16 GiB RAM, ASUS Prime X370-A, Win7 x64

Plugins: Services2, Startups, CertificateInfo, SignatureInfo, LineBreakInfo - Download-Mirror
white
Power Member
Posts: 5744
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Re: Search a text file for duplicate text

Post by *white »

sitealpha wrote: 2025-03-03, 07:29 UTC I have a .txt file. Each line is a url with a space between lines. Is there a way to search the file to find duplicate urls?
What's the purpose? Do you only want to search for duplicate lines, or do you want to edit the file and remove duplicates?

If you want to use TC's Lister to search for duplicate lines, that's not possible without a Lister plugin. You could for example use CudaLister and search using a regular expression. You can't do the same with TC's internal Lister because regular expression search in TC is within each line.

If you want to edit the file, I suggest using an editor. Notepad++, for example, has a built-in function to remove duplicate lines, and you can use various regular expressions to search for duplicate lines, depending on your needs.

Sample regular expressions that work in Notepad++ and probably other editors:

Find first occurrence of a line which has a duplicate:

Code: Select all

(?-s)^(.+)$\R(?s)(?=.*?^\1$)
Same, but match the second occurrence:

Code: Select all

(?-s)^(.+)$(?s).*?\K^\1$\R?
Be careful with this last one. Know what you are doing or you could miss duplicate lines. This one is not suitable for a global replace or find all.
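
For anyone who would rather script this than use an editor, the first expression can be approximated in Python. This is a hedged sketch rather than a drop-in equivalent: Python's `re` module supports neither `\R` nor `\K`, so it assumes LF line breaks and replaces `(?s).*?` with `[\s\S]*?`. The names `FIRST_OF_DUPLICATE` and `lines_with_later_duplicate` are invented for this example.

```python
import re

# Assumption: line breaks are "\n"; Python's re module has no \R or \K.
FIRST_OF_DUPLICATE = re.compile(r"^(.+)$\n(?=[\s\S]*?^\1$)", re.MULTILINE)

def lines_with_later_duplicate(text):
    """Return lines that occur again further down, each reported once."""
    found = []
    for match in FIRST_OF_DUPLICATE.finditer(text):
        line = match.group(1)
        if line not in found:  # a line present three times would match twice
            found.append(line)
    return found

# Hypothetical sample using the blank-line layout discussed in this thread.
sample = "abc\n\ndef\n\nabc\n\nghi\n\ndef\n"
print(lines_with_later_duplicate(sample))
```

Blank lines never match because `(.+)` needs at least one character. The lookahead rescans the rest of the text for every line, so this is likely to be slow on very large files.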
ZoSTeR
Power Member
Posts: 1049
Joined: 2004-07-29, 11:00 UTC

Re: Search a text file for duplicate text

Post by *ZoSTeR »

 
I'm using NotePad4 for such tasks.

It offers the following options (Edit -> Lines -> Sort Lines..):

Sort / Don't Sort
Merge dupes, Remove dupes, Remove uniques
NotNull
Senior Member
Posts: 298
Joined: 2019-11-25, 20:43 UTC
Location: NL

Re: Search a text file for duplicate text

Post by *NotNull »

Dalai wrote: 2025-03-03, 11:27 UTC Huh?
Had to read it 5 times, but I think this is the layout:

Code: Select all

abc

def

ghi
[...]

Dalai wrote: 2025-03-03, 11:27 UTC Windows has a Sort(.exe) command, too, but MS apparently didn't see the need to implement such a switch that prints out unique lines...
They did implement it (IIRC in Win7), but didn't bother to document it:

Code: Select all

sort /unique somefile.txt

white wrote: 2025-03-03, 12:27 UTC

Code: Select all

\R
Nice! Better than the \r?\n I used. Thanks!
FWIW: my solution to find the first double: (?s)(^|\R)(.+)\R.*?\R\2(\R|$)
(don't know if this works in TC as it uses some quirky regex-dialect.)



sitealpha wrote: 2025-03-03, 07:29 UTC Is there a way to search the file to find duplicate urls?
I would not use TC for this, but PowerShell:

Code: Select all

gc ".\yourfile.txt" | group | where {$_.Count -gt 1} | Select Count,Name
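
The same group-and-count idea can be sketched in Python for readers without PowerShell. This is an illustration only: `duplicate_counts` and the sample text are hypothetical. Unlike the one-liner above, it deliberately skips blank lines, on the assumption that the empty separator lines should not be reported as duplicates.

```python
from collections import Counter

def duplicate_counts(text):
    """Return (count, line) pairs for lines occurring more than once."""
    # Skip blank lines so the empty separators are not reported as duplicates.
    counts = Counter(line.strip() for line in text.splitlines() if line.strip())
    return [(n, line) for line, n in counts.items() if n > 1]

# Hypothetical sample in the blank-line layout discussed in this thread.
sample = "abc\n\ndef\n\nabc\n\nabc\n"
for count, line in duplicate_counts(sample):
    print(count, line)
```

To run it against a real file, the text could be read with something like `open("yourfile.txt", encoding="utf-8").read()`; the encoding is a guess and may need adjusting.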
white
Power Member
Posts: 5744
Joined: 2003-11-19, 08:16 UTC
Location: Netherlands

Re: Search a text file for duplicate text

Post by *white »

NotNull wrote: 2025-03-03, 22:48 UTC (don't know if this works in TC as it uses some quirky regex-dialect.)
Like I said, regular expression search in TC is within each line. As noted in the help:
"The other modificators are not relevant for Total Commander, because the program only supports searching within one line."
Dalai
Power Member
Posts: 9943
Joined: 2005-01-28, 22:17 UTC
Location: Meiningen (Südthüringen)

Re: Search a text file for duplicate text

Post by *Dalai »

NotNull wrote: 2025-03-03, 22:48 UTC Had to read it 5 times, but I think this is the layout:

Code: Select all

abc

def

ghi
[...]
Ah, that would indeed fit the description.
NotNull wrote: 2025-03-03, 22:48 UTC They did implement it (IIRC in Win7), but didn't bother to document it:

Code: Select all

sort /unique somefile.txt
It looks like there's a lot MS doesn't document. This switch doesn't work on Win7 and 8.1 (prints "Invalid Option"), but it does work on Win10. Sort's documentation mentions a lot of abbreviations for its options (/L = /Locale, /R = /Reverse, etc.), but /U is not the same as /unique. The (full) documentation on SS64.com says that /u prints Unicode characters and /uniq is the same as /unique. Man, this software world is getting weirder every day...

Thanks anyway for mentioning this switch :).
sitealpha
Junior Member
Posts: 3
Joined: 2022-05-21, 12:33 UTC

Re: Search a text file for duplicate text

Post by *sitealpha »

Thank you all for your help. I think I am going to start with Notepad++ and try the merge.
I have never used the sort command either. I'm sure one of these will get rid of all the duplicate lines. Thanks again.