Search a text file for duplicate text
I have a .txt file. Each line is a url with a space between lines. Is there a way to search the file to find duplicate urls?
Re: Search a text file for duplicate text
Huh? Lines are by definition separated by line breaks, usually CRLF or just LF (I'm ignoring other breaks such as FF on purpose).
"Is there a way to search the file to find duplicate urls?"
With TC? Not that I'm aware of, because that would require not just a simple content analysis (does a file contain something) but a more sophisticated approach, relating lines to each other.
Here's what I would do: Sort the file and then let a dedicated tool do its job; maybe both steps can be done with a single tool. On Linux this can be done with a very simple and straightforward command:
Code: Select all
sort -u somefile.txt
or
Code: Select all
sort somefile.txt | uniq
Maybe Notepad++ with the TextFX plugin can help you. I have that installed and it can sort files, but I'm not sure about unique lines.
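For readers without sort and uniq at hand, the sort-then-compare idea can be sketched in Python. This is only an illustration under stated assumptions: `duplicate_lines` is a made-up helper name, and it lists the lines that occur more than once (roughly what `sort somefile.txt | uniq -d` would print) rather than the de-duplicated file that `sort -u` produces.

```python
from itertools import groupby


def duplicate_lines(lines):
    """Return each line that occurs more than once, in sorted order."""
    # Strip line endings so CRLF and LF files compare the same way.
    cleaned = sorted(line.rstrip("\r\n") for line in lines)
    # After sorting, equal lines sit next to each other, so groupby
    # can collect each run of identical lines.
    return [text for text, run in groupby(cleaned) if len(list(run)) > 1]


if __name__ == "__main__":
    # Hypothetical sample data standing in for the URL file.
    sample = ["http://b.example\n", "http://a.example\n", "http://b.example\n"]
    print(duplicate_lines(sample))  # prints ['http://b.example']
```

If the file really has an empty line between URLs, the empty line itself will show up as a duplicate with this approach.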
#101164 Personal licence
Ryzen 5 2600, 16 GiB RAM, ASUS Prime X370-A, Win7 x64
Plugins: Services2, Startups, CertificateInfo, SignatureInfo, LineBreakInfo - Download-Mirror
Re: Search a text file for duplicate text
sitealpha wrote: 2025-03-03, 07:29 UTC
"I have a .txt file. Each line is a url with a space between lines. Is there a way to search the file to find duplicate urls?"
What's the purpose? Do you only want to search for duplicate lines, or do you want to edit the file and remove duplicates?
If you want to use TC's Lister to search for duplicate lines, that's not possible without a Lister plugin. You could for example use CudaLister and search using a regular expression. You can't do the same with TC's internal Lister because regular expression search in TC is within each line.
If you want to edit the file, I suggest using an editor. Notepad++, for example, has a built-in function to remove duplicate lines, and you can use various regular expressions to search for duplicate lines, depending on your needs.
Sample regular expressions that work in Notepad++ and probably other editors:
Find first occurrence of a line which has a duplicate:
Code: Select all
(?-s)^(.+)$\R(?s)(?=.*?^\1$)
Find the next later occurrence of a line that appeared earlier:
Code: Select all
(?-s)^(.+)$(?s).*?\K^\1$\R?
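Outside an editor, the first pattern can be tried from a script. The following is a rough Python adaptation, not a drop-in copy: Python's `re` module has no `\R` or `\K` and cannot switch flags mid-pattern, so `\R` becomes `\n` after normalising line breaks and the `(?s)` switch becomes a scoped `(?s:...)` group. The function name is hypothetical.

```python
import re

# Adapted from the "first occurrence" pattern: a whole line, its line
# break, and a lookahead that requires the same whole line further down.
FIRST_OF_DUPLICATES = re.compile(r"^(.+)$\n(?=(?s:.*?)^\1$)", re.MULTILINE)


def lines_with_later_duplicate(text):
    """Return every line of text that occurs again further down."""
    # Normalise CRLF to LF so that \n matches every line break.
    text = text.replace("\r\n", "\n")
    return [match.group(1) for match in FIRST_OF_DUPLICATES.finditer(text)]


if __name__ == "__main__":
    print(lines_with_later_duplicate("a\nb\na\nc\nb\n"))  # prints ['a', 'b']
```

Because the lookahead rescans the rest of the text for every line, this is fine for a list of URLs but slow on very large files.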
Re: Search a text file for duplicate text
I'm using NotePad4 for such tasks.
It offers the following options (Edit -> Lines -> Sort Lines..):
Sort / Don't Sort
Merge dupes, Remove dupes, Remove uniques
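To make the difference between the three options concrete, here is a small Python sketch of what they plausibly do. The semantics are an assumption read off the option names, not taken from the editor's documentation, and the function names are invented for the example.

```python
from collections import Counter


def merge_dupes(lines):
    """Keep the first copy of every line and drop the later copies."""
    # dict.fromkeys preserves first-seen order while removing repeats.
    return list(dict.fromkeys(lines))


def remove_dupes(lines):
    """Drop every line that occurs more than once, all copies included."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] == 1]


def remove_uniques(lines):
    """Keep only the lines that occur more than once."""
    counts = Counter(lines)
    return [line for line in lines if counts[line] > 1]


if __name__ == "__main__":
    urls = ["a", "b", "a", "c"]
    print(merge_dupes(urls))     # prints ['a', 'b', 'c']
    print(remove_dupes(urls))    # prints ['b', 'c']
    print(remove_uniques(urls))  # prints ['a', 'a']
```

Under this reading, "Remove uniques" is the option that answers the original question, since it leaves only the duplicated URLs.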
Re: Search a text file for duplicate text
Had to read it 5 times, but I think this is the layout:
Code: Select all
abc
def
ghi
[...]
"Windows has a Sort(.exe) command, too, but MS apparently didn't see the need to implement such a switch that prints out unique lines..."
They did implement it (IIRC in Win7), but didn't bother to document it:
Code: Select all
sort /unique somefile.txt
Nice! Better than the \r?\n I used. Thanks!
FWIW: my solution to find the first double: (?s)(^|\R)(.+)\R.*?\R\2(\R|$)
(don't know if this works in TC as it uses some quirky regex-dialect.)
I would not use TC for this, but PowerShell:
Code: Select all
gc ".\yourfile.txt" | group | where {$_.Count -gt 1} | Select Count,Name
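The same group-and-count idea, sketched in Python for anyone not on PowerShell. Assumptions: the function name `duplicate_counts` is made up, the file is read as UTF-8, and (unlike the one-liner above) blank lines are skipped so the empty separator lines between URLs are not reported.

```python
from collections import Counter


def duplicate_counts(path):
    """Return (count, line) pairs for lines that occur more than once."""
    with open(path, encoding="utf-8") as handle:
        # Strip surrounding whitespace and ignore blank separator lines.
        lines = [line.strip() for line in handle if line.strip()]
    counts = Counter(lines)
    # Counter keeps first-seen order, so results follow the file order.
    return [(n, text) for text, n in counts.items() if n > 1]


if __name__ == "__main__":
    import sys

    # Usage: python thisscript.py yourfile.txt
    for count, text in duplicate_counts(sys.argv[1]):
        print(count, text)
```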
Re: Search a text file for duplicate text
NotNull wrote: 2025-03-03, 22:48 UTC
"(don't know if this works in TC as it uses some quirky regex-dialect.)"
Like I said, regular expression search in TC is within each line. As noted in the help:
The other modificators are not relevant for Total Commander, because the program only supports searching within one line.
Re: Search a text file for duplicate text
NotNull wrote: 2025-03-03, 22:48 UTC
"Had to read it 5 times, but I think this is the layout:"
Code: Select all
abc
def
ghi
[...]
Ah, that would indeed fit the description.
"They did implement it (IIRC in Win7), but didn't bother to document it:"
Code: Select all
sort /unique somefile.txt
It looks like there's a lot MS doesn't document. This switch doesn't work on Win7 and 8.1 (prints "Invalid Option"), but it does work on Win10. Sort's documentation mentions a lot of abbreviations for its options (/L = /Locale, /R = /Reverse, etc.), but /U is not the same as /unique. The (full) documentation on SS64.com says that /u prints Unicode characters and /uniq is the same as /unique. Man, this software world is getting weirder every day...
Thanks anyway for mentioning this switch.

#101164 Personal licence
Ryzen 5 2600, 16 GiB RAM, ASUS Prime X370-A, Win7 x64
Plugins: Services2, Startups, CertificateInfo, SignatureInfo, LineBreakInfo - Download-Mirror
Re: Search a text file for duplicate text
Thank you all for your help. I think I am going to start with Notepad++ and try the merge.
I have never used the sort command either. I'm sure one of these will get rid of all the duplicate lines. Thanks again.