Bash: Remove unique and keep duplicate

I have a large file with 100k lines and about 22 columns. I would like to remove all lines in which the content in column 15 appears only once. As far as I understand, it's the reverse of
sort -u file.txt
After the lines that are unique in column 15 are removed, I would like to shuffle all lines again so nothing is sorted. For this I would use
shuf file.txt
The resulting file should include only lines that have at least one duplicate (in column 15), but in a random order.
I have tried to work around sort -u, but it only sorts out the unique lines and discards the actual duplicates I need. However, not only do I need the unique lines removed, I also want to keep every line of a duplicate, not just one representative per duplicate.
Thank you.
Bash has no built-in capabilities for sorting. The sort command is provided by your operating system and differs from system to system. Check man sort on your system to see what options are available. And for your particular problem, consider using a more advanced tool like awk or perl to handle complexities like splitting content by "column". – ghoti, yesterday
Redirect the results of your unique sort on col 15 to a temp file, then grep -vf temp original to remove the unique lines from the original file. Check if your sort supports --key=KEYDEF and create a KEYDEF to sort on col 15. – David C. Rankin, yesterday
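One way to sketch that idea, assuming tab-separated fields (temp and filtered.txt are placeholder names); note that grep -vFf matches the listed values anywhere on a line, not only in column 15:
awk -F'\t' '{print $15}' file.txt | sort | uniq -u > temp   # column-15 values that occur exactly once
grep -vFf temp file.txt > filtered.txt                      # drop lines containing any of those values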
You can use uniq -d to get all the duplicated values in a sorted input stream. – Barmar, yesterday
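A minimal illustration:
printf 'a\nb\nb\nc\n' | sort | uniq -d    # prints: b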
1 Answer
Use uniq -d to get a list of all the duplicate values, then filter the file so only those lines are included.
awk -F'\t' 'NR==FNR { dup[$0]; next; }
$15 in dup' <(awk -F'\t' '{print $15}' file.txt | sort | uniq -d) file.txt > newfile.txt
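Note that <( … ) is bash process substitution, so this needs to run under bash (or zsh); a plain POSIX sh will not accept the syntax.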
The inner command,
awk -F'\t' '{print $15}' file.txt | sort | uniq -d
returns a list of all the duplicated values in column 15.
The NR==FNR condition in the first line of the awk script is true only while the first input (the duplicate list) is being read; it loads those values as the keys of the dup associative array.
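NR==FNR is a common two-file awk idiom; as a minimal sketch (list.txt and data.txt are hypothetical names):
awk 'NR==FNR { seen[$0]; next } $0 in seen' list.txt data.txt   # print the lines of data.txt that also appear in list.txt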
The second line processes file.txt and prints any line where column 15 is in the array.
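To get the randomized order the question asks for, shuffle the filtered result as a final step (final.txt is a placeholder name):
shuf newfile.txt > final.txt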
Thank you! It seems to work, however one thing isn't making sense to me. Just to check if all unique strings were removed, I ran sort -t$'\t' -k15 -u file.txt > uniq, and when I check the number of lines it gives me about 1300 more lines that are still included when I run your neat command. – Vaxin, yesterday
-k15 should be -k15,15. Otherwise it means the key is all fields from 15 to the end. – Barmar, yesterday
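With that fix, the check from the earlier comment becomes:
sort -t$'\t' -k15,15 -u file.txt > uniq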
Okay! So now there are still 600 more lines sorted out as unique when I run the sort command compared to your command. Would you have an idea why? – Vaxin, yesterday
Can you post some sample input that produces the problem? – Barmar, yesterday
Are there spaces in some of the fields? awk uses any whitespace as the field delimiter by default; use -F'\t' if it should only use tab. – Barmar, yesterday
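A quick illustration of the difference:
printf 'a b\tc\n' | awk '{print NF}'          # 3 fields: default splitting treats the space and the tab alike
printf 'a b\tc\n' | awk -F'\t' '{print NF}'   # 2 fields: only the tab separates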
superuser.com/a/1107659 – Dennis, yesterday