Bash: Remove unique lines and keep duplicates






I have a large file with 100k lines and about 22 columns. I would like to remove all lines in which the content in column 15 appears only once. So as far as I understand, it's the reverse of


sort -u file.txt



After the lines that are unique in column 15 are removed, I would like to shuffle all lines again, so nothing is sorted. For this I would use


shuf file.txt



The resulting file should include only lines that have at least one duplicate (in column 15), in a random order.



I have tried to work around sort -u, but it discards the duplicate copies I actually need. Not only do I need the unique lines removed; I also want to keep every line of a duplicate set, not just one representative per set.
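For illustration, on a one-column input sort -u collapses each duplicated value to a single representative and keeps the unique value, which is the opposite of what is wanted here (a minimal sketch):

printf 'a\na\nb\n' | sort -u   # prints "a" and "b": the duplicate "a" is collapsed, the unique "b" survives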



Thank you.





superuser.com/a/1107659
– Dennis
yesterday





Bash has no built-in capabilities for sort. The sort command is provided by your operating system, and differs from system to system. Check man sort on your system to see what options are available. And for your particular problem, consider using a more advanced tool like awk or perl to handle complexities like splitting content by "column".
– ghoti
yesterday
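For instance, a two-pass awk sketch along those lines (assuming the fields are tab-separated) keeps only lines whose column-15 value occurs more than once:

awk -F'\t' 'NR==FNR { count[$15]++; next } count[$15] > 1' file.txt file.txt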







Redirect the results of your unique sort on col 15 to a temp file, then grep -vf temp original to remove the unique lines from the original file. Check if your sort supports --key=KEYDEF and create a KEYDEF to sort on col 15.
– David C. Rankin
yesterday
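With GNU sort, for example, a KEYDEF restricted to column 15 of a tab-delimited file would look like this (a sketch based on the file name from the question):

sort -t$'\t' -k15,15 file.txt > sorted_by_col15.txt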









You can use uniq -d to get all the duplicated values in a sorted input stream.
– Barmar
yesterday
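For example, pulling out column 15 first (a sketch assuming tab-separated columns; cut uses tab as its default delimiter):

cut -f15 file.txt | sort | uniq -d   # one line per column-15 value that occurs more than once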








1 Answer



Use uniq -d to get a list of all the duplicate values, then filter the file so only those lines are included.




awk -F'\t' 'NR==FNR { dup[$0]; next; }
$15 in dup' <(awk -F'\t' '{print $15}' file.txt | sort | uniq -d) file.txt > newfile.txt



awk -F'\t' '{print $15}' file.txt | sort | uniq -d returns a list of all the duplicate values in column 15.





The NR==FNR block in the first awk script loads this list into an associative array.
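The idiom works because FNR resets to 1 for each input file while NR keeps counting across files, so NR==FNR is true only while the first input is being read. A minimal sketch with hypothetical files a.txt and b.txt:

awk 'NR==FNR { seen[$0]; next } $0 in seen' a.txt b.txt   # prints lines of b.txt that also appear in a.txt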





The second line processes file.txt and prints any lines where column 15 is in the array.
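Since the question also wants the result shuffled, the filtered file can then be randomized (assuming GNU coreutils' shuf, as used in the question):

shuf newfile.txt > shuffled.txt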







Thank you! It seems to work; however, one thing isn't making sense to me. Just to check that all unique strings were removed, I ran sort -t$'\t' -k15 -u file.txt > uniq, and when I count the lines it includes about 1300 more lines than your neat command produces.
– Vaxin
yesterday







-k15 should be -k15,15. Otherwise it means the key is all fields from 15 to the end.
– Barmar
yesterday
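A small demonstration of the difference, using column 2 of a three-column input for brevity (a sketch assuming GNU sort):

printf 'a\t1\tp\nb\t1\tq\n' | sort -t$'\t' -k2 -u     # keeps both lines: the key spans fields 2 to the end
printf 'a\t1\tp\nb\t1\tq\n' | sort -t$'\t' -k2,2 -u   # keeps one line: the key is field 2 only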







Okay! So now about 600 more lines are still filtered out as unique when I run the sort command compared to your command. Would you have an idea why?
– Vaxin
yesterday





Can you post some sample input that produces the problem?
– Barmar
yesterday





Are there spaces in some of the fields? awk uses any whitespace as the field delimiter by default; use -F'\t' if it should only split on tabs.
– Barmar
yesterday
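For illustration, a space inside a field changes the field count under the default splitting (a minimal sketch):

printf 'a b\tc\n' | awk '{print NF}'          # 3: the default FS splits on spaces and tabs
printf 'a b\tc\n' | awk -F'\t' '{print NF}'   # 2: only the tab separates fields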









