Bash: Remove unique and keep duplicate

I have a large file with 100k lines and about 22 columns. I would like to remove all lines in which the content in column 15 appears only once. As far as I understand, it's the reverse of
sort -u file.txt
After the lines that are unique in column 15 are removed, I would like to shuffle all lines again so nothing is sorted. For this I would use
shuf file.txt
The resulting file should include only lines that have at least one duplicate (in column 15), but in a random order.
I have tried to work around sort -u, but it only sorts out the unique lines and discards the actual duplicates I need. However, not only do I need the unique lines removed, I also want to keep every line of a duplicate, not just one representative per duplicate.
Thank you.
Bash has no built-in capabilities for sorting. The sort command is provided by your operating system and differs from system to system. Check man sort on your system to see what options are available. And for your particular problem, consider using a more advanced tool like awk or perl to handle complexities like splitting content by "column". – ghoti, yesterday
Redirect the results of your unique sort on col 15 to a temp file, then grep -vf temp original to remove the unique lines from the original file. Check if your sort supports --key=KEYDEF and create a KEYDEF to sort on col 15. – David C. Rankin, yesterday
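One way to sketch that idea, assuming tab-separated fields (temp and filtered.txt are placeholder names); note that grep -vFf matches the listed values anywhere on a line, not only in column 15:
awk -F'\t' '{print $15}' file.txt | sort | uniq -u > temp   # column-15 values that occur exactly once
grep -vFf temp file.txt > filtered.txt                      # drop lines containing any of those values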
You can use uniq -d to get all the duplicated values in a sorted input stream. – Barmar, yesterday
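A minimal illustration:
printf 'a\nb\nb\nc\n' | sort | uniq -d    # prints: b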
1 Answer
Use uniq -d to get a list of all the duplicate values, then filter the file so only those lines are included.
awk -F'\t' 'NR==FNR { dup[$0]; next; }
$15 in dup' <(awk -F'\t' '{print $15}' file.txt | sort | uniq -d) file.txt > newfile.txt
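Note that <( … ) is bash process substitution, so this needs to run under bash (or zsh); a plain POSIX sh will not accept the syntax.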
The inner command,
awk -F'\t' '{print $15}' file.txt | sort | uniq -d
returns a list of all the duplicated values in column 15.
The NR==FNR condition in the first line of the awk script is true only while the first input (the duplicate list) is being read; it loads those values as the keys of the dup associative array.
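NR==FNR is a common two-file awk idiom; as a minimal sketch (list.txt and data.txt are hypothetical names):
awk 'NR==FNR { seen[$0]; next } $0 in seen' list.txt data.txt   # print the lines of data.txt that also appear in list.txt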
The second line processes file.txt and prints any line where column 15 is in the array.
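To get the randomized order the question asks for, shuffle the filtered result as a final step (final.txt is a placeholder name):
shuf newfile.txt > final.txt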
Thank you! It seems to work, however one thing isn't making sense to me. Just to check if all unique strings were removed, I ran sort -t$'\t' -k15 -u file.txt > uniq, and when I check the number of lines it gives me about 1300 more lines that are still included when I run your neat command. – Vaxin, yesterday
-k15 should be -k15,15. Otherwise it means the key is all fields from 15 to the end. – Barmar, yesterday
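With that fix, the check from the earlier comment becomes:
sort -t$'\t' -k15,15 -u file.txt > uniq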
Okay! So now there are still 600 more lines sorted out as unique when I run the sort command compared to your command. Would you have an idea why? – Vaxin, yesterday
Can you post some sample input that produces the problem? – Barmar, yesterday
Are there spaces in some of the fields? awk uses any whitespace as the field delimiter by default; use -F'\t' if it should only use tab. – Barmar, yesterday
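A quick illustration of the difference:
printf 'a b\tc\n' | awk '{print NF}'          # 3 fields: default splitting treats the space and the tab alike
printf 'a b\tc\n' | awk -F'\t' '{print NF}'   # 2 fields: only the tab separates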
superuser.com/a/1107659 – Dennis, yesterday