Mysql Matching “Same” Emails

Multi tool use
Multi tool use
The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP


Mysql Matching “Same” Emails



I have a table with 2 columns email and id. I need to find emails that are closely related. For example:


email


id


john.smith12@example.com



and


john.smith12@some.subdomains.example.com



These should be considered the same because the username (john.smith12) and the most top level domain (example.com) are the same. They are currently 2 different rows in my table. I've written the below expression which should do that comparison but it takes hours to execute (possibly/probably because of regex). Is there a better way to write this:


john.smith12


example.com


select c1.email, c2.email
from table as c1
join table as c2
on (
c1.leadid <> c2.leadid
and
c1.email regexp replace(replace(c2.email, '.', '[.]'), '@', '@[^@]*'))



The explain of this query comes back as:


id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, ALL, NULL, NULL, NULL, NULL, 577532, NULL
1, SIMPLE, c2, ALL, NULL, NULL, NULL, NULL, 577532, Using where; Using join buffer (Block Nested Loop)



The create table is:


CREATE TABLE `table` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Email` varchar(100) DEFAULT NULL,
KEY `Table_Email` (`Email`),
KEY `Email` (`Email`)
) ENGINE=InnoDB AUTO_INCREMENT=667020 DEFAULT CHARSET=latin1



I guess the indices aren't being used because of the regexp.



The regex comes out as:


john[.]smith12@[^@]*example[.]com



which should match both addresses.



Update:



I've modified the on to be:


on


on (c1.email <> '' and c2.email <> '' and c1.leadid <> c2.leadid and substr(c1. email, 1, (locate('@', c1.email) -1)) = substr(c2. email, 1, (locate('@', c2.email) -1))
and
substr(c1.email, locate('@', c1.email) + 1) like concat('%', substr(c2.email, locate('@', c2.email) + 1)))



and the explain with this approach is at least using the indices.


explain


id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, c1, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index
1, SIMPLE, c2, range, table_Email,Email, table_Email, 103, NULL, 288873, Using where; Using index; Using join buffer (Block Nested Loop)



So far this has executed for 5 minutes, will update if there is a vast improvement.





Related: stackoverflow.com/questions/12318083/…
– Barmar
yesterday





@Barmar That makes sense, I've updated to a non-regex but not seeing an improvement yet.
– user3783243
yesterday





% at the beginning of a LIKE pattern prevents it from using an index. You want the pattern to be john.smith@%
– Barmar
yesterday


%


LIKE


john.smith@%





Maybe you could use a generated column that holds a canonical version of the email.
– Barmar
yesterday





Yes, something like WHERE c1.email LIKE CONCAT(SUBSTR(c2.email, 1, POSITION(c2.email, '@')), '%') AND ...
– Barmar
yesterday




WHERE c1.email LIKE CONCAT(SUBSTR(c2.email, 1, POSITION(c2.email, '@')), '%') AND ...




1 Answer
1



No REGEXP_REPLACE needed, so it will work in all versions of MySQL/MariaDB:


REGEXP_REPLACE


UPDATE tbl
SET email = CONCAT(SUBSTRING_INDEX(email, '@', 1),
'@',
SUBSTRING_INDEX(
SUBSTRING_INDEX(email, '@', -1),
'.',
-2);



Since no index is useful, you may as well not bother with a WHERE clause.


WHERE





I thought the index would only be ignored if there were wildcards used on both sides of the expression. I thought with one % either left or right, it could still use an index. Is that incorrect? I can't make an edit but there a missing ) and it'll be -3 so we account for second level TLDs, like UK. CONCAT(SUBSTRING_INDEX(email, '@', 1), '@', SUBSTRING_INDEX( SUBSTRING_INDEX(email, '@', -1), '.', -3));
– user3783243
yesterday


%


)


-3


CONCAT(SUBSTRING_INDEX(email, '@', 1), '@', SUBSTRING_INDEX( SUBSTRING_INDEX(email, '@', -1), '.', -3));






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

nVm73ipiAEhLhOwkefTJGg,Due,wd qSG4J,g2,N xj joCFTOWRoYyEL4suI ewjQD2lTcRwMW,nhXlzB7 B
RMDcn5SGkPzzKXj,7w5Ca4ZbGyq7JL 0uEPppqhH83s7,YT nOF gd gC67bVWim9OB9zq2eszxiytizZ7B7,U0PfbL 5AF KygMbG3X

Popular posts from this blog

Keycloak server returning user_not_found error when user is already imported with LDAP

PHP parse/syntax errors; and how to solve them?

415 Unsupported Media Type while sending json file over REST Template