Fuzzy address matching in R
Yeah, it's been asked before, but I can't find a thread that provides a simple, clean answer to this question.
I have example data below with two columns: col1 is the current address, and col2 is an address I am told is 'better' than the current one. I need to see how much 'better' the second column is than the first. Most of the time, the second is better because it contains secondary information that the first is lacking, such as an apartment number.
test <- as.data.frame(matrix(c(
"742 Evergreen Terrace" , "742 Evergreen Terrace Apt 3" ,
"31 Spooner Street #42" , "31 Spooner Street",
"129 W 81st Street" , "129 W 81st Street Apt 5A" ,
"245 E 73rd Street", "245 E 73rd Street Apt 6") , ncol=2, byrow=TRUE,
dimnames=list(NULL, c("old_addr" , "new_addr"))) ,stringsAsFactors=FALSE)
There is an answer I found here that gets close to what I would like:
Fuzzy match row in one column with same row in next column
I need to create a third column that is a simple 1/0 variable: 1 if the pair is an approximate match, and 0 if not. I need to be able to specify the threshold for approximate matching.
For my first example - 742 Evergreen Terrace vs 742 Evergreen Terrace Apt 3, the length differs by six. I need to be able to specify a length difference of six, or eight, or whatever.
I looked at agrep, but I need to compare two columns of data within the same row, and it does not allow for that. I have also tried lapply, but its results make me think it is cycling through all the data in the entire column, whereas I need row-by-row comparisons. I also do not understand max.distance: with the code below and a max of 1 (which I understand to mean that one unit of edit or change is allowed), I would expect it to fail for most rows, but it only does in one case.
agrep(test$old_addr, test$new_addr, max.distance = 0.1, ignore.case = TRUE)
test$fuzz_match <- lapply(test$old_addr , agrep , x =
test$new_addr , max.distance = 1 , ignore.case = TRUE)
Any help is appreciated, thank you!
1 Answer
You can calculate the Levenshtein distance between each pair of addresses. What you then need to decide is how large the distance must be before the two are no longer treated as the same address.
test$lev_dist <- mapply(adist, test$old_addr, test$new_addr)
test$same_addr <- test$lev_dist < 5
test
#                old_addr                    new_addr lev_dist same_addr
# 1 742 Evergreen Terrace 742 Evergreen Terrace Apt 3        6     FALSE
# 2 31 Spooner Street #42           31 Spooner Street        4      TRUE
# 3     129 W 81st Street    129 W 81st Street Apt 5A        7     FALSE
# 4     245 E 73rd Street     245 E 73rd Street Apt 6        6     FALSE
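To get the 1/0 column with an adjustable cutoff that the question asks for, the same idea can be wrapped in a small function. This is only a minimal sketch: `fuzzy_flag` and its `max_dist` and `ignore_case` arguments are hypothetical names introduced here, and the default of 6 is just the length difference mentioned in the question, not a recommended value.

```r
# Hypothetical helper (not from the original answer): flag row-wise
# approximate matches as 1/0. `max_dist` is the largest edit distance
# that still counts as a match.
fuzzy_flag <- function(old, new, max_dist = 6, ignore_case = TRUE) {
  # mapply() calls adist() once per row, so only old[i] and new[i]
  # are ever compared with each other
  d <- mapply(adist, old, new,
              MoreArgs = list(ignore.case = ignore_case),
              USE.NAMES = FALSE)
  as.integer(d <= max_dist)
}

# Same example data as in the question
test <- data.frame(
  old_addr = c("742 Evergreen Terrace", "31 Spooner Street #42",
               "129 W 81st Street", "245 E 73rd Street"),
  new_addr = c("742 Evergreen Terrace Apt 3", "31 Spooner Street",
               "129 W 81st Street Apt 5A", "245 E 73rd Street Apt 6"),
  stringsAsFactors = FALSE
)

test$fuzz_match <- fuzzy_flag(test$old_addr, test$new_addr, max_dist = 6)
test$fuzz_match
# [1] 1 1 0 1
```

With `max_dist = 6`, only the third row (distance 7) is flagged as a non-match; raising the cutoff to 8 would flag all four rows as matches.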
Of course, you need to determine the threshold yourself, and a distance measure other than Levenshtein might be a better fit.
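As one possible alternative measure, the edit distance can be scaled by the length of the longer string, so that the cutoff becomes a proportion rather than a raw character count. This is a sketch under assumptions: the variable names and the 0.25 cutoff are illustrative choices, not something the original answer prescribes.

```r
# Example addresses from the question
old <- c("742 Evergreen Terrace", "31 Spooner Street #42",
         "129 W 81st Street", "245 E 73rd Street")
new <- c("742 Evergreen Terrace Apt 3", "31 Spooner Street",
         "129 W 81st Street Apt 5A", "245 E 73rd Street Apt 6")

# Row-wise Levenshtein distance
lev <- mapply(adist, old, new, USE.NAMES = FALSE)

# Relative distance: edits divided by the length of the longer string
rel <- lev / pmax(nchar(old), nchar(new))
round(rel, 2)
# [1] 0.22 0.19 0.29 0.26

# 1 if at most 25% of the longer string differs (assumed cutoff), else 0
rel_match <- as.integer(rel <= 0.25)
rel_match
# [1] 1 1 0 0
```

A relative cutoff treats a six-character suffix on a long address as a smaller change than the same suffix on a short one, which may or may not be what you want for address data.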