Fuzzy matching with custom functions
The problem at hand
Matching multiple datasets on one or more columns is a very common task in data science. Usually, we are lucky enough to have a shared unique identifier between the datasets.
One scenario in which I never come across perfect identifiers is in faculty work, primarily datasets with student grades. These datasets have columns like name, email and student_id, which may serve as unique identifiers, but sometimes they are not unique (e.g. two students with the same name) and sometimes they fail as identifiers because of human error (e.g. typos).
A simple solution to this problem is to do a dplyr::full_join() on the best unique identifier column, hope for the best, and then sort the unmatched rows by hand. This is cumbersome and does not allow for smart automation (i.e. every time you have to do this task it will take an arbitrary amount of time, depending on the dataset).
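As a minimal sketch, the naive approach (using the example data frames df_x and df_y defined below) is just:

# Naive approach: an exact full join on the name columns
# Only byte-identical names match; every typo or variant ends up unmatched
naive <- dplyr::full_join(df_x, df_y, by = c("name_x" = "name_y"))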
Fortunately, we can use fuzzy matching to address human errors and match on multiple identifiers, making this workflow as robust as possible with custom functions tailored to the specifics of each identifier. Fuzzy matching relies on string distance metrics to decide whether two values should be matched or not. Using multiple functions to arrive at a better match was inspired by the Intermediate Regular Expressions in R course on DataCamp.
String distance metrics
There are many flavors of string metrics. We generally distinguish between edit-based distances, q-gram-based distances, heuristic distances, kernel-based distances, and distances based on phonetic algorithms.
Phonetic algorithms
Phonetic algorithms aim to match strings that sound similar when pronounced by transforming each string into a phonetic code that represents its pronunciation features. The most notable are Soundex and the powerful Double Metaphone indexing system that replaced it and its successors.
These algorithms are implemented in the {phonics} package and probably several others. Unfortunately, the original algorithms were developed for standard English phonetics and are defined only for inputs over the standard English alphabet, i.e. A-Z.
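As a brief illustration, a minimal sketch using the soundex() and metaphone() functions from {phonics} (the exact codes produced depend on the algorithm variant, so they are not shown here):

library(phonics)

# Phonetically similar names map to similar codes; equal codes suggest a match
soundex(c("Klaus", "Claus")) # Soundex keeps the first letter, so these codes differ
metaphone(c("Werner", "Verner"))
# Note: only defined for A-Z input, so names like "José" need transliteration first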
q-gram-based distances
Distances based on q-grams are computed by first deriving an integer vector from each string and then defining a distance measure on the space of those vectors. These distances are very useful for long text strings with large edit distances (e.g. swapped sentences or paragraphs), where edit-based distances become impractical. This is not the case with names and other short personal identifiers.
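For completeness, a minimal sketch of two q-gram-based methods available in {stringdist} (method = "qgram" counts unshared q-grams; method = "cosine" compares q-gram count vectors):

library(stringdist)

# Number of 2-grams not shared between the two strings
stringdist("John Bowlby", "Jonh Balby", method = "qgram", q = 2)
# Cosine distance between the 2-gram count vectors (0 = identical profiles)
stringdist("John Bowlby", "Jonh Balby", method = "cosine", q = 2)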
Notable edit-based distances
The Levenshtein distance is defined as the minimum number of edit operations required to transform string a into string b, where the edit operations are:

- delete
- insert
- substitute
- copy
The optimal string alignment distance, which is also referred to as the restricted Damerau-Levenshtein distance, allows for all of these edits and limited transpositions of adjacent characters.
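The difference is easy to see on a single transposition, which plain Levenshtein has to express as two substitutions; a minimal sketch with {stringdist}:

library(stringdist)

# "John" vs "Jonh": the trailing "hn"/"nh" is one adjacent transposition
stringdist("John", "Jonh", method = "lv")  # Levenshtein: 2 (two substitutions)
stringdist("John", "Jonh", method = "osa") # optimal string alignment: 1
stringdist("John", "Jonh", method = "dl")  # full Damerau-Levenshtein: 1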
Heuristic metrics
The Jaro distance was originally developed at the U.S. Bureau of the Census as part of an application called UNIMATCH. The purpose of this program was to match records based on inexact text fields, mostly name and address data.
The Jaro-Winkler similarity measure is a variant of the Jaro metric, motivated by empirical studies showing that fewer errors typically occur at the beginning of names.
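In {stringdist}, method = "jw" with the default p = 0 gives the plain Jaro distance; setting the prefix-scaling penalty p (at most 0.25, commonly 0.1) turns it into Jaro-Winkler. A minimal sketch:

library(stringdist)

# Plain Jaro distance
stringdist("Martha", "Marhta", method = "jw")
# Jaro-Winkler: the shared prefix "Mar" pulls the distance further down
stringdist("Martha", "Marhta", method = "jw", p = 0.1)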
Picking a metric
For our string-matching purposes (matching proper nouns or e-mail addresses) we will only consider edit-based distances and heuristic metrics here, although phonetic and q-gram-based distances may also be well suited to these data. More specifically, we will end up using the Jaro-Winkler distance to match names.
Example data
The hypothetical data below was inspired by this Python blog post.
Challenge | x | y |
---|---|---|
Phonetic Similarity | Klaus Werner | Claus Verner |
Missing Spaces & Hyphens | Peter Lower | PeterLower |
Missing Spaces & Hyphens | Anne Meier-King | Anne Meier King |
Missing Components | Peter James Low | Peter Low |
Abbreviations | Jane Mona-King | Jane M. King |
Out of Order Components | Joe John Low | Low John Joe |
Nicknames | Wiiliam James | Bill James |
Multiple Languages | José Müller | Jose Muller |
Initials | James Earl Smith | J.E. Smith |
Typos | John Bowlby | Jonh Balby |
Not unique 1 | John James | John James |
Not unique 2 & typos | John James | John Jams |
The second column I usually have at my disposal to match on is either a student ID or an institutional email. Both work in virtually the same way, and we can use the same string distance metrics, even though emails consist mostly of letters while IDs consist mostly of digits. Typos are the most frequent challenge with these data.
Let’s take a look at some sample data. I have kept the name challenges from above and added an ID column with some errors.
df_x <-
  tibble::tribble(
    ~name_x, ~id_x,
    "Klaus Werner", 11991L,
    "Peter Lower", 12621L,
    "Anne Meier-King", 22517L,
    "Peter James Low", 11212L,
    "Jane Mona-King", 21930L,
    "Joe John Low", 22860L,
    "Wiiliam James", 22872L,
    "José Müller", 21259L,
    "James Earl Smith", 21314L,
    "John Bowlby", 12481L,
    "John James", 12298L,
    "John James", 11024L
  )
df_y <-
  tibble::tribble(
    ~name_y, ~id_y,
    "Claus Verner", 11199L,
    "PeterLower", 13621L,
    "Anne Meier King", 22517L,
    "Peter Low", 112121L,
    "Jane M. King", 21930L,
    "Low John Joe", 2286L,
    "Bill James", 22872L,
    "Jose Muller", 22259L,
    "J.E. Smith", 22314L,
    "Jonh Balby", 12181L,
    "John James", 12288L,
    "John Jams", 91024L
  )
Fuzzy matching on a single column
# Load packages
library(stringdist)
library(fuzzyjoin)
library(dplyr)
Here we will join the data frames on a maximum string distance of 2 using Optimal String Alignment (these are the defaults for {fuzzyjoin} functions like stringdist_join()).
When working with text data we usually do some preprocessing, like transforming all characters to lower case, excluding special symbols, stripping white space, and so on. Here we don’t need any preprocessing; we can just set ignore_case = TRUE in the {fuzzyjoin} join function.
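(For reference, if preprocessing were needed, a minimal base-R sketch could look like this; the helper name is hypothetical.)

# Hypothetical normalizer: lower-case, drop special symbols, squeeze white space
normalize_string <- function(x) {
  x <- tolower(x)
  x <- gsub("[^[:alnum:] ]", " ", x)
  trimws(gsub("\\s+", " ", x))
}
normalize_string(" Anne Meier-King ") # "anne meier king"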
## Joining only on names
joined <- fuzzyjoin::stringdist_full_join(
  df_x,
  df_y,
  by = c("name_x" = "name_y"),
  max_dist = 2, # default value
  method = "osa", # default method
  distance_col = "distance",
  ignore_case = TRUE
)
# Print the number of rows of the newly created data frame
print(glue::glue(
  "joining two dataframes with {x} and {y} rows resulted in a dataframe with {z} rows",
  x = nrow(df_x),
  y = nrow(df_y),
  z = nrow(joined)
))
## joining two dataframes with 12 and 12 rows resulted in a dataframe with 20 rows
joined
## # A tibble: 20 x 5
## name_x id_x name_y id_y distance
## <chr> <int> <chr> <int> <dbl>
## 1 Klaus Werner 11991 Claus Verner 11199 2
## 2 Peter Lower 12621 PeterLower 13621 1
## 3 Peter Lower 12621 Peter Low 112121 2
## 4 Anne Meier-King 22517 Anne Meier King 22517 1
## 5 José Müller 21259 Jose Muller 22259 2
## 6 John James 12298 John James 12288 0
## 7 John James 12298 John Jams 91024 1
## 8 John James 11024 John James 12288 0
## 9 John James 11024 John Jams 91024 1
## 10 Peter James Low 11212 <NA> NA NA
## 11 Jane Mona-King 21930 <NA> NA NA
## 12 Joe John Low 22860 <NA> NA NA
## 13 Wiiliam James 22872 <NA> NA NA
## 14 James Earl Smith 21314 <NA> NA NA
## 15 John Bowlby 12481 <NA> NA NA
## 16 <NA> NA Jane M. King 21930 NA
## 17 <NA> NA Low John Joe 2286 NA
## 18 <NA> NA Bill James 22872 NA
## 19 <NA> NA J.E. Smith 22314 NA
## 20 <NA> NA Jonh Balby 12181 NA
We can see that this very permissive procedure made some incorrect matches and also failed to match some cases. We could change max_dist to another value to try to optimize the sensitivity/specificity trade-off, but we probably won’t get very far with these data.
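One quick way to inspect this trade-off is to sweep max_dist over a few values and count the matched pairs; a minimal sketch using an inner join (so only matches are counted):

# Number of matched name pairs for increasing distance thresholds
sapply(0:4, function(d) {
  nrow(fuzzyjoin::stringdist_inner_join(
    df_x, df_y,
    by = c("name_x" = "name_y"),
    max_dist = d,
    method = "osa",
    ignore_case = TRUE
  ))
})
# Larger thresholds match more pairs, but increasingly also false positives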
Finding matches based on two conditions
We will now try a more sophisticated approach: creating two custom helper functions that match entries that are similar or identical. We will use the Jaro-Winkler distance to match names and the Damerau-Levenshtein distance to match IDs.
For more information on the distance metrics available in the {fuzzyjoin} and {stringdist} packages:
# Read more about methods
?stringdist::`stringdist-metrics`
# Helper function 1 - calculate the string distance for names with the JW method
is_name_distance_below_threshold <- function(left, right) {
  stringdist::stringdist(left, right, method = "jw") < 1 # match if distance is below 1
}
# Helper function 2 - calculate the string distance for IDs with the DL method
is_id_distance_below_threshold <- function(left, right) {
  stringdist::stringdist(left, right, method = "dl") < 2 # match if distance is below 2
}
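# Quick sanity checks before plugging the helpers into the join
# (stringdist() coerces its inputs with as.character(), so integer IDs work too)
is_name_distance_below_threshold("José Müller", "Jose Muller") # TRUE
is_id_distance_below_threshold(21259L, 22259L) # TRUE: one substitution
is_id_distance_below_threshold(11991L, 11199L) # FALSE: at least two edits apart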
# Join by name and id with our two helper functions
joined <- fuzzyjoin::fuzzy_full_join(
  df_x, df_y,
  by = list(x = c("name_x", "id_x"), y = c("name_y", "id_y")),
  match_fun = list(is_name_distance_below_threshold, is_id_distance_below_threshold)
)
# Print the number of rows of the newly created data frame
print(glue::glue(
  "joining two dataframes with {x} and {y} rows resulted in a dataframe with {z} rows",
  x = nrow(df_x),
  y = nrow(df_y),
  z = nrow(joined)
))
## joining two dataframes with 12 and 12 rows resulted in a dataframe with 13 rows
joined
## # A tibble: 13 x 4
## name_x id_x name_y id_y
## <chr> <int> <chr> <int>
## 1 Peter Lower 12621 PeterLower 13621
## 2 Anne Meier-King 22517 Anne Meier King 22517
## 3 Peter James Low 11212 Peter Low 112121
## 4 Jane Mona-King 21930 Jane M. King 21930
## 5 Joe John Low 22860 Low John Joe 2286
## 6 Wiiliam James 22872 Bill James 22872
## 7 José Müller 21259 Jose Muller 22259
## 8 James Earl Smith 21314 J.E. Smith 22314
## 9 John Bowlby 12481 Jonh Balby 12181
## 10 John James 12298 John James 12288
## 11 John James 11024 John Jams 91024
## 12 Klaus Werner 11991 <NA> NA
## 13 <NA> NA Claus Verner 11199
Interestingly, only the phonetically similar name with a very different ID (Klaus Werner vs. Claus Verner) remained unmatched.
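From here, a natural last step is to pull out the remaining unmatched rows for manual review; a minimal sketch with {dplyr}:

# Rows that still need a human eye
joined %>% filter(is.na(name_y)) # left-side rows with no fuzzy match
joined %>% filter(is.na(name_x)) # right-side rows with no fuzzy match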